The PyTorch Lightning documentation mentions that Lightning automatically handles multi-node training. However, when I run the same script on a single-node 2-GPU machine and on a multi-node 4-GPU cluster, training on the single machine is about 2x faster than on the multi-node cluster. Specifically, the number of steps per epoch differs between the two environments: for the same dataset, the single-node 2-GPU machine runs 6616 training steps per epoch, while the multi-node 4-GPU cluster runs 13232.
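
For reference, here is a minimal sketch of the two setups being compared. My actual script isn't shown here, so the model, dataset, batch size, and the exact `Trainer` arguments (`devices`, `num_nodes`, `strategy`) below are placeholders, not my real configuration:

```python
import torch
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset

# Placeholder LightningModule standing in for the real model.
class LitModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters())

# Placeholder dataset; the real one is the same in both environments.
train_loader = DataLoader(
    TensorDataset(torch.randn(1000, 32), torch.randn(1000, 1)),
    batch_size=8,
)

# Environment A: single node, 2 GPUs.
trainer = pl.Trainer(accelerator="gpu", devices=2, num_nodes=1, strategy="ddp")

# Environment B: 2 nodes x 2 GPUs = 4 GPUs total.
# trainer = pl.Trainer(accelerator="gpu", devices=2, num_nodes=2, strategy="ddp")

trainer.fit(LitModel(), train_loader)
```

In both cases I launch the same script and let Lightning's DDP strategy shard the data across processes; the only intended difference is the number of nodes/GPUs.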