Effective learning rate and batch size with Lightning in DDP

Only by doing this is switching backends actually equivalent, and the official MoCo code follows this convention.

I agree with this completely. If you want to keep the same effective batch size across backends, you need to set batch_size = batch_size / n_gpus when running DDP, since each process loads its own batch.
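A minimal sketch of that division, assuming you start from a target global (effective) batch size; the helper name is hypothetical, not a Lightning API:

```python
def per_device_batch_size(global_batch_size: int, n_gpus: int) -> int:
    """Split a target effective batch size across DDP processes.

    Under DDP each process runs its own DataLoader, so the effective
    batch per optimizer step is (per-device batch) * n_gpus. To keep
    the effective batch size constant when changing n_gpus, divide.
    """
    if global_batch_size % n_gpus != 0:
        raise ValueError(
            f"global batch size {global_batch_size} is not divisible "
            f"by {n_gpus} GPUs"
        )
    return global_batch_size // n_gpus


# e.g. a target effective batch of 256 on 8 GPUs -> 32 per device
print(per_device_batch_size(256, 8))
```

The per-device value is what you pass to the DataLoader; the product with the world size is what matters for optimization.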

With regard to the learning rate, I believe neither the ImageNet nor the MoCo implementation is correctly backend agnostic. The MoCo repository claims "similar results" with half the GPUs, half the effective batch size, and half the stated learning rate (which amounts to essentially the same effective rate per sample, but a smaller batch size). A 0.5x change in batch size at the same effective learning rate will likely not change much, so I am not surprised they can get similar results this way.
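The arithmetic behind that claim can be sketched with the linear scaling rule (learning rate proportional to effective batch size); the function name is hypothetical and the numbers below are illustrative, not taken from the MoCo paper:

```python
def linearly_scaled_lr(base_lr: float, base_batch: int, new_batch: int) -> float:
    """Linear scaling rule: scale the learning rate in proportion
    to the change in effective batch size."""
    return base_lr * new_batch / base_batch


# Illustrative reference config: lr 0.03 at effective batch 256.
base_lr, base_batch = 0.03, 256

# Halving the GPUs halves the effective batch; halving the stated lr
# (0.03 -> 0.015) matches what the linear rule prescribes, so the
# per-sample effective rate is essentially unchanged.
print(linearly_scaled_lr(base_lr, base_batch, base_batch // 2))
```

In other words, halving both quantities keeps the lr/batch ratio fixed, which is why the "similar results" observation is unsurprising rather than evidence of backend agnosticism.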