Effective learning rate and batch size with Lightning in DDP

Thanks for the example, @teddy; it matches my understanding for the case where the per-GPU batch size changes but the effective batch size stays constant. This has been a really helpful discussion. (Currently I'm basically thinking along the lines of what @goku said.) I just wanted to clarify a couple of minor things.

To get this straight: in this example, they're using half the effective batch size with half the GPUs, so the per-GPU batch size (what each process sees) stays the same (256/8 = 128/4 = 32), since they scale the total batch size with the number of GPUs. If I understand correctly now, the learning rate we pass in for DDP (unlike DP) should correspond to the per-GPU batch size we pass in. So why would halving the given learning rate result in the same effective learning rate? To keep things the same, should we instead keep the same learning rate? Is it correct to say that when we apply the linear scaling heuristic for DDP, it is with respect to the per-GPU batch size, not the effective batch size?
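
Just to make the arithmetic behind my question concrete, here's how I'm picturing the two possible readings of the linear scaling heuristic (the `base_lr`/`base_batch` reference point is made up, not from the example above):

```python
# Hypothetical reference numbers, just to make the two readings of "linear scaling" explicit.
per_gpu_batch = 32                 # what each DDP process sees; fixed in the example above
base_lr, base_batch = 0.1, 256     # made-up reference point for the linear scaling heuristic

for num_gpus in (8, 4):
    effective_batch = per_gpu_batch * num_gpus            # 256 on 8 GPUs, 128 on 4 GPUs

    # Reading 1: scale the LR with the *effective* batch size -> 0.1 vs 0.05 (halved)
    lr_vs_effective = base_lr * effective_batch / base_batch

    # Reading 2: scale the LR with the *per-GPU* batch size -> 0.0125 in both cases (unchanged)
    lr_vs_per_gpu = base_lr * per_gpu_batch / base_batch

    print(num_gpus, effective_batch, lr_vs_effective, lr_vs_per_gpu)
```

Reading 1 is what would justify halving the given learning rate; Reading 2 is what I thought applied to DDP, which is why halving it seems wrong to me.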

Why would one want to scale the batch size passed to each DDP process by the number of GPUs (dividing the total, like the ImageNet example and MoCo do)? My understanding is that Lightning chooses not to do this scaling for the user, since the other hyperparameters (the learning rate in particular) mostly vary with the per-GPU/given batch size rather than the effective batch size.
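
To make sure I'm reading those examples right, this is how I picture the two conventions (just a sketch, not code taken from either repo; `ngpus_per_node`, the dummy dataset, and the batch sizes are stand-ins):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(1024, 3))   # dummy data just for the sketch
total_batch_size = 256                          # what the CLI flag means in the ImageNet example, as I read it
ngpus_per_node = 8

# ImageNet-example / MoCo style (from memory): the flag is the *total* batch size,
# so each DDP worker divides it by the number of GPUs before building its DataLoader.
per_gpu_batch_size = total_batch_size // ngpus_per_node      # 32 per process
imagenet_style_loader = DataLoader(dataset, batch_size=per_gpu_batch_size)

# Lightning convention (as I understand it): the batch_size handed to the DataLoader
# is already per-process; the effective batch size is batch_size * num_gpus * num_nodes,
# and nothing rescales it for you.
lightning_style_loader = DataLoader(dataset, batch_size=32)
```

If that sketch is accurate, both end up with the same per-GPU batch size; the difference is only which number (total vs. per-GPU) the user is asked to specify, which circles back to my question about which one the learning rate should follow.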