Effective learning rate and batch size with Lightning in DDP

I believe they should in fact keep the same learning rate in this case, since the per-GPU batch size is the same. They may have overlooked this, but as I mentioned before, a 2x change in the learning rate will probably not change the results much.
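A minimal sketch of what this means in Lightning (the model, dataset, and hyperparameter values below are hypothetical, not from this thread): the `batch_size` passed to the `DataLoader` is the per-process batch size under DDP, so the effective batch size grows with the number of devices while the learning rate stays untouched.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl


class TinyModel(pl.LightningModule):
    def __init__(self, lr=1e-3):
        super().__init__()
        self.lr = lr                         # kept the same regardless of device count
        self.layer = torch.nn.Linear(16, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=self.lr)


dataset = TensorDataset(torch.randn(1024, 16), torch.randn(1024, 1))
per_gpu_batch_size = 32
train_loader = DataLoader(dataset, batch_size=per_gpu_batch_size)  # per-process batch

num_devices = 8
trainer = pl.Trainer(accelerator="gpu", devices=num_devices,
                     strategy="ddp", max_epochs=1)

# Each DDP process sees `per_gpu_batch_size` samples per step, so gradients
# are averaged over per_gpu_batch_size * num_devices samples in total.
effective_batch_size = per_gpu_batch_size * num_devices
print(f"effective batch size: {effective_batch_size}")  # 256

trainer.fit(TinyModel(), train_loader)
```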

The only reason I can think of to scale the batch size is to be consistent with a paper. If the paper reports a batch size of 256 but you can only fit 32 per GPU, it may appear cleaner to still say you have a batch size of 256, spread over 8 GPUs.
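As a quick sanity check on that arithmetic (numbers taken from the example above, purely illustrative):

```python
# 8 GPUs at a per-GPU batch size of 32 reproduce the paper's reported batch size of 256.
paper_batch_size = 256
per_gpu_batch_size = 32
num_gpus = paper_batch_size // per_gpu_batch_size  # -> 8 GPUs needed
assert per_gpu_batch_size * num_gpus == paper_batch_size
print(f"{num_gpus} GPUs x {per_gpu_batch_size} per GPU = {paper_batch_size} effective")
```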

Exactly. In this way you can scale up without having to change any hyperparameters.
