Effective learning rate and batch size with Lightning in DDP

Regarding the Lightning MoCo repo code, it makes sense that they now use the same learning rate as the official MoCo repository, as both use DDP. Each model now has a per-GPU batch size of 32 and a per-GPU learning rate of 0.03. I'm not sure what changed since 0.7.1; maybe @williamfalcon has some insight.

Now let's say you wanted to train the same model on one GPU with a batch size of 256. You would have to adjust your learning rate to 0.03 / 8 = 0.00375. Why is this?

Let's say we have an effective batch of 256 images that produces a gradient g, and a learning rate lr.

For the case of 8 GPUs, each GPU sees 32 of the 256 images, so the per-GPU gradient becomes g/8 and our parameter delta is lr × g/8.

For the case of 1 GPU, the whole 256-image batch is on one device, so the gradient is just g and our parameter delta is lr × g.
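
To make the two cases concrete, here is a toy numeric check. The value of g is just a placeholder scalar standing in for the full-batch gradient; it isn't taken from the MoCo code.

```python
# Toy numbers only: g stands in for the gradient of the full 256-image batch.
g = 1.0    # placeholder full-batch gradient
lr = 0.03  # learning rate MoCo uses with 8 GPUs x 32 images each

# 8-GPU DDP case: each GPU sees 32 of the 256 images, so its gradient is roughly g / 8
delta_8gpu = lr * (g / 8)

# 1-GPU case: the full 256-image batch produces gradient g
delta_1gpu_same_lr = lr * g            # keeps lr = 0.03
delta_1gpu_scaled_lr = (lr / 8) * g    # divides lr by 8

print(delta_8gpu)            # 0.00375
print(delta_1gpu_same_lr)    # 0.03    -> an 8x larger step than the DDP setup
print(delta_1gpu_scaled_lr)  # 0.00375 -> matches the 8-GPU parameter delta
```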

If we want to make these parameter deltas consistent, we either have to divide the learning rate by 8 for the single-GPU case or multiply it by 8 for the multi-GPU case. Which option we choose depends on which setup we treat as the reference. In the case of MoCo, the authors show that a learning rate of 0.03 works when the per-GPU batch size is 32 across 8 GPUs, so we work backwards and find that a learning rate of 0.03 / 8 = 0.00375 should be used when training the full 256-image batch on one GPU.
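
If you'd rather compute the adjustment than hard-code 0.00375, a minimal sketch could look like the following. The scaled_lr helper and its constants are my own illustration of the rule above, not something from the Lightning MoCo repo.

```python
# Assumes the effective batch stays at 256 images, split evenly across GPUs,
# following the parameter-delta argument above.
REFERENCE_LR = 0.03      # per-GPU lr MoCo uses with 8 GPUs x 32 images each
REFERENCE_NUM_GPUS = 8

def scaled_lr(num_gpus: int) -> float:
    """Learning rate that keeps the parameter delta constant for `num_gpus` GPUs."""
    return REFERENCE_LR * num_gpus / REFERENCE_NUM_GPUS

print(scaled_lr(8))  # 0.03    -> the published MoCo setting
print(scaled_lr(1))  # 0.00375 -> full 256-image batch on a single GPU
```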
