Effective learning rate and batch size with Lightning in DDP

Thanks, that's helpful. Though I'm still a bit confused - it seems like I'd still have to modify hyperparameters, since to get the same (global) behavior in DDP as in single-GPU training I need to divide the batch_size I specify by N and multiply the learning_rate I specify by N (rough sketch below). Empirically, naively leaving both unchanged and switching to DDP doesn't seem to work well.
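
Just to make concrete what I mean by "divide and multiply by N" - a rough sketch, where `base_lr` / `base_batch_size` are placeholders for whatever worked on a single GPU (not code from any particular repo):

```python
def ddp_hparams(base_lr: float, base_batch_size: int, num_gpus: int):
    """Adjust single-GPU hyperparameters for DDP so the global behavior
    roughly matches the single-GPU run (this is what I seem to need
    empirically, not something from the Lightning docs).

    - the batch_size I pass is used per process under DDP, so divide it
      by num_gpus to keep the same effective (global) batch size
    - the learning_rate gets multiplied by num_gpus
    """
    per_gpu_batch_size = base_batch_size // num_gpus
    scaled_lr = base_lr * num_gpus
    return scaled_lr, per_gpu_batch_size
```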

As a concrete code example, please see the link I posted above - as described in their README, to reproduce MoCo’s results, they multiply the paper’s learning rate, 0.03, by 8 to get 0.24, and divide the batch_size, 256, by 8 to get 32. (If you don’t do this, the results are substantially worse.)
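
Spelling out the arithmetic from their README (just the numbers, not their actual code):

```python
num_gpus = 8

# learning rate: the paper's 0.03, multiplied by the number of GPUs
ddp_lr = 0.03 * num_gpus               # -> 0.24

# batch size: the value passed is per GPU, so divide the paper's 256
per_gpu_batch_size = 256 // num_gpus   # -> 32
```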

I also had some confusion about what you mentioned regarding learning rate vs. batch size - my impression is that we should increase the learning rate when we use a larger effective batch size, for example by the linear scaling rule (section 2.1 of https://arxiv.org/pdf/1706.02677.pdf) or by the square root (as in the recent Lightning SimCLR implementation). But the DDP behavior you described effectively reduces the learning rate, so we end up needing to compensate twice?
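
For reference, here's how I understand the two scaling rules (linear from section 2.1 of the Goyal et al. paper above, square-root as in the SimCLR-style code) - just a sketch of my reading, not taken from either codebase; `base_lr` / `base_batch_size` are the reference values the recipe was originally tuned at:

```python
import math

def scale_lr_linear(base_lr, base_batch_size, effective_batch_size):
    # Linear scaling rule: lr grows proportionally with the effective
    # (global) batch size.
    return base_lr * effective_batch_size / base_batch_size

def scale_lr_sqrt(base_lr, base_batch_size, effective_batch_size):
    # Square-root scaling: lr grows with the square root of the ratio.
    return base_lr * math.sqrt(effective_batch_size / base_batch_size)
```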

(@teddy posted his comment about linear scaling while I was writing this, glad we’re thinking along similar lines :slight_smile: )