Effective learning rate and batch size with Lightning in DDP

I’m also still somewhat confused, along the same lines as @goku. I’m starting to think the effective learning rate depends on the local batch size rather than the effective/cumulative one (see the thread “Should we split batch_size according to ngpu_per_node when DistributedDataparallel”, reply #4 by mrshenli in the distributed category on the PyTorch Forums, linked earlier). That is, given the way DDP averages gradients, does the effective learning rate really change? That thread suggests it does not, but I’m finding it difficult to parse.
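To make the averaging concrete, here’s a minimal single-process sketch of what I understand DDP’s mean all-reduce to do (the toy model, batch sizes, and “rank” count are made up for illustration, not taken from any repo):

```python
import torch

# Hypothetical toy setup: one linear layer, a global batch of 8 samples.
torch.manual_seed(0)
model = torch.nn.Linear(10, 1)
data = torch.randn(8, 10)
target = torch.randn(8, 1)
loss_fn = torch.nn.MSELoss()  # mean reduction, as is typical

# (a) Single-process gradient on the full (global) batch.
model.zero_grad()
loss_fn(model(data), target).backward()
full_grad = model.weight.grad.clone()

# (b) Simulated DDP: 4 "ranks", each with a local batch of 2,
#     followed by the mean all-reduce that DDP performs.
local_grads = []
for shard, tgt in zip(data.chunk(4), target.chunk(4)):
    model.zero_grad()
    loss_fn(model(shard), tgt).backward()
    local_grads.append(model.weight.grad.clone())
ddp_grad = torch.stack(local_grads).mean(dim=0)

# The averaged per-rank gradients equal the full-batch gradient, so one
# DDP step with local batch 2 on 4 GPUs behaves like one single-GPU step
# with batch 8 at the *same* learning rate.
print(torch.allclose(full_grad, ddp_grad, atol=1e-6))  # True
```

If that’s right, then the gradient magnitude already reflects the global batch, which is what makes me unsure whether any extra learning-rate scaling is needed on top.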

In an interesting twist, I asked the authors of the Lightning MoCo repo I linked above why they scaled the learning rate, and they said this scaling was needed in Lightning 0.7.1 but is no longer needed: with 8 GPUs they previously had to use 0.03 * 8 = 0.24, but now 0.03 works (apparently). Any idea what could have changed between then and now?
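For reference, the scaling they described sounds like the standard linear scaling rule applied per GPU. A sketch of the arithmetic as I understand it (not their actual code; the variable names are mine):

```python
# Linear scaling rule: scale the base LR by the number of workers,
# assuming the per-GPU batch size stays fixed so the global batch grows.
base_lr = 0.03   # LR tuned for the single-GPU reference batch size
num_gpus = 8
scaled_lr = base_lr * num_gpus  # 0.03 * 8 = 0.24, what Lightning 0.7.1 needed
```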