Effective learning rate and batch size with Lightning in DDP

teddy · August 30, 2020, 8:53pm

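Great question! As you mention, when you use DDP over N GPUs, your effective batch size is N x batch_size. After summing the gradients from each GPU, DDP divides them by N, so compared with applying the summed gradient directly, the effective learning rate is learning_rate / N. Put another way, with a mean-reduced loss and equal per-GPU batches, the averaged gradient is the same as the mean gradient over the full N x batch_size batch.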
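
To make the arithmetic concrete, here is a minimal pure-Python sketch that simulates the sum-then-divide step by hand. It does not call DDP or Lightning; the one-parameter model, the toy data, and names such as `mean_grad` are assumptions made up for illustration.

```python
# Simulate DDP-style gradient averaging for a one-parameter model.
# Nothing here uses torch; N, batch_size and the data are illustrative assumptions.

def mean_grad(w, xs, ys):
    """Gradient of the mean squared error mean((w*x - y)^2) with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

N = 4             # number of simulated GPUs
batch_size = 2    # per-GPU batch size
learning_rate = 0.1
w = 0.5

# One global batch of N * batch_size samples, split into equal shards.
xs = [float(i + 1) for i in range(N * batch_size)]
ys = [3.0 * x for x in xs]

# Each simulated GPU computes the mean gradient over its own shard.
local_grads = [
    mean_grad(w,
              xs[rank * batch_size:(rank + 1) * batch_size],
              ys[rank * batch_size:(rank + 1) * batch_size])
    for rank in range(N)
]

summed_grad = sum(local_grads)     # the all-reduce sum
averaged_grad = summed_grad / N    # the division by N

# Reference: mean gradient over the whole N * batch_size batch.
full_batch_grad = mean_grad(w, xs, ys)

# A step with the averaged gradient equals a step with the summed
# gradient taken at learning_rate / N.
step_with_average = learning_rate * averaged_grad
step_with_sum = (learning_rate / N) * summed_grad

print(averaged_grad, full_batch_grad)
print(step_with_average, step_with_sum)
```

The two printed pairs should agree, which is the sense in which the batch size scales by N while the learning rate, measured against the summed gradient, scales by 1 / N.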