Effective learning rate and batch size with Lightning in DDP

Great question! As you mention, when you use DDP over N GPUs, your effective batch size is N x batch_size. After summing the gradients from each GPU, DDP divides the result by N, so every GPU steps with the average gradient. Compared with stepping on the summed gradient, the effective learning rate is therefore learning_rate / N.
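To make the arithmetic concrete, here is a minimal sketch in plain Python that imitates the averaging step for a one-parameter model. It does not use PyTorch or Lightning; `local_grad`, `n_gpus`, `shards` and the toy data are all hypothetical names chosen for illustration, and it assumes every GPU gets an equally sized batch and a mean-reduced loss.

```python
# Toy simulation of DDP-style gradient averaging (no real GPUs or DDP involved).
# Assumptions: equal per-GPU batch sizes and a mean-reduced squared-error loss.

def local_grad(w, xs, ys):
    """Gradient of mean((w*x - y)**2) with respect to w on one shard of data."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

n_gpus = 4        # N
batch_size = 2    # per-GPU batch size
lr = 0.1
w = 0.5

# N * batch_size samples in total, split into one shard per (simulated) GPU.
xs = [float(i + 1) for i in range(n_gpus * batch_size)]
ys = [3.0 * x for x in xs]
shards = [
    (xs[i * batch_size:(i + 1) * batch_size],
     ys[i * batch_size:(i + 1) * batch_size])
    for i in range(n_gpus)
]

# Each simulated GPU computes a gradient on its own shard.
grads = [local_grad(w, sx, sy) for sx, sy in shards]
summed_grad = sum(grads)
ddp_grad = summed_grad / n_gpus   # sum across GPUs, then divide by N

# The averaged gradient matches one device processing all N * batch_size samples.
full_batch_grad = local_grad(w, xs, ys)

# Stepping with lr on the average equals stepping with lr / N on the sum.
w_avg_step = w - lr * ddp_grad
w_sum_step = w - (lr / n_gpus) * summed_grad

print(ddp_grad, full_batch_grad)
print(w_avg_step, w_sum_step)
```

Under these assumptions, the two printed gradients agree, and so do the two updated weights, which is the sense in which the batch size is multiplied by N while the learning rate applied to the summed gradient is divided by N.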
