Effective learning rate and batch size with Lightning in DDP

@sm000 I agree that this can be confusing if you are trying to reproduce the results of a paper. If the paper states that it uses a batch size of 64 with a learning rate of 0.01, but you can only fit a per-GPU batch size of 8 (with 8 GPUs), you must provide a learning rate of 0.01 * 8 = 0.08 to your optimizer; the reasoning is that DDP averages the gradients from each GPU, so the effective learning rate is the optimizer's learning rate divided by the number of GPUs, which brings 0.08 back down to the paper's 0.01.
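
If it helps, here is a minimal sketch of how that scaling could be wired into `configure_optimizers`. Everything in it is a placeholder (the tiny model, the choice of SGD, the `paper_lr` value), and it assumes a Lightning version where the trainer exposes `num_devices`:

```python
import pytorch_lightning as pl
import torch
from torch import nn


class LitModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(32, 2)  # stand-in model

    def training_step(self, batch, batch_idx):
        x, y = batch
        return nn.functional.cross_entropy(self.layer(x), y)

    def configure_optimizers(self):
        paper_lr = 0.01  # learning rate reported in the paper
        # Scale by the number of devices so that, after DDP averages the
        # gradients across GPUs, the step size matches the paper's setup.
        # self.trainer.num_devices would be 8 in the example above.
        return torch.optim.SGD(self.parameters(), lr=paper_lr * self.trainer.num_devices)
```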

On the other hand, if you already have a model that trains well on a single GPU with a given batch size and learning rate, you can likely increase the number of GPUs without changing any hyperparameters, as your effective learning rate and effective batch size will both change linearly with the number of GPUs. I believe this is why it is the default behavior.
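
To make the "effective batch size" part concrete, here is a quick back-of-the-envelope sketch (the variable names are mine; `accumulate_grad_batches` mirrors the Trainer flag of the same name):

```python
# Hypothetical settings matching the example above.
per_device_batch_size = 8
num_devices = 8              # GPUs per node
num_nodes = 1
accumulate_grad_batches = 1  # Lightning Trainer flag of the same name

# Under DDP each device draws its own batch, so one optimizer step
# covers this many samples in total:
effective_batch_size = (
    per_device_batch_size * num_devices * num_nodes * accumulate_grad_batches
)
print(effective_batch_size)  # 64, matching the paper's batch size
```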

The extra wrinkle here is that contrastive methods such as SimCLR rely heavily on in-batch negative sampling, so batch size plays a much larger role than it does for most other tasks, and I am not sure there is a good rule to follow for hyperparameter selection when scaling these methods.
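
To see why batch size matters so much there: in SimCLR, each of the N images in a batch contributes two augmented views, and every view is contrasted against the 2(N - 1) other views as negatives, so the pool of negatives grows directly with batch size. A quick sketch:

```python
def simclr_negatives_per_view(batch_size: int) -> int:
    """Number of in-batch negatives each augmented view sees in SimCLR.

    With N source images there are 2N augmented views; each view has one
    positive (its sibling view) and 2(N - 1) negatives.
    """
    return 2 * (batch_size - 1)


print(simclr_negatives_per_view(64))   # 126
print(simclr_negatives_per_view(512))  # 1022 -- larger batches, many more negatives
```

One more wrinkle under DDP: unless the implementation gathers embeddings across devices, each GPU only contrasts against its own local batch, so the number of usable negatives may track the per-GPU batch size rather than the effective one.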