Effective learning rate and batch size with Lightning in DDP

I realized one point of confusion stems from the official PyTorch DDP example code (for ImageNet) - it turns out they manually scale batch_size to batch_size / n_gpus_per_node when using DDP with one GPU per process (https://github.com/pytorch/examples/blob/master/imagenet/main.py#L151), with this comment:

# When using a single GPU per process and per
# DistributedDataParallel, we need to divide the batch size
# ourselves based on the total number of GPUs we have

Only with this scaling does switching backends become equivalent, and the official MoCo code inherits it (it's essentially a fork of that example: https://github.com/facebookresearch/moco/blob/master/main_moco.py#L174). I agree that not scaling is the default behavior for nn.parallel, but it yields substantially different behavior across backends. I'm wondering whether Lightning users who switch between distributed backends and expect similar results could be thrown off by this (I definitely have been!), since scaling the per-GPU batch size seems necessary to match behavior across backends.
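
For concreteness, this is the scaling I'm referring to, as a minimal sketch (the helper name is mine, not from the example, which as far as I can tell just divides args.batch_size in place):

```python
# Minimal sketch (hypothetical helper, not the actual example code) of the
# per-process batch-size scaling the ImageNet example does when each DDP
# process owns exactly one GPU.
def per_process_batch_size(global_batch_size: int, ngpus_per_node: int) -> int:
    # Each process loads only its share of the global batch; DDP then
    # averages gradients across processes, so one optimizer step behaves
    # like a single-process step on the full global batch.
    return global_batch_size // ngpus_per_node

# e.g. a global batch size of 256 spread over an 8-GPU node
assert per_process_batch_size(256, 8) == 32
```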

On the other hand, I’m still relatively confused about the learning rate. @teddy mentioned

you can likely increase the number of gpus without changing any hyperparameters as your effective learning rate and batch size will change linearly

but it seems that they'll change in the wrong directions: we want the learning rate to increase with the batch size, yet the effective batch size goes up with n_gpus while the effective learning rate goes down with n_gpus if it's being divided. (Actually, looking further into the PyTorch source link, it seems the division is only there to turn the gradient sum into an average, keeping the effective learning rate the same - per mrshenli's reply in the PyTorch forums thread "Should we split batch_size according to ngpu_per_node when DistributedDataParallel" - any thoughts on this?)

To compensate, would we need to multiply the learning rate by n_gpus**2? Then again, in the ImageNet example they don't seem to touch the learning rate at all, so is everything accounted for as long as we scale the batch size per GPU? This behavior actually seems to differ in Lightning, where the link in my first comment notes you need to multiply the learning rate by n_gpus. I'm just hypothesizing, but could the difference be related to how the ImageNet example creates the optimizer from model.parameters() after wrapping with DDP, versus how configure_optimizers works?
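
For what it's worth, here's the back-of-the-envelope arithmetic I'm working from. The numbers are made up, and Scenario B is just my reading of what Lightning does (per the link in my first comment), so please correct me if the assumptions are off:

```python
# Toy numbers only - this is my mental model, not verified Lightning/PyTorch behavior.
base_batch, base_lr, n_gpus = 256, 0.1, 8

# Scenario A (ImageNet example): divide the batch size per GPU, keep the LR.
per_gpu_batch = base_batch // n_gpus           # 32 samples per process
effective_batch_a = per_gpu_batch * n_gpus     # still 256 overall
lr_a = base_lr                                 # unchanged: DDP averages gradients,
                                               # so a step looks like one 256-sample step

# Scenario B (my reading of Lightning): keep the per-GPU batch, so the
# effective batch grows, and scale the LR linearly to compensate.
effective_batch_b = base_batch * n_gpus        # 2048 overall
lr_b = base_lr * n_gpus                        # 0.8

print(effective_batch_a, lr_a)   # 256 0.1
print(effective_batch_b, lr_b)   # 2048 0.8
```

If that's roughly right, the ImageNet example and Lightning aren't really contradicting each other; they just hold different things fixed (the global batch size vs. the per-GPU batch size), which would explain the different learning-rate advice.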

I realize this is kind of a long post (I tried to err on the side of providing too much info rather than too little) - sorry, and thanks again for the input.