Effective learning rate and batch size with Lightning in DDP

Expanding on why batch sizes are "handled differently" in distributed training:

DDP is special in that it spawns multiple parallel processes that each train independently on their own shard of the data. We want to keep the batch_size exactly as the user set it in their dataloaders, because that's what each GPU will see. Compare this to DP: there the batch gets split into N pieces (scatter) and collected again after the forward pass (gather), so all GPUs work on the same batch. That doesn't scale well, because every time you add more GPUs you have to recompute the batch size to keep both the desired learning behaviour and the per-GPU memory fit. With DDP you don't have that problem, and as @teddy already explained, it makes hyperparameter tuning easier when scaling to more devices. A small sketch of what this looks like in practice is below.
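Here is a minimal sketch (the ToyModule and the numbers are illustrative, not from this thread) showing that under DDP the DataLoader's batch_size is what each process sees per step, so the effective global batch size is batch_size times the number of devices:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl


class ToyModule(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(10, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return nn.functional.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)


batch_size = 32   # per-process batch size, exactly as set in the DataLoader
devices = 4       # number of DDP processes (one per GPU)

dataset = TensorDataset(torch.randn(1024, 10), torch.randn(1024, 1))
loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)

trainer = pl.Trainer(accelerator="gpu", devices=devices, strategy="ddp", max_epochs=1)
trainer.fit(ToyModule(), loader)

# Each process steps on 32 samples and gradients are averaged across processes,
# so one optimizer step effectively uses 32 * 4 = 128 samples.
```

Note that Lightning inserts a DistributedSampler for you under DDP, so each process really does get its own batch_size samples rather than a slice of them.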

Proportionally: the larger the effective batch size, the more accurate your estimate of the gradient over the full dataset, so you can take correspondingly larger steps. That's my understanding.
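A small sketch of that "linear scaling rule" heuristic (the base values here are illustrative, not from this thread):

```python
base_lr = 0.1
base_batch_size = 32          # batch size the base_lr was tuned for (single GPU)

devices = 4
per_gpu_batch_size = 32
effective_batch_size = per_gpu_batch_size * devices        # 128 under DDP

# Scale the learning rate by the same factor as the effective batch size.
scaled_lr = base_lr * effective_batch_size / base_batch_size   # 0.4
print(scaled_lr)
```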
