Hi,
For DDP, the documentation recommends multiplying the batch size by the number of GPUs, but in PyTorch Lightning 2.1.3 with DDP it seems to automatically multiply the batch size by the number of GPUs. Is this a bug or intended behavior? If it is intentional, please point me to the latest documentation.

My goal is to have a baseline where, whether I train a model on one GPU or on multiple GPUs with DDP, I get roughly similar results. Also, please advise how to handle the batch size and learning rate when switching from one GPU to DDP.
Please help
Hey @paxandfidem
That’s correct: when you use DDP, the batch size you set in the dataloader is always local to the process of one GPU. The total (global) batch size is N * batch_size, so in that sense it scales automatically.
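For concreteness, here is a tiny sketch of that arithmetic (the numbers `devices=4` and `batch_size=32` are just example values, not anything Lightning sets for you):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# The batch size you configure here is the per-GPU (per-process) batch size.
loader = DataLoader(TensorDataset(torch.randn(1024, 16)), batch_size=32)

devices = 4  # e.g. Trainer(accelerator="gpu", devices=4, strategy="ddp")
global_batch_size = devices * loader.batch_size
print(global_batch_size)  # 128 samples contribute to each optimizer step
```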
> My goal is to have a baseline where, whether I train a model on one GPU or on multiple GPUs with DDP, I get roughly similar results.
If you want to do this, then run two experiments. First, choose a global batch size B. Then:
- Train on a single GPU with batch size B.
- Train on N GPUs with per-GPU batch size B/N.
(Make sure that B is divisible by N, of course.) This way, the global batch size is the same in both experiments, and you should get approximately the same loss values. The only reason it won't be exactly the same is that samples get allocated to batches in a different order.
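If it helps, here is a minimal, self-contained sketch of those two experiments. The `TinyModel`, the random dataset, and the values B=256 / 4 devices are placeholders you would replace with your own code; they are not part of Lightning itself:

```python
# Run each experiment as a separate invocation, e.g.:
#   python compare.py --devices 1
#   python compare.py --devices 4
import argparse

import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
import lightning as L


class TinyModel(L.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(32, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return nn.functional.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--devices", type=int, default=1)
    args = parser.parse_args()

    B = 256                      # global batch size, kept fixed across experiments
    N = args.devices             # number of GPUs; must divide B
    per_gpu_batch_size = B // N  # what the dataloader sees in each process

    dataset = TensorDataset(torch.randn(2048, 32), torch.randn(2048, 1))
    loader = DataLoader(dataset, batch_size=per_gpu_batch_size, shuffle=True)

    trainer = L.Trainer(
        accelerator="gpu",
        devices=N,
        strategy="ddp" if N > 1 else "auto",
        max_epochs=1,
    )
    trainer.fit(TinyModel(), loader)


if __name__ == "__main__":
    main()
```

With `--devices 1` the dataloader batch size is B; with `--devices 4` each process uses B/N, so both runs process B samples per optimizer step.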
I can improve the text in the docs if you can point me to it.