GPU training (FAQ)

How should I adjust the learning rate when using multiple devices?

When using distributed training, adjust your learning rate to match your effective batch size.

Let’s say you have a batch size of 7 in your dataloader.

class LitModel(LightningModule):
    def train_dataloader(self):
        return DataLoader(..., batch_size=7)

Whenever you use multiple devices and/or nodes, your effective batch size will be 7 * devices * num_nodes.

# effective batch size = 7 * 8
Trainer(accelerator="gpu", devices=8, strategy=...)

# effective batch size = 7 * 8 * 10
Trainer(accelerator="gpu", devices=8, num_nodes=10, strategy=...)


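Very large batch sizes can hurt convergence if the learning rate is left unchanged. See: Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour, which proposes scaling the learning rate linearly with the batch size and adding a warmup phase.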
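
As a rough illustration of the arithmetic above, the following sketch computes the effective batch size and applies the linear scaling rule from the paper cited above (scale the learning rate in proportion to the batch size). The helper names `effective_batch_size` and `scaled_learning_rate` and the reference values are hypothetical and not part of Lightning. Whether linear scaling suits your model and optimizer is an assumption to verify empirically.

```python
def effective_batch_size(per_device_batch_size, devices=1, num_nodes=1):
    """Samples consumed per optimizer step across all processes (ignoring gradient accumulation)."""
    return per_device_batch_size * devices * num_nodes


def scaled_learning_rate(base_lr, base_batch_size, per_device_batch_size, devices=1, num_nodes=1):
    """Linear scaling: the learning rate grows in proportion to the effective batch size."""
    total = effective_batch_size(per_device_batch_size, devices, num_nodes)
    return base_lr * total / base_batch_size


# Hypothetical reference point: lr=0.01 was tuned on a single device with batch size 7.
print(effective_batch_size(7, devices=8))                # 56
print(effective_batch_size(7, devices=8, num_nodes=10))  # 560
lr = scaled_learning_rate(0.01, base_batch_size=7, per_device_batch_size=7, devices=8)
```

The resulting `lr` (about 0.08 in this example) is the value you would pass to your optimizer in `configure_optimizers`.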

How do I use multiple GPUs on Jupyter or Colab notebooks?

To use multiple GPUs in notebooks, use the ddp_notebook strategy.

Trainer(accelerator="gpu", devices=4, strategy="ddp_notebook")

If you want to use other strategies, launch your training as a script from the command line instead. See also: Interactive Notebooks (Jupyter, Colab, Kaggle)
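
If the same code has to run both in a notebook and as a script, one option is to pick the strategy at runtime. The sketch below is a heuristic under stated assumptions: `running_in_notebook` and `pick_strategy` are hypothetical helpers, checking for a loaded `ipykernel` module is a common but imperfect way to detect a Jupyter-style kernel, and `"ddp"` is used only as an example of a non-notebook strategy.

```python
import sys


def running_in_notebook(modules=None):
    """Heuristic: Jupyter-style kernels import `ipykernel` at startup."""
    modules = sys.modules if modules is None else modules
    return "ipykernel" in modules


def pick_strategy(modules=None):
    """Return "ddp_notebook" inside a notebook kernel, otherwise "ddp"."""
    return "ddp_notebook" if running_in_notebook(modules) else "ddp"


strategy = pick_strategy()
# The result can then be passed to the Trainer, for example:
# Trainer(accelerator="gpu", devices=4, strategy=strategy)
```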