GPU training (Intermediate)

Audience: Users looking to train across machines or experiment with different scaling techniques.


Distributed Training strategies

Lightning supports multiple ways of doing distributed training.

  • DistributedDataParallel (multiple-gpus across many machines)
    • Regular (strategy='ddp')

    • Spawn (strategy='ddp_spawn')

    • Notebook/Fork (strategy='ddp_notebook')

Note

If you request multiple GPUs or nodes without setting a strategy, DDP will be automatically used.

For a deeper understanding of what Lightning is doing, feel free to read this guide.

Distributed Data Parallel

DistributedDataParallel (DDP) works as follows:

  1. Each GPU across each node gets its own process.

  2. Each GPU gets visibility into a subset of the overall dataset. It will only ever see that subset.

  3. Each process inits the model.

  4. Each process performs a full forward and backward pass in parallel.

  5. The gradients are synced and averaged across all processes.

  6. Each process updates its optimizer.

# train on 8 GPUs (same machine (ie: node))
trainer = Trainer(accelerator="gpu", devices=8, strategy="ddp")

# train on 32 GPUs (4 nodes)
trainer = Trainer(accelerator="gpu", devices=8, strategy="ddp", num_nodes=4)

This Lightning implementation of DDP calls your script under the hood multiple times with the correct environment variables:

# example for 3 GPUs DDP
MASTER_ADDR=localhost MASTER_PORT=random() WORLD_SIZE=3 NODE_RANK=0 LOCAL_RANK=0 python my_file.py --accelerator 'gpu' --devices 3 --etc
MASTER_ADDR=localhost MASTER_PORT=random() WORLD_SIZE=3 NODE_RANK=1 LOCAL_RANK=0 python my_file.py --accelerator 'gpu' --devices 3 --etc
MASTER_ADDR=localhost MASTER_PORT=random() WORLD_SIZE=3 NODE_RANK=2 LOCAL_RANK=0 python my_file.py --accelerator 'gpu' --devices 3 --etc

We use DDP this way because ddp_spawn has a few limitations (due to Python and PyTorch):

  1. Since .spawn() trains the model in subprocesses, the model on the main process does not get updated.

  2. Dataloader(num_workers=N), where N is large, bottlenecks training with DDP… ie: it will be VERY slow or won’t work at all. This is a PyTorch limitation.

  3. Forces everything to be picklable.

There are cases in which it is NOT possible to use DDP. Examples are:

  • Jupyter Notebook, Google COLAB, Kaggle, etc.

  • You have a nested script without a root package

In these situations you should use ddp_notebook or dp instead.

Distributed Data Parallel Spawn

ddp_spawn is exactly like ddp except that it uses .spawn to start the training processes.

Warning

It is STRONGLY recommended to use DDP for speed and performance.

mp.spawn(self.ddp_train, nprocs=self.num_processes, args=(model,))

If your script does not support being called from the command line (ie: it is nested without a root project module) you can use the following method:

# train on 8 GPUs (same machine (ie: node))
trainer = Trainer(accelerator="gpu", devices=8, strategy="ddp_spawn")

We STRONGLY discourage this use because it has limitations (due to Python and PyTorch):

  1. The model you pass in will not update. Please save a checkpoint and restore from there.

  2. Set Dataloader(num_workers=0) or it will bottleneck training.

ddp is MUCH faster than ddp_spawn. We recommend you

  1. Install a top-level module for your project using setup.py

# setup.py
#!/usr/bin/env python

from setuptools import setup, find_packages

setup(
    name="src",
    version="0.0.1",
    description="Describe Your Cool Project",
    author="",
    author_email="",
    url="https://github.com/YourSeed",  # REPLACE WITH YOUR OWN GITHUB PROJECT LINK
    install_requires=["lightning"],
    packages=find_packages(),
)
  1. Setup your project like so:

/project
    /src
        some_file.py
        /or_a_folder
    setup.py
  1. Install as a root-level package

cd /project
pip install -e .

You can then call your scripts anywhere

cd /project/src
python some_file.py --accelerator 'gpu' --devices 8 --strategy 'ddp'

Distributed Data Parallel in Notebooks

DDP Notebook/Fork is an alternative to Spawn that can be used in interactive Python and Jupyter notebooks, Google Colab, Kaggle notebooks, and so on: The Trainer enables it by default when such environments are detected.

# train on 8 GPUs in a Jupyter notebook
trainer = Trainer(accelerator="gpu", devices=8)

# can be set explicitly
trainer = Trainer(accelerator="gpu", devices=8, strategy="ddp_notebook")

# can also be used in non-interactive environments
trainer = Trainer(accelerator="gpu", devices=8, strategy="ddp_fork")

Among the native distributed strategies, regular DDP (strategy="ddp") is still recommended as the go-to strategy over Spawn and Fork/Notebook for its speed and stability but it can only be used with scripts.

Comparison of DDP variants and tradeoffs

DDP variants and their tradeoffs

DDP

DDP Spawn

DDP Notebook/Fork

Works in Jupyter notebooks / IPython environments

No

No

Yes

Supports multi-node

Yes

Yes

Yes

Supported platforms

Linux, Mac, Win

Linux, Mac, Win

Linux, Mac

Requires all objects to be picklable

No

Yes

No

Limitations in the main process

None

The state of objects is not up-to-date after returning to the main process (Trainer.fit() etc). Only the model parameters get transferred over.

GPU operations such as moving tensors to the GPU or calling torch.cuda functions before invoking Trainer.fit is not allowed.

Process creation time

Slow

Slow

Fast

Distributed and 16-bit precision

Below are the possible configurations we support.

1 GPU

1+ GPUs

DDP

16-bit

command

Y

Trainer(accelerator=”gpu”, devices=1)

Y

Y

Trainer(accelerator=”gpu”, devices=1, precision=16)

Y

Y

Trainer(accelerator=”gpu”, devices=k, strategy=’ddp’)

Y

Y

Y

Trainer(accelerator=”gpu”, devices=k, strategy=’ddp’, precision=16)

DDP can also be used with 1 GPU, but there’s no reason to do so other than debugging distributed-related issues.

Implement Your Own Distributed (DDP) training

If you need your own way to init PyTorch DDP you can override lightning.pytorch.strategies.ddp.DDPStrategy.setup_distributed().

If you also need to use your own DDP implementation, override lightning.pytorch.strategies.ddp.DDPStrategy.configure_ddp().


Torch Distributed Elastic

Lightning supports the use of Torch Distributed Elastic to enable fault-tolerant and elastic distributed job scheduling. To use it, specify the ‘ddp’ backend and the number of GPUs you want to use in the trainer.

Trainer(accelerator="gpu", devices=8, strategy="ddp")

To launch a fault-tolerant job, run the following on all nodes.

python -m torch.distributed.run
        --nnodes=NUM_NODES
        --nproc_per_node=TRAINERS_PER_NODE
        --rdzv_id=JOB_ID
        --rdzv_backend=c10d
        --rdzv_endpoint=HOST_NODE_ADDR
        YOUR_LIGHTNING_TRAINING_SCRIPT.py (--arg1 ... train script args...)

To launch an elastic job, run the following on at least MIN_SIZE nodes and at most MAX_SIZE nodes.

python -m torch.distributed.run
        --nnodes=MIN_SIZE:MAX_SIZE
        --nproc_per_node=TRAINERS_PER_NODE
        --rdzv_id=JOB_ID
        --rdzv_backend=c10d
        --rdzv_endpoint=HOST_NODE_ADDR
        YOUR_LIGHTNING_TRAINING_SCRIPT.py (--arg1 ... train script args...)

See the official Torch Distributed Elastic documentation for details on installation and more use cases.

Optimize multi-machine communication

By default, Lightning will select the nccl backend over gloo when running on GPUs. Find more information about PyTorch’s supported backends here.

Lightning allows explicitly specifying the backend via the process_group_backend constructor argument on the relevant Strategy classes. By default, Lightning will select the appropriate process group backend based on the hardware used.

from lightning.pytorch.strategies import DDPStrategy

# Explicitly specify the process group backend if you choose to
ddp = DDPStrategy(process_group_backend="nccl")

# Configure the strategy on the Trainer
trainer = Trainer(strategy=ddp, accelerator="gpu", devices=8)