Launch distributed training

To run your code distributed across many devices and many machines, you need to do two things:

Configure Fabric with the number of devices and number of machines you want to use
Launch your code in multiple processes

Simple Launch

You can configure and launch processes on your machine directly with Fabric’s launch() method:

# train.py
from lightning.fabric import Fabric

...

# Configure accelerator, devices, num_nodes, etc.
fabric = Fabric(devices=4, ...)

# This launches itself into multiple processes
fabric.launch()

In the command line, you run this like any other Python script:

python train.py

This is the recommended way to run on a single machine and the most convenient method for development and debugging.

It is also possible to use Fabric in a Jupyter notebook (including Google Colab, Kaggle, etc.) and launch multiple processes there. The Fabric notebook guide covers this in more detail.

Launch with the CLI

An alternative way to launch your Python script in multiple processes is to use the dedicated command line interface (CLI):

fabric run path/to/your/script.py

This is essentially the same as running python path/to/your/script.py, but it also lets you configure the following settings externally without changing your code:

--accelerator: The accelerator to use
--devices: The number of devices to use (per machine)
--num_nodes: The number of machines (nodes) to use
--precision: Which type of precision to use
--strategy: The strategy (communication layer between processes)

The full list of options is available from the built-in help:

fabric run --help

Usage: fabric run [OPTIONS] SCRIPT [SCRIPT_ARGS]...

  Run a Lightning Fabric script.

  SCRIPT is the path to the Python script with the code to run. The script
  must contain a Fabric object.

  SCRIPT_ARGS are the remaining arguments that you can pass to the script
  itself and are expected to be parsed there.

Options:
  --accelerator [cpu|gpu|cuda|mps|tpu]
                                  The hardware accelerator to run on.
  --strategy [ddp|dp|deepspeed]   Strategy for how to run across multiple
                                  devices.
  --devices TEXT                  Number of devices to run on (``int``), which
                                  devices to run on (``list`` or ``str``), or
                                  ``'auto'``. The value applies per node.
  --num-nodes, --num_nodes INTEGER
                                  Number of machines (nodes) for distributed
                                  execution.
  --node-rank, --node_rank INTEGER
                                  The index of the machine (node) this command
                                  gets started on. Must be a number in the
                                  range 0, ..., num_nodes - 1.
  --main-address, --main_address TEXT
                                  The hostname or IP address of the main
                                  machine (usually the one with node_rank =
                                  0).
  --main-port, --main_port INTEGER
                                  The main port to connect to the main
                                  machine.
  --precision [16-mixed|bf16-mixed|32-true|64-true|64|32|16|bf16]
                                  Double precision (``64-true`` or ``64``),
                                  full precision (``32-true`` or ``32``), half
                                  precision (``16-mixed`` or ``16``) or
                                  bfloat16 precision (``bf16-mixed`` or
                                  ``bf16``)
  --help                          Show this message and exit.
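The help text says that SCRIPT_ARGS are expected to be parsed by the script itself. A minimal sketch of what that parsing might look like with the standard library's argparse follows. The function name `parse_script_args` and the flags `--lr` and `--epochs` are hypothetical and chosen only for illustration; they are not Fabric options.

```python
import argparse


def parse_script_args(argv=None):
    """Parse the script's own arguments (the SCRIPT_ARGS part of the command).

    With argv=None, argparse falls back to sys.argv[1:].
    """
    parser = argparse.ArgumentParser(description="Hypothetical training script")
    # --lr and --epochs are illustrative names, not Fabric options.
    parser.add_argument("--lr", type=float, default=1e-3, help="learning rate")
    parser.add_argument("--epochs", type=int, default=1, help="number of epochs")
    return parser.parse_args(argv)


# Simulate the script receiving `--lr 0.01 --epochs 3` as its SCRIPT_ARGS.
args = parse_script_args(["--lr", "0.01", "--epochs", "3"])
print(args.lr, args.epochs)
```

Keeping hyperparameters in the script's own parser and hardware settings in the `fabric run` options separates what the training code does from where it runs.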

Here is how you run DDP with 8 GPUs and torch.bfloat16 precision:

fabric run ./path/to/train.py \
    --strategy=ddp \
    --devices=8 \
    --accelerator=cuda \
    --precision="bf16"

Or DeepSpeed ZeRO Stage 3 with 16-bit mixed precision:

fabric run ./path/to/train.py \
    --strategy=deepspeed_stage_3 \
    --devices=8 \
    --accelerator=cuda \
    --precision=16

Fabric can also select the accelerator and the number of devices automatically:

fabric run ./path/to/train.py \
    --devices=auto \
    --accelerator=auto \
    --precision=16
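Because the `--devices` value applies per node, the total number of processes follows from it together with `--num-nodes`. The sketch below works through that arithmetic for integer device counts only (not `'auto'` or a device list). The helper names are hypothetical, and the global rank formula is the commonly used node-by-node numbering, stated here as an assumption rather than something this page specifies.

```python
def world_size(devices: int, num_nodes: int) -> int:
    """Total number of processes, given that `devices` applies per node."""
    if devices < 1 or num_nodes < 1:
        raise ValueError("devices and num_nodes must be positive")
    return devices * num_nodes


def global_rank(node_rank: int, local_rank: int, devices: int) -> int:
    """Assumed convention: processes are numbered node by node."""
    if node_rank < 0:
        raise ValueError("node_rank must be non-negative")
    if not 0 <= local_rank < devices:
        raise ValueError("local_rank must be in the range 0, ..., devices - 1")
    return node_rank * devices + local_rank


# Two machines with 8 devices each.
print(world_size(devices=8, num_nodes=2))                  # 16
print(global_rank(node_rank=1, local_rank=3, devices=8))   # 11
```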

Launch on a Cluster

Fabric enables distributed training across multiple machines in several ways. Choose from the following options based on your expertise level and available infrastructure.

Run single or multi-node on Lightning Studios (basic): The easiest way to scale models in the cloud. No infrastructure setup required.

SLURM Managed Cluster (intermediate): Most popular for academic and private enterprise clusters.

Bare Bones Cluster (advanced): Train across machines on a network using `torchrun`.

Other Cluster Environments (advanced): MPI, LSF, Kubeflow.
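For a manual multi-machine launch, the `--num-nodes`, `--node-rank`, `--main-address`, and `--main-port` options from the `fabric run --help` output suggest one command per machine, differing only in the node rank. The sketch below generates those command lines as one possible shape of such a launch. The helper `node_commands`, the address `10.0.0.1`, and the port `29500` are hypothetical placeholders, and the sketch only prints the commands; it does not start anything.

```python
import shlex


def node_commands(script, num_nodes, devices, main_address, main_port):
    """Build one `fabric run` command line per machine (node)."""
    commands = []
    for node_rank in range(num_nodes):
        argv = [
            "fabric", "run",
            f"--num-nodes={num_nodes}",
            f"--node-rank={node_rank}",        # index of this machine
            f"--main-address={main_address}",  # host of the main machine
            f"--main-port={main_port}",
            f"--devices={devices}",            # applies per node
            script,
        ]
        commands.append(shlex.join(argv))
    return commands


# Hypothetical two-node setup; address and port are placeholders.
for cmd in node_commands("path/to/train.py", num_nodes=2, devices=8,
                         main_address="10.0.0.1", main_port=29500):
    print(cmd)
```

Each printed line would be run on the machine whose index matches its `--node-rank` value.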

Next steps

Mixed Precision Training (basic): Save memory and speed up training using mixed precision.

Distributed Communication (advanced): Learn all about communication primitives for distributed operation. Gather, reduce, broadcast, etc.