Training models with billions of parameters

Today, large models with billions of parameters are trained on many GPUs across several machines in parallel. Even a single NVIDIA H100 GPU with 80 GB of VRAM (one of the largest available today) is not enough to train a 30B-parameter model, even with a batch size of 1 and 16-bit precision. The memory consumed during training is generally made up of

  1. the model parameters,

  2. the layer activations (forward),

  3. the gradients (backward),

  4. the optimizer states (e.g., Adam keeps two additional exponential moving averages per parameter), and

  5. model outputs and loss.


When the sum of these memory components exceeds the VRAM of a single GPU, regular distributed data-parallel training (DDP) can no longer be employed. To overcome this limitation, we need to introduce model parallelism.
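
As a rough back-of-the-envelope calculation (ignoring activations, outputs, and the loss, and assuming the common mixed-precision setup where the optimizer keeps fp32 master weights and Adam moments), a 30B-parameter model already needs several times more memory than a single 80 GB GPU offers:

    # Rough memory estimate for a 30B-parameter model trained with Adam in
    # 16-bit mixed precision. Activations, outputs and the loss are ignored.
    num_params = 30e9

    weights_fp16    = num_params * 2   # bf16/fp16 copy of the weights
    gradients_fp16  = num_params * 2   # gradients in the same precision
    master_fp32     = num_params * 4   # fp32 master weights kept by the optimizer
    adam_exp_avg    = num_params * 4   # first moment (exponential moving average)
    adam_exp_avg_sq = num_params * 4   # second moment

    total_bytes = (weights_fp16 + gradients_fp16 + master_fp32
                   + adam_exp_avg + adam_exp_avg_sq)
    print(f"{total_bytes / 1e9:.0f} GB")  # prints "480 GB", versus 80 GB on a single H100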


What is Model Parallelism?

There are different types of model parallelism, each with its own trade-offs.

Fully Sharded Data Parallelism (FSDP) shards the model parameters, gradients, and optimizer states across multiple GPUs, significantly reducing memory usage per GPU. While highly memory-efficient, this method involves frequent synchronization between GPUs, which introduces communication overhead and implementation complexity. FSDP is advantageous when memory constraints are the primary issue, provided there are high-bandwidth interconnects to minimize latency.
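
As a minimal sketch of the idea in plain PyTorch (the model here is a stack of linear layers standing in for a real transformer; launch with torchrun on a machine with 4 GPUs):

    import os
    import torch
    import torch.distributed as dist
    from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

    # Launch with: torchrun --nproc_per_node=4 train_fsdp.py
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # A stand-in for a large model; in practice this would be your transformer.
    model = torch.nn.Sequential(*[torch.nn.Linear(1024, 1024) for _ in range(8)])

    # Wrapping shards the parameters, gradients and optimizer state across all ranks.
    # For real models, an auto_wrap_policy is usually passed as well, so that every
    # transformer block becomes its own shard unit instead of the whole model.
    model = FSDP(model, device_id=local_rank)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)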

Tensor Parallelism (TP) splits individual tensors (e.g., weight matrices) across GPUs, enabling fine-grained distribution of computation and memory. It scales well to a large number of GPUs, but requires synchronizing the resulting tensor shards after the affected operations, which adds communication overhead. TP is most effective for models with many large linear layers (e.g., LLMs), offering a balance between memory distribution and computational efficiency.
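
A minimal sketch of Megatron-style tensor parallelism using PyTorch's DTensor-based API, assuming 2 GPUs; the FeedForward module and its w1/w2 names are illustrative:

    import os
    import torch
    import torch.distributed as dist
    import torch.nn as nn
    from torch.distributed.device_mesh import init_device_mesh
    from torch.distributed.tensor.parallel import (
        ColwiseParallel, RowwiseParallel, parallelize_module,
    )

    # Launch with: torchrun --nproc_per_node=2 train_tp.py
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    dist.init_process_group("nccl")
    mesh = init_device_mesh("cuda", (2,))


    class FeedForward(nn.Module):
        def __init__(self, dim=1024, hidden=4096):
            super().__init__()
            self.w1 = nn.Linear(dim, hidden)
            self.w2 = nn.Linear(hidden, dim)

        def forward(self, x):
            return self.w2(torch.relu(self.w1(x)))


    # w1 is split column-wise and w2 row-wise: each GPU holds one shard of each
    # weight matrix, and an all-reduce after w2 recombines the partial results.
    model = parallelize_module(
        FeedForward().cuda(), mesh, {"w1": ColwiseParallel(), "w2": RowwiseParallel()}
    )
    torch.manual_seed(0)  # identical (replicated) input on every rank
    out = model(torch.randn(8, 1024, device="cuda"))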

Pipeline Parallelism (PP) divides the model's layers into stages, each processed by a different GPU, reducing the memory load per GPU and confining inter-GPU communication to the stage boundaries. While this keeps communication overhead low, it can introduce pipeline bubbles in which some GPUs sit idle, leading to inefficiencies. PP is well suited for deep models with sequential architectures (e.g., LLMs), though it requires careful scheduling to minimize the idle time.
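
A deliberately naive sketch of the idea on a single process with 2 GPUs; real pipeline implementations also interleave backward passes and usually run one process per stage:

    import torch
    import torch.nn as nn

    # Two pipeline stages: stage 0 holds the first half of the layers, stage 1 the
    # second half. Only the activations at the stage boundary move between GPUs.
    stage0 = nn.Sequential(*[nn.Linear(1024, 1024) for _ in range(4)]).to("cuda:0")
    stage1 = nn.Sequential(*[nn.Linear(1024, 1024) for _ in range(4)]).to("cuda:1")

    batch = torch.randn(64, 1024, device="cuda:0")
    # Splitting the batch into micro-batches lets the two stages overlap to some
    # extent; without micro-batches, one GPU is always idle while the other
    # computes (the pipeline "bubble").
    outputs = []
    for micro_batch in batch.chunk(4):
        hidden = stage0(micro_batch)
        outputs.append(stage1(hidden.to("cuda:1")))
    output = torch.cat(outputs)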

Choosing a model parallelism style involves considering model architecture, hardware interconnects, and training efficiency. In practice, hybrid approaches combining FSDP, TP, and PP are often used to leverage the strengths of each method while mitigating their weaknesses.


Get started
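
The following is a minimal sketch of enabling FSDP in Lightning Fabric; the model is a stand-in, and the exact configuration options depend on your model and Lightning version:

    import torch
    from lightning.fabric import Fabric
    from lightning.fabric.strategies import FSDPStrategy

    # Enabling model parallelism in Fabric is mostly a matter of choosing a strategy.
    fabric = Fabric(accelerator="cuda", devices=4, strategy=FSDPStrategy())
    fabric.launch()

    # A stand-in model; replace with your own nn.Module.
    model = torch.nn.Sequential(*[torch.nn.Linear(1024, 1024) for _ in range(8)])
    model = fabric.setup_module(model)  # wraps and shards the model across the 4 GPUs

    # Create the optimizer after the model is sharded so its state is sharded as well.
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    optimizer = fabric.setup_optimizers(optimizer)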


Parallelisms compared

Distributed Data Parallel (DDP)

  • ✅   No model code changes required
  • ✅   Training with very large batch sizes (batch size scales with number of GPUs)
  • ❗   Model (weights, optimizer state, activations / gradients) must fit into a GPU

Fully-Sharded Data Parallel (FSDP)

  • ✅   No model code changes required
  • ✅   Training with very large batch sizes (batch size scales with number of GPUs)
  • ✅   Model (weights, optimizer state, gradients) gets distributed across all GPUs
  • ❗   A single FSDP-wrapped layer, when gathered during the forward/backward pass, must fit in GPU memory
  • ❗   Requires some knowledge about model architecture to set configuration options correctly
  • ❗   Requires very fast networking for multi-node training; otherwise, data transfers between GPUs often become a bottleneck

Tensor Parallel (TP)

  • ❗   Model code changes required
  • 🤔   Fixed global batch size (does not scale with number of GPUs)
  • ✅   Model (weights, optimizer state, activations) gets distributed across all GPUs
  • ✅   Parallelizes the computation of layers that are too large to fit onto a single GPU
  • ❗   Requires lots of knowledge about model architecture to set configuration options correctly
  • 🤔   Fewer GPU data transfers required, but the transfers don't overlap with computation as they do in FSDP

2D Parallel (FSDP + TP)

  • ❗   Model code changes required
  • ✅   Training with very large batch sizes (batch size scales across data-parallel dimension)
  • ✅   Model (weights, optimizer state, activations) gets distributed across all GPUs
  • ✅   Parallelizes the computation of layers that are too large to fit onto a single GPU
  • ❗   Requires lots of knowledge about model architecture to set configuration options correctly
  • ✅   Tensor-parallel within machines and FSDP across machines reduces data transfer bottlenecks
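
A rough sketch of how such a 2D layout can be described with a PyTorch device mesh, assuming 2 machines with 4 GPUs each (the dimension names are illustrative):

    from torch.distributed.device_mesh import init_device_mesh

    # Launch with: torchrun --nnodes=2 --nproc_per_node=4 train_2d.py
    # 2 machines x 4 GPUs: FSDP shards across machines ("dp"), tensor parallelism
    # splits layers across the 4 GPUs inside each machine ("tp").
    mesh = init_device_mesh("cuda", (2, 4), mesh_dim_names=("dp", "tp"))

    dp_mesh = mesh["dp"]  # pass to FSDP via its device_mesh argument
    tp_mesh = mesh["tp"]  # pass to parallelize_module(...) for the tensor-parallel plan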

Lightning Fabric natively supports all of the parallelisms described above through PyTorch; pipeline parallelism (PP) is not yet supported.