Train models with billions of parameters

Audience: Users who want to train massive models of billions of parameters efficiently across multiple GPUs and machines.

Lightning provides optimized model-parallel training strategies for models with billions of parameters.

When NOT to use model-parallel strategies

Model-parallel techniques help when the model is fairly large; roughly 500M+ parameters is where we’ve seen benefits. For small models (for example, ResNet50 with around 25M parameters), where the weights, activations, optimizer states, and gradients all fit in GPU memory, you do not need a model-parallel strategy. Instead, use regular distributed data-parallel (DDP) training to scale your batch size and speed up training across multiple GPUs and machines. There are several DDP optimizations you can explore if memory and speed are a concern.
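
As a rough illustration of the "fits in GPU memory" check above, the sketch below estimates the memory taken by weights, gradients, and optimizer states. It assumes full fp32 training with an Adam-style optimizer (two extra states per parameter) and ignores activations, which depend on batch size and architecture. The function names and the 50% activation headroom are hypothetical choices for this example, not part of Lightning.

```python
def training_state_gib(num_params, bytes_per_value=4, optimizer_states_per_param=2):
    """Estimate GiB needed for weights + gradients + optimizer states.

    Assumes every value is stored at `bytes_per_value` bytes (4 = fp32) and the
    optimizer keeps `optimizer_states_per_param` extra values per parameter
    (2 for Adam-style optimizers). Activations are NOT included.
    """
    weights = num_params * bytes_per_value
    gradients = num_params * bytes_per_value
    optimizer = num_params * bytes_per_value * optimizer_states_per_param
    return (weights + gradients + optimizer) / 1024**3


def fits_on_one_gpu(num_params, gpu_mem_gib, activation_headroom=0.5):
    """Return True if the training state fits in the memory left after
    reserving a fraction of the GPU for activations (an arbitrary assumption)."""
    budget = gpu_mem_gib * (1.0 - activation_headroom)
    return training_state_gib(num_params) <= budget


if __name__ == "__main__":
    # A ResNet50-sized model versus a 10B-parameter model on an 80 GiB GPU.
    for name, n in [("~25M params", 25_600_000), ("10B params", 10_000_000_000)]:
        print(name, round(training_state_gib(n), 2), "GiB;",
              "fits" if fits_on_one_gpu(n, 80) else "does not fit")
```

If the estimate fits comfortably, plain DDP is usually enough; if it does not, that is a sign a model-parallel strategy is worth considering. Treat the numbers as a first approximation only, since mixed precision and different optimizers change the per-parameter cost.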

Choosing the right strategy for your use case

If you’ve determined that your model is large enough to need model parallelism, you have two training strategies to choose from: FSDP, the native solution built into PyTorch, or the popular third-party DeepSpeed library. Both have a very similar feature set and have been used to train some of the largest SOTA models in the world. Our recommendation is:

Use FSDP if you are new to model-parallel training, if you are migrating from PyTorch FSDP to Lightning, or if you are already familiar with DDP.
Use DeepSpeed if you know you will need cutting-edge features not present in FSDP, or if you are already familiar with DeepSpeed and are migrating to Lightning.

The table below points out a few important differences between the two.

Differences between FSDP and DeepSpeed
	FSDP	DeepSpeed
Dependencies	None	Requires the `deepspeed` package
Configuration options	Simpler and easier to get started	More comprehensive, allows finer control
Configuration	Via Trainer	Via Trainer or configuration file
Activation checkpointing	Yes	Yes, but requires changing the model code
Offload parameters	CPU	CPU or disk
Distributed checkpoints	Coming soon	Yes
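
The guidance above can be sketched as a small helper that picks a strategy name and builds Trainer arguments. This is a minimal sketch, not an official recipe: `pick_strategy` and `trainer_kwargs` are hypothetical helpers, the 500M threshold simply mirrors the rough figure quoted earlier, and the strings `"ddp"`, `"fsdp"`, and `"deepspeed_stage_3"` are assumed to be strategy names registered in your installed Lightning version (check its documentation to confirm).

```python
# Rough threshold taken from the guidance in this page; not a hard rule.
MODEL_PARALLEL_THRESHOLD = 500_000_000


def pick_strategy(num_params, needs_deepspeed_features=False):
    """Pick a strategy name following the recommendations in this page.

    Small models use DDP. Large models default to FSDP, unless a
    DeepSpeed-only feature (for example, offloading parameters to disk)
    is required.
    """
    if num_params < MODEL_PARALLEL_THRESHOLD:
        return "ddp"
    return "deepspeed_stage_3" if needs_deepspeed_features else "fsdp"


def trainer_kwargs(num_params, devices=8, num_nodes=1, needs_deepspeed_features=False):
    """Build keyword arguments intended for `lightning.Trainer` (hypothetical helper)."""
    return {
        "accelerator": "gpu",
        "devices": devices,
        "num_nodes": num_nodes,
        "strategy": pick_strategy(num_params, needs_deepspeed_features),
    }


# Usage sketch (requires Lightning, GPUs, and your own LightningModule `model`):
#   import lightning as L
#   trainer = L.Trainer(**trainer_kwargs(num_params=7_000_000_000))
#   trainer.fit(model)
```

Keeping the choice in one place makes it easy to switch strategies later; the DeepSpeed option additionally requires the `deepspeed` package to be installed.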

Get started

Once you’ve chosen the right strategy for your use case, follow the full guide below to get started.

FSDP

Distribute models with billions of parameters across hundreds of GPUs with FSDP

advanced

DeepSpeed

Distribute models with billions of parameters across hundreds of GPUs with DeepSpeed

advanced

Third-party strategies

Cutting-edge Lightning strategies are being developed by third parties outside of Lightning. If you want to try some of the latest features for model-parallel training, check out these strategies.