Train models with billions of parameters

Audience: Users who want to train massive models with billions of parameters efficiently across multiple GPUs and machines.

Lightning provides advanced, optimized model-parallel training strategies for massive models with billions of parameters.

When NOT to use model-parallel strategies

Model-parallel techniques help when models are fairly large; we have seen benefits starting at roughly 500M parameters. For small models (for example, ResNet50 with around 25M parameters), where the weights, activations, optimizer states, and gradients all fit in GPU memory, you do not need a model-parallel strategy. Instead, use regular distributed data-parallel (DDP) training to scale your batch size and speed up training across multiple GPUs and machines. If memory and speed are still a concern, there are several DDP optimizations you can explore.
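As a rough illustration of the "fits in GPU memory" check described above, the sketch below estimates the memory taken by weights, gradients, and optimizer states. The function names are hypothetical, and the numbers are assumptions: 16 bytes per parameter corresponds to fp32 weights (4), fp32 gradients (4), and two fp32 Adam moments (8), and half of the GPU is reserved for activations, which the estimate otherwise ignores.

```python
def training_state_gib(num_params: int, bytes_per_param: int = 16) -> float:
    """Rough memory (GiB) for weights + gradients + Adam states.

    Assumption: 16 bytes/param = fp32 weights (4) + fp32 gradients (4)
    + two fp32 Adam moments (8). Activations are NOT included.
    """
    return num_params * bytes_per_param / 2**30


def fits_in_gpu(num_params: int, gpu_mem_gib: float, headroom: float = 0.5) -> bool:
    """True if the training state fits in the share of GPU memory not
    reserved for activations (headroom is an assumed fraction)."""
    return training_state_gib(num_params) <= gpu_mem_gib * headroom


# Illustrative model sizes only; they are not measurements.
for label, n in [("100M", 100_000_000), ("500M", 500_000_000), ("7B", 7_000_000_000)]:
    verdict = "DDP is likely enough" if fits_in_gpu(n, gpu_mem_gib=24.0) else "consider model parallelism"
    print(f"{label}: ~{training_state_gib(n):.1f} GiB of training state -> {verdict}")
```

Treat the result as a first guess: real memory use depends on precision, optimizer, batch size, and activation sizes.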

Choosing the right strategy for your use case

If you’ve determined that your model is large enough to need model parallelism, you have two training strategies to choose from: FSDP, the native solution built into PyTorch, or the popular third-party DeepSpeed library. Both have a very similar feature set and have been used to train some of the largest state-of-the-art models. Our recommendation is:

  • Use FSDP if you are new to model-parallel training, if you are migrating from PyTorch FSDP to Lightning, or if you are already familiar with DDP.

  • Use DeepSpeed if you know you will need cutting-edge features not present in FSDP, or if you are already familiar with DeepSpeed and are migrating to Lightning.
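The recommendation above can be sketched as a small helper that returns a strategy name for the Trainer. This is only an illustration: `pick_strategy` and `build_trainer` are hypothetical names, the 500M cutoff is the rough figure quoted earlier rather than a hard rule, and choosing the `"deepspeed_stage_3"` alias (the DeepSpeed stage that shards parameters) is an assumption about what you want from DeepSpeed.

```python
MODEL_PARALLEL_THRESHOLD = 500_000_000  # rough guidance, not a hard limit


def pick_strategy(num_params: int, prefer_deepspeed: bool = False) -> str:
    """Return a Lightning strategy alias based on model size and preference."""
    if num_params < MODEL_PARALLEL_THRESHOLD:
        return "ddp"  # small model: plain data-parallel training
    return "deepspeed_stage_3" if prefer_deepspeed else "fsdp"


def build_trainer(num_params: int, devices: int = 4, num_nodes: int = 1,
                  prefer_deepspeed: bool = False):
    """Create a Trainer with the chosen strategy (requires GPUs)."""
    # Imported lazily so pick_strategy() can be used without Lightning installed.
    import lightning as L

    return L.Trainer(
        accelerator="gpu",
        devices=devices,
        num_nodes=num_nodes,
        strategy=pick_strategy(num_params, prefer_deepspeed),
    )


print(pick_strategy(100_000_000))
print(pick_strategy(7_000_000_000))
print(pick_strategy(7_000_000_000, prefer_deepspeed=True))
```

The DeepSpeed path additionally needs the deepspeed package installed.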

The table below points out a few important differences between the two.

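Differences between FSDP and DeepSpeed

                          FSDP                               DeepSpeed
------------------------  ---------------------------------  -----------------------------------------
Dependencies              None (built into PyTorch)          Requires the deepspeed package
Configuration options     Simpler and easier to get started  More comprehensive, allows finer control
Configuration             Via Trainer                        Via Trainer or configuration file
Activation checkpointing  Yes                                Yes, but requires changing the model code
Offload parameters        CPU                                CPU or disk
Distributed checkpoints   Coming soon                        Yes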
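To make the "Via Trainer" and "Offload parameters" rows more concrete, here is a minimal sketch of configuring either strategy as an object and enabling CPU offload. The helper names are hypothetical; the keyword arguments shown (`cpu_offload` for `FSDPStrategy`; `stage`, `offload_optimizer`, and `offload_parameters` for `DeepSpeedStrategy`) are taken from Lightning 2.x and should be checked against your installed version. Disk offload for DeepSpeed is not shown.

```python
def strategy_kwargs(library: str, offload: bool = False) -> dict:
    """Constructor arguments for the chosen strategy class."""
    if library == "fsdp":
        return {"cpu_offload": offload}
    if library == "deepspeed":
        # Stage 3 shards parameters; offload moves optimizer state and
        # parameters to CPU memory.
        return {"stage": 3, "offload_optimizer": offload, "offload_parameters": offload}
    raise ValueError(f"unknown library: {library!r}")


def make_strategy(library: str, offload: bool = False):
    """Instantiate the strategy object to pass as Trainer(strategy=...)."""
    kwargs = strategy_kwargs(library, offload)
    # Lazy imports: Lightning (and deepspeed, for the second branch) are only
    # needed when a strategy object is actually created.
    if library == "fsdp":
        from lightning.pytorch.strategies import FSDPStrategy
        return FSDPStrategy(**kwargs)
    from lightning.pytorch.strategies import DeepSpeedStrategy
    return DeepSpeedStrategy(**kwargs)


print(strategy_kwargs("fsdp", offload=True))
print(strategy_kwargs("deepspeed", offload=True))
```

A typical use would be `Trainer(accelerator="gpu", devices=4, strategy=make_strategy("fsdp", offload=True))`.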


Get started

Once you’ve chosen the right strategy for your use case, follow the full guide below to get started.

Third-party strategies

Cutting-edge Lightning strategies are also being developed by third parties outside of Lightning. If you want to try some of the latest features for model-parallel training, check out these strategies.