Train models with billions of parameters
Audience: Users who want to train massive models of billions of parameters efficiently across multiple GPUs and machines.
Lightning provides advanced and optimized model-parallel training strategies to support massive models of billions of parameters. Check out this amazing video for an introduction to model parallelism and its benefits:
When NOT to use model-parallel strategies
Model parallel techniques help when model sizes are fairly large; roughly 500M+ parameters is where we’ve seen benefits. For small models (for example ResNet50 of around 80M Parameters) where the weights, activations, optimizer states and gradients all fit in GPU memory, you do not need to use a model-parallel strategy. Instead, use regular distributed data-parallel (DDP) training to scale your batch size and speed up training across multiple GPUs and machines. There are several DDP optimizations you can explore if memory and speed are a concern.
Choosing the right strategy for your use case
If you’ve determined that your model is large enough that you need to leverage model parallelism, you have two training strategies to choose from: FSDP, the native solution that comes built-in with PyTorch, or the popular third-party DeepSpeed library. Both have a very similar feature set and have been used to train the largest SOTA models in the world. Our recommendation is
Use FSDP if you are new to model-parallel training, if you are migrating from PyTorch FSDP to Lightning, or if you are already familiar with DDP.
Use DeepSpeed if you know you will need cutting edge features not present in FSDP, or you are already familiar with DeepSpeed and are migrating to Lightning.
The table below points out a few important differences between the two.
Get started
Once you’ve chosen the right strategy for your use case, follow the full guide below to get started.
Third-party strategies
Cutting-edge Lightning strategies are being developed by third-parties outside of Lightning. If you want to try some of the latest and greatest features for model-parallel training, check out these strategies.