

Using DeepSpeed to Optimize Models

DeepSpeed is an open-source optimization library for PyTorch that accelerates the training and inference of deep learning models. It was designed by Microsoft to address the challenges faced by companies and developers seeking to leverage large models, such as memory constraints and slow training times, and to improve the overall performance and efficiency of deep learning workflows. In this blog, we discuss various techniques that you can use to get the most out of your deep learning models.

 

Model Pruning

Reducing the size and complexity of a trained model without sacrificing accuracy.

One of the simplest and most effective techniques for optimizing your deep learning models is model pruning: removing weights or connections that don’t contribute significantly to the model’s output. Pruning shrinks the model, making it faster to train and to run at inference time. It can also be integrated directly into training by applying constraints, such as sparsity penalties, that encourage the model to drive unimportant weights toward zero.

DeepSpeed supports several pruning techniques, including weight pruning, structured pruning, and layer pruning, which can be applied to any model architecture.
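To make the idea concrete, here is a minimal sketch of magnitude-based weight pruning. It uses PyTorch’s built-in torch.nn.utils.prune utilities as a framework-level illustration rather than a DeepSpeed-specific API, and the toy model and 30% sparsity target are placeholder choices.

```python
# A minimal sketch of weight pruning using PyTorch's built-in utilities.
# The model and the 30% sparsity target are arbitrary choices for illustration.
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Linear(256, 10),
)

# Zero out the 30% of weights with the smallest L1 magnitude in each Linear layer.
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.3)
        prune.remove(module, "weight")  # make the pruning permanent

# The pruned weights are now exactly zero.
sparsity = (model[0].weight == 0).float().mean().item()
print(f"Sparsity of first layer: {sparsity:.2%}")
```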

 

Micro-Batching

Splitting training batches into smaller micro-batches which are then processed individually.

Another technique you can use to optimize your deep learning models is micro-batching, which involves splitting each training batch into smaller micro-batches rather than pushing it through the model all at once. The model processes the micro-batches one at a time, accumulating their gradients before updating its weights, and repeats this until the whole dataset has been processed. The primary benefit is that you can train with a large effective batch size even when memory or GPU constraints limit how much data fits on a device at once.

In DeepSpeed, micro-batching can be enabled with a simple configuration change, allowing you to easily control the batch size and optimize performance.
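As a rough illustration, here is a minimal sketch of a DeepSpeed configuration controlling the micro-batch size. The stand-in model, the batch sizes, and the learning rate are placeholder values; the effective batch size has to equal the micro-batch size multiplied by the gradient accumulation steps and the number of GPUs.

```python
# A minimal sketch of enabling micro-batching through the DeepSpeed config.
# Placeholder values: train_batch_size must equal
# train_micro_batch_size_per_gpu * gradient_accumulation_steps * number of GPUs.
import deepspeed
import torch.nn as nn

model = nn.Linear(784, 10)  # stand-in for your real model

ds_config = {
    "train_batch_size": 64,               # effective (global) batch size
    "train_micro_batch_size_per_gpu": 8,  # what each GPU processes per forward/backward pass
    "gradient_accumulation_steps": 8,     # micro-batches accumulated before each weight update
    "optimizer": {"type": "Adam", "params": {"lr": 1e-3}},
}

model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```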

 

Model Parallelism

Distributing a model across multiple GPUs or machines.

Model parallelism lets you train models that are too large for a single device by splitting the model itself across multiple GPUs or machines. The different portions of the model are processed in parallel, which allows large models to be trained faster than they could be on a single machine. Those portions communicate with each other through a communication layer (usually a high-speed network or interconnect) in order to coordinate their computations. This technique becomes essential when the size of your model exceeds the memory capacity of a single device.

DeepSpeed supports model parallelism, making it easy to distribute a model across multiple GPUs or machines while also reducing the memory footprint on each device.
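One way DeepSpeed exposes model parallelism is pipeline parallelism. Below is a minimal sketch, assuming a toy feed-forward model split into two stages; in practice the layer list, stage count, and batch sizes would match your own architecture and hardware, and the script would be started with the deepspeed command-line launcher so each stage lands on its own GPU.

```python
# A minimal sketch of pipeline-style model parallelism with DeepSpeed.
# The layers and the two-stage split are illustrative placeholders.
import deepspeed
import torch.nn as nn
from deepspeed.pipe import PipelineModule

deepspeed.init_distributed()  # set up the process group before building the pipeline

layers = [
    nn.Linear(784, 512), nn.ReLU(),
    nn.Linear(512, 256), nn.ReLU(),
    nn.Linear(256, 10),
]

# Split the layer list into two pipeline stages, each placed on its own GPU.
model = PipelineModule(layers=layers, num_stages=2)

ds_config = {
    "train_batch_size": 32,
    "train_micro_batch_size_per_gpu": 4,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-3}},
}

engine, _, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
# Training then proceeds via engine.train_batch(...), which schedules the stages.
```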

 

Zero Redundancy Optimizer (ZeRO)

ZeRO is an optimization technique introduced by DeepSpeed that eliminates redundant training state across GPUs.

The algorithm reduces memory consumption and improves performance by eliminating redundant storage of training state. Instead of every GPU holding a full copy of the optimizer states, gradients, and parameters, ZeRO partitions them across the participating devices and gathers them only when they are needed.

It is usually described in three stages: stage 1 partitions the optimizer states, stage 2 additionally partitions the gradients, and stage 3 additionally partitions the model parameters themselves. Each stage trades a little extra communication for progressively larger memory savings, and gradients are still accumulated locally before being synchronized.

ZeRO works with DeepSpeed’s micro-batching and model parallelism features.
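Here is a minimal sketch of the configuration fragment that turns ZeRO on, passed to deepspeed.initialize like the earlier examples. The choice of stage 2, the CPU offload setting, and the batch sizes are illustrative assumptions rather than recommendations.

```python
# A minimal sketch of a DeepSpeed config that enables ZeRO.
# Stage 2 partitions optimizer states and gradients; stage 3 also partitions parameters.
ds_config = {
    "train_batch_size": 64,
    "train_micro_batch_size_per_gpu": 8,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-3}},
    "zero_optimization": {
        "stage": 2,
        "overlap_comm": True,                    # overlap gradient communication with compute
        "offload_optimizer": {"device": "cpu"},  # optionally push optimizer state to CPU memory
    },
}
```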

 

Distributed Data Parallelism

This technique parallelizes the computation step of large model training across multiple machines.

DeepSpeed also supports Distributed Data Parallelism (DDP), which splits computation across multiple GPUs or machines, allowing the model to be trained on larger datasets. With DDP, the input data is distributed across several devices and processed in parallel, reducing training time. The gradients computed by each device are averaged and used to update the model’s weights, and this repeats until the model has been trained on the entire dataset. DDP is distinct from model parallelism: each device holds a full copy of the model, which makes it simpler to implement and easy to scale to a large number of machines, as long as that copy fits in a single device’s memory.

You can enable DDP within DeepSpeed with a simple configuration change.
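Below is a minimal sketch of a data-parallel training loop. The model and the random dataset are placeholders, and it assumes the script is started across several GPUs (for example with deepspeed --num_gpus=4 train.py) so DeepSpeed can shard the data and average gradients across processes.

```python
# A minimal sketch of data-parallel training with DeepSpeed. Launched across
# several GPUs, each process sees a shard of the data and gradients are
# averaged automatically. The model and dataset are placeholders.
import deepspeed
import torch
import torch.nn as nn
from torch.utils.data import TensorDataset

model = nn.Linear(784, 10)
dataset = TensorDataset(torch.randn(1024, 784), torch.randint(0, 10, (1024,)))

ds_config = {
    "train_batch_size": 64,
    "train_micro_batch_size_per_gpu": 16,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-3}},
}

engine, _, dataloader, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    training_data=dataset,
    config=ds_config,
)

loss_fn = nn.CrossEntropyLoss()
for inputs, targets in dataloader:
    inputs, targets = inputs.to(engine.device), targets.to(engine.device)
    loss = loss_fn(engine(inputs), targets)
    engine.backward(loss)  # gradients are averaged across data-parallel processes
    engine.step()
```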

 

Hybrid Parallelism

A combination of model parallelism and DDP.

DeepSpeed also supports hybrid parallelism, which combines model parallelism and data parallelism, allowing very large models to be trained on large datasets. As in model parallelism, the model is partitioned across multiple devices and its portions are processed in parallel; as in DDP, that partitioned model is then replicated, with each replica training on a different shard of the data. By combining the two techniques you leverage the benefits of both in tandem, improving scalability, efficiency, and speed.

You can enable hybrid parallelism within DeepSpeed with a simple configuration change.
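As a rough sketch, one way to get hybrid parallelism with DeepSpeed is to define a pipeline-parallel model and launch it on more GPUs than it has stages, so the pipeline can be replicated and data parallelism applied across the replicas. The layers and the two-stage split below are placeholders.

```python
# A minimal sketch of hybrid parallelism: a two-stage pipeline launched on more
# GPUs than it has stages. For example, on 8 GPUs a 2-stage pipeline leaves a
# data-parallel degree of 4 (four replicas of the pipeline). Layer sizes are placeholders.
import deepspeed
import torch.nn as nn
from deepspeed.pipe import PipelineModule

deepspeed.init_distributed()

layers = [
    nn.Linear(784, 512), nn.ReLU(),
    nn.Linear(512, 256), nn.ReLU(),
    nn.Linear(256, 10),
]

# Model parallelism: split the layers into 2 pipeline stages.
# Data parallelism: the remaining GPUs hold additional replicas of the pipeline.
model = PipelineModule(layers=layers, num_stages=2)
# Then pass `model` to deepspeed.initialize() exactly as in the earlier examples.
```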

 

Automatic Mixed Precision Training

This technique uses numerical representations with reduced precision for certain portions of the training process.

Using lower-precision data types for parts of the computation can significantly reduce your model’s memory footprint and speed up training. The idea is to use high precision only for operations that actually require it, and lower precision everywhere else. For example, a master copy of the model parameters might be kept in 32-bit floating point, while the forward and backward passes, including activations and gradients, run in 16-bit.

With DeepSpeed, automatic mixed precision training can be enabled with a simple configuration change.
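Here is a minimal sketch of the configuration fragment that turns on mixed precision training; the loss-scaling values are placeholder assumptions.

```python
# A minimal sketch of a DeepSpeed config enabling automatic mixed precision.
ds_config = {
    "train_batch_size": 64,
    "train_micro_batch_size_per_gpu": 8,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-3}},
    "fp16": {
        "enabled": True,            # run forward/backward in 16-bit, keep master weights in 32-bit
        "loss_scale": 0,            # 0 requests dynamic loss scaling
        "initial_scale_power": 16,  # starting point for the dynamic loss scale
    },
}
# On hardware with bfloat16 support, "bf16": {"enabled": True} can be used instead.
```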

 

Wrap up

DeepSpeed is a powerful optimization library that can help you get the most out of your deep learning models. Introducing any of these techniques, however, can complicate your training process and add overhead to your work.

A great place to discuss these tradeoffs and explore the limitations of DeepSpeed (and other optimization libraries like it) is on Lightning’s community Discord.

Join the Lightning Discord!