
Unit 9.2 Multi-GPU Training Strategies

References

What we covered in this video lecture

In this lecture, we explored training on multiple GPUs (Graphics Processing Units) and the benefits and strategies this offers for large-scale machine learning tasks. A key point is that GPUs have a much higher number of cores than CPUs, which makes them well suited for parallel computation and therefore ideal for training machine learning models.

We then delved into the various categories of parallelism that harness the power of multiple GPUs: data parallelism, model parallelism, pipeline parallelism, tensor parallelism, and sequence parallelism.

For example, data parallelism involves distributing different subsets of the training data across multiple GPUs and then aggregating the gradients for the model update. Model parallelism splits the model itself across GPUs, where each GPU computes a part of the forward and backward pass. Tensor parallelism, on the other hand, is a more recent approach that splits the model’s tensors across multiple GPUs to handle extremely large models that don’t fit into a single GPU’s memory.
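
To make the data-parallel case concrete, here is a minimal sketch of distributed data-parallel (DDP) training with PyTorch Lightning. The tiny LitRegressor model, the random dataset, and the choice of four devices are illustrative assumptions; the Trainer arguments (accelerator, devices, strategy) are standard Lightning settings.

    import torch
    import torch.nn.functional as F
    from torch.utils.data import DataLoader, TensorDataset
    import lightning as L


    class LitRegressor(L.LightningModule):
        """Placeholder model used only to illustrate the Trainer settings."""

        def __init__(self):
            super().__init__()
            self.layer = torch.nn.Linear(16, 1)

        def training_step(self, batch, batch_idx):
            x, y = batch
            return F.mse_loss(self.layer(x), y)

        def configure_optimizers(self):
            return torch.optim.SGD(self.parameters(), lr=0.01)


    if __name__ == "__main__":
        dataset = TensorDataset(torch.randn(256, 16), torch.randn(256, 1))
        loader = DataLoader(dataset, batch_size=32)

        # With strategy="ddp", each GPU receives a different shard of each batch,
        # computes gradients locally, and the gradients are averaged (all-reduced)
        # across GPUs before the optimizer step.
        trainer = L.Trainer(accelerator="gpu", devices=4, strategy="ddp", max_epochs=1)
        trainer.fit(model=LitRegressor(), train_dataloaders=loader)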

These techniques, used in tandem or in isolation, make better use of computational resources, speed up training, and enable larger models and datasets, making multi-GPU training a key aspect of modern machine learning infrastructure.
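
As a rough illustration of the tensor-parallel idea, the sketch below splits the weight matrix of a single linear layer column-wise across two GPUs (the devices "cuda:0" and "cuda:1" are assumed to exist); each GPU computes only its share of the output neurons, and the partial results are then concatenated. Real tensor-parallel implementations also handle the backward pass and inter-GPU communication efficiently; this is only meant to convey the concept.

    import torch

    in_features, out_features = 8, 6
    x = torch.randn(4, in_features)

    # Full weight matrix, split into two row blocks of the (out, in) matrix,
    # i.e., each GPU owns half of the output neurons of this layer.
    weight = torch.randn(out_features, in_features)
    w0 = weight[: out_features // 2].to("cuda:0")  # first half of the output neurons
    w1 = weight[out_features // 2 :].to("cuda:1")  # second half of the output neurons

    # Each GPU computes only its subset of the layer's output neurons.
    y0 = x.to("cuda:0") @ w0.t()
    y1 = x.to("cuda:1") @ w1.t()

    # Gather the partial outputs to form the full layer output.
    y = torch.cat([y0.cpu(), y1.cpu()], dim=1)
    assert y.shape == (4, out_features)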

Additional resources if you want to learn more

There are several guides on the PyTorch Lightning documentation website that I highly recommend reading for more advanced use cases.


Quiz: 9.2 Multi-GPU Training Strategies (Part 1)

Which multi-GPU strategy is recommended when using accelerator="mps" on Apple devices instead of accelerator="gpu"?

  • strategy="ddp"
    Incorrect. Hint: Apple Silicon computers currently don’t have multiple GPUs, so there is no strategy to select for multi-GPU training when using Apple devices at the moment.

  • strategy="ddp_spawn"
    Incorrect. Hint: Apple Silicon computers currently don’t have multiple GPUs, so there is no strategy to select for multi-GPU training when using Apple devices at the moment.

  • strategy="ddp_notebook"
    Incorrect. Hint: Apple Silicon computers currently don’t have multiple GPUs, so there is no strategy to select for multi-GPU training when using Apple devices at the moment.

Correct answer: None. This was a trick question, because Apple Silicon computers currently don’t have multiple GPUs, so there is no strategy to select for multi-GPU training when using Apple devices at the moment.
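
As a follow-up to the answer above: on Apple Silicon there is a single MPS device, so no multi-GPU strategy is passed to the Trainer at all. A minimal sketch, assuming a recent PyTorch Lightning version and reusing the placeholder LitRegressor and loader from the data-parallelism sketch earlier in this section:

    import lightning as L

    # Training on Apple Silicon uses the single MPS device, so no multi-GPU
    # strategy (ddp, ddp_spawn, ddp_notebook, ...) is specified.
    trainer = L.Trainer(accelerator="mps", devices=1, max_epochs=1)
    # trainer.fit(LitRegressor(), train_dataloaders=loader)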


Quiz: 9.2 Multi-GPU Training Strategies (Part 2)

In which type of parallelism is each layer of the model computed across all GPUs, but each GPU only computes a subset of the neurons in each layer?

  • Model parallelism
    Incorrect. Here, each GPU only has a subset of the total number of layers.

  • Tensor parallelism
    Correct. Here, the computation of each layer is split across multiple GPUs.
