Unit 9.3 Deep Dive Into Data Parallelism

In today’s lecture, we focused on the concept of data parallelism and its extension, distributed data parallelism, both essential strategies for accelerating machine learning training using multiple computational resources.

Data parallelism is a technique in which each batch of training data is split into subsets, and each subset is processed independently on a separate GPU or computing node that holds its own copy of the model. Because the subsets are processed simultaneously, training time can drop significantly. The gradients computed on each subset are then aggregated (typically averaged) to update the model parameters. However, on a single machine with multiple GPUs, this method can be limited by the inter-GPU communication speed and the machine’s memory capacity.
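The split-compute-aggregate pattern described above can be illustrated with a small sketch in plain Python. It is a simulation only, not how any framework implements it: the "devices" are just list slices, and the toy model `y = w * x`, the helper names (`grad_mse`, `split_evenly`, `data_parallel_grad`), and the data are all assumptions made for illustration. It shows that averaging the per-shard gradients gives the same result as computing the gradient on the full batch, provided the shards are equal in size.

```python
# Toy model: y = w * x, trained with mean squared error (an illustrative assumption).

def grad_mse(w, xs, ys):
    """Gradient of mean((w * x - y)^2) with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def split_evenly(items, num_shards):
    """Split a list into num_shards contiguous shards of equal size."""
    assert len(items) % num_shards == 0, "batch must divide evenly across devices"
    size = len(items) // num_shards
    return [items[i * size:(i + 1) * size] for i in range(num_shards)]

def data_parallel_grad(w, xs, ys, num_devices):
    """Simulate one data-parallel step: shard the batch, compute local
    gradients with the same parameter value, then average them."""
    x_shards = split_evenly(xs, num_devices)
    y_shards = split_evenly(ys, num_devices)
    # On real hardware each of these calls would run on a different GPU.
    local_grads = [grad_mse(w, xk, yk) for xk, yk in zip(x_shards, y_shards)]
    return sum(local_grads) / num_devices

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0 * x for x in xs]
w = 0.5

full_batch_grad = grad_mse(w, xs, ys)
parallel_grad = data_parallel_grad(w, xs, ys, num_devices=4)
print(full_batch_grad, parallel_grad)
```

The two printed values agree because the mean over the full batch equals the mean of the per-shard means when every shard has the same number of samples. With unequal shards, a weighted average would be needed.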

To overcome the limitations of regular data parallelism in PyTorch, we discussed distributed data parallelism, an extension of data parallelism that can span multiple machines, each with one or more GPUs. In distributed data parallelism, the same model is replicated on every participating GPU, and each replica processes a different subset of the training data. This makes it easier to handle larger datasets and improves training speed by taking advantage of the collective memory and computational power of multiple machines. Because every replica still holds a full copy of the model, the model itself must fit in the memory of a single GPU.
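The replicate, shard, and synchronize cycle can be sketched as a single-process simulation in plain Python. This is not PyTorch code. The names `shard_indices`, `all_reduce_mean`, and `train` are hypothetical stand-ins for the roles that a distributed sampler and a gradient all-reduce play in a real setup, and the toy model `y = w * x` is assumed for illustration. Each simulated worker keeps its own copy of the parameter, sees only its own slice of the data, and applies the same averaged gradient.

```python
def shard_indices(num_samples, rank, world_size):
    """Indices seen by one worker: rank, rank + world_size, and so on."""
    return list(range(rank, num_samples, world_size))

def all_reduce_mean(values):
    """Stand-in for an all-reduce: every worker receives the same mean."""
    mean = sum(values) / len(values)
    return [mean for _ in values]

def local_grad(w, xs, ys):
    """Gradient of mean((w * x - y)^2) with respect to w on one worker's shard."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def train(xs, ys, world_size, steps, lr):
    weights = [0.0] * world_size  # one replica per worker, identical initial value
    shards = [shard_indices(len(xs), r, world_size) for r in range(world_size)]
    for _ in range(steps):
        # Each worker computes a gradient using only its own shard.
        grads = [
            local_grad(weights[r], [xs[i] for i in shards[r]], [ys[i] for i in shards[r]])
            for r in range(world_size)
        ]
        # Synchronize: every worker ends up with the same averaged gradient.
        synced = all_reduce_mean(grads)
        # Identical updates keep all replicas in lockstep.
        weights = [w - lr * g for w, g in zip(weights, synced)]
    return weights, shards

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [3.0 * x for x in xs]
weights, shards = train(xs, ys, world_size=4, steps=50, lr=0.01)
print(weights)
```

Because all replicas start from the same value and apply the same synchronized gradient at every step, they never diverge, which is why only gradients need to be communicated rather than full parameter copies. In a real multi-machine run, each worker would be a separate process and the averaging would happen over the network.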