Multiple GPUs, Now for Notebooks
TL;DR: This tutorial covers the newly enabled multi-GPU support for notebooks in the Lightning framework.
Whether you like to prototype models quickly in Jupyter notebooks, Kaggle, or Google Colab, Lightning’s got you covered. With the release of 1.7, notebook users get to try a shiny new strategy that provides a multi-GPU experience similar to running a regular script. If you’ve been using the various accelerators and strategies in Lightning outside of Jupyter notebooks, you know that scaling to multiple devices can make a big difference in training time. Naturally, when you’re prototyping in Jupyter, you’d like to get that same experience. And now, you can!
This is how you use multiple GPUs:
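A minimal sketch, assuming Lightning 1.7 or later and a machine with at least two GPUs. The names `trainer_kwargs` and `make_trainer` are chosen for this illustration and are not part of Lightning; `model` stands for your own LightningModule.

```python
# Arguments for the Trainer: request the GPU accelerator and two devices.
# No strategy is passed here; it is left for Lightning to choose.
trainer_kwargs = {"accelerator": "gpu", "devices": 2}


def make_trainer():
    """Build a Trainer from the arguments above (illustrative helper)."""
    # Imported inside the function so that defining this cell does not
    # require Lightning to be loaded yet.
    from pytorch_lightning import Trainer

    return Trainer(**trainer_kwargs)


# In a notebook cell, on a machine with two or more GPUs:
# trainer = make_trainer()
# trainer.fit(model)  # `model` is your own LightningModule (hypothetical here)
```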
Lightning can detect whether you are in an interactive environment and will automatically select the DDP Notebook strategy when you request more than one device 🤯.
So, you get to make better use of your hardware and scale your models in Jupyter notebooks. Is that all?
Not quite. There are actually several more reasons to get really excited about this feature. 😎
· · ·
Everything is better with TorchMetrics
Until now, the only supported multi-GPU strategy in Jupyter was the DataParallel strategy (strategy="dp"). However, DP has a load of limitations that often obstruct users and their workflows:
- DP is slower than DDP, which is why we always recommend strategy="ddp" in Lightning as the go-to strategy for multi-GPU due to its reliability and speed.
- DP only supports primitive data types as input because it has to be able to “split” the batch and send each portion to the corresponding GPU. This is problematic for inputs like graphs in graph neural networks (GNNs). DDP doesn’t have this limitation.
- Perhaps most importantly, **TorchMetrics is not supported with DP**. This is again a limitation in the design of DP, as any updates to state other than parameters during forward get lost. When dealing with metrics computations in multi-device setups, TorchMetrics is an invaluable tool because it provides an automatic synchronization that guarantees correctness in the way the metric is reduced. Not having access to TorchMetrics and its tight integration with Lightning is a hard pill to swallow, and another reason why DP should be avoided in general.
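To see why the way a metric is reduced across devices matters, here is a small pure-Python illustration. It is not TorchMetrics code, and the numbers are made up for the example. It compares averaging per-device accuracies with first summing the underlying state (correct and total counts) across devices, which is the kind of state synchronization the paragraph above refers to.

```python
# Hypothetical per-device results for one validation epoch:
# (correct predictions, samples seen). The devices saw unequal sample counts.
per_device = [(90, 100), (5, 20)]


def mean_of_accuracies(stats):
    """Naive reduction: average the accuracy computed on each device."""
    return sum(correct / total for correct, total in stats) / len(stats)


def accuracy_from_synced_state(stats):
    """Sum the state across devices first, then compute the metric once."""
    correct = sum(c for c, _ in stats)
    total = sum(t for _, t in stats)
    return correct / total


naive = mean_of_accuracies(per_device)           # (0.90 + 0.25) / 2
synced = accuracy_from_synced_state(per_device)  # 95 / 120
print(f"naive={naive:.4f} synced={synced:.4f}")
```

The two values differ whenever devices see different numbers of samples, which is why reducing the metric state rather than the per-device result gives the correct global value.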
If you’ve moved away from Jupyter notebooks or had to live without multi-GPU because of these limitations, then you’ll hopefully be pleased by these changes and enjoy the benefits of DDP regardless of the environment you’re in.
· · ·
With the ability to use DDP in your notebooks comes the additional benefit of better code portability. When you are done prototyping, it is now much easier to convert your code into a “production ready” script for large scale training, for example on your cluster, in the cloud, or even a Lightning App.
· · ·
Lightning enables DDP in notebooks through process forking, and we do this simply because it is currently the only way multiprocessing can be supported in interactive environments. Forking, however, is not just limited to notebooks, and in fact you can also use it in a regular script like this:
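A minimal sketch, assuming Lightning 1.7 or later, where the fork-based variant can be requested by name with `strategy="ddp_fork"`. The helper names `fork_trainer_kwargs` and `make_fork_trainer` are illustrative only and not part of Lightning.

```python
def fork_trainer_kwargs(num_gpus=2):
    """Trainer arguments that explicitly request fork-based DDP."""
    return {"accelerator": "gpu", "devices": num_gpus, "strategy": "ddp_fork"}


def make_fork_trainer(num_gpus=2):
    """Build the Trainer (illustrative helper; needs Lightning and GPUs)."""
    # Imported lazily so the script can be loaded without Lightning present.
    from pytorch_lightning import Trainer

    return Trainer(**fork_trainer_kwargs(num_gpus))


# In your training script:
# trainer = make_fork_trainer(num_gpus=2)
# trainer.fit(model)  # `model` is your own LightningModule (hypothetical here)
```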
With forking, you get the additional benefit of easy memory sharing between processes. This is useful in scenarios where you have large data structures (like graphs) that you need fast read access to but that only fit into CPU memory (because they’re too big for GPU memory).
Normally, when new processes are spawned, each child ends up with its own full copy of the data held by its parent. This is redundant and can lead to out-of-memory (OOM) errors when your entire dataset resides in CPU memory.
With forking, the child processes inherit the memory through something called copy-on-write, which means that the memory remains shared as long as the children only read from it. Thus, when using large in-memory datasets you get an immediate memory saving plus faster startup time for the child processes.
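The read-only sharing described above can be tried out with Python’s standard `multiprocessing` module alone. This is a sketch of the general mechanism, not of how Lightning launches its processes, and the names `DATA`, `_worker`, and `forked_sum` are made up for the example. The parent builds an array once, and each forked child reads its slice directly without the data being pickled or sent to it.

```python
import itertools
import multiprocessing as mp
from array import array

# Large read-only structure built once in the parent process.
DATA = array("q", range(200_000))


def _worker(start, stop, out):
    # The child reads the inherited DATA directly; only the small
    # integer result is sent back to the parent.
    out.put(sum(itertools.islice(DATA, start, stop)))


def forked_sum(num_workers=2):
    """Sum DATA using forked children that each read one slice."""
    ctx = mp.get_context("fork")  # the fork start method is not available on Windows
    out = ctx.Queue()
    step = len(DATA) // num_workers
    procs = []
    for i in range(num_workers):
        start = i * step
        stop = len(DATA) if i == num_workers - 1 else start + step
        proc = ctx.Process(target=_worker, args=(start, stop, out))
        proc.start()
        procs.append(proc)
    total = sum(out.get() for _ in procs)  # collect results before joining
    for proc in procs:
        proc.join()
    return total


if __name__ == "__main__":
    print(forked_sum(num_workers=2))
```

As long as the children only read from `DATA`, the operating system can keep the underlying memory pages shared between parent and children.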
You can learn more about the various differences and tradeoffs between DDP variants in Lightning in our documentation.
· · ·
This feature is experimental at the moment and we’re currently evaluating its stability. DDP Notebook/Fork is only available on macOS and Linux, because Windows doesn’t support process forking. One important code limitation is that CUDA functions can’t be called in the main process before the GPU processes get forked; otherwise you will see a crash. This includes things like moving tensors to GPU and calling torch.cuda.* utility functions. However, because Lightning abstracts away all the accelerator boilerplate, this can usually be avoided entirely.
DON’T do this:
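A sketch of the anti-pattern, assuming Lightning 1.7 or later with at least two GPUs. It is wrapped in a function, with the illustrative name `cuda_before_fork`, so that nothing touches the GPU unless you call it on purpose.

```python
def cuda_before_fork():
    """Anti-pattern: CUDA is used in the main process before forking."""
    import torch
    from pytorch_lightning import Trainer

    # Moving a tensor to the GPU initializes CUDA in the main process.
    x = torch.tensor(1.0).cuda()
    # torch.cuda.* utilities such as this one initialize CUDA as well.
    torch.cuda.current_device()

    # Training with this Trainer is expected to fail once the GPU
    # processes are forked, because CUDA was already initialized above.
    trainer = Trainer(accelerator="gpu", devices=2)
    return trainer, x
```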
DO THIS instead:
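A sketch of the safer pattern under the same assumptions (Lightning 1.7 or later, at least two GPUs). The function name `cpu_only_setup` is illustrative only. Everything created in the main process stays on the CPU, and device placement is left to Lightning.

```python
def cpu_only_setup():
    """Keep the main process free of CUDA calls before training starts."""
    import torch
    from pytorch_lightning import Trainer

    # Tensors created in the main process stay on the CPU.
    x = torch.tensor(1.0)

    # No .cuda() calls and no torch.cuda.* utilities before this point;
    # moving the model and batches to the GPUs is left to Lightning.
    trainer = Trainer(accelerator="gpu", devices=2)
    return trainer, x
```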
· · ·
We’re very excited to now enable multi-GPU support in Jupyter notebooks, and we hope you enjoy this feature. Stay tuned for upcoming posts where we will dive deeper into some of the key features of PyTorch Lightning 1.7. If you have any feedback, or just want to get in touch, we’d love to hear from you on our Community Slack!