{"id":5648335,"date":"2023-07-02T06:12:52","date_gmt":"2023-07-02T10:12:52","guid":{"rendered":"https:\/\/lightning.ai\/pages\/?p=5648335"},"modified":"2024-02-01T09:50:34","modified_gmt":"2024-02-01T14:50:34","slug":"pytorch-memory-vit-llm","status":"publish","type":"post","link":"https:\/\/lightning.ai\/pages\/community\/tutorial\/pytorch-memory-vit-llm\/","title":{"rendered":"Optimizing Memory Usage for Training LLMs and Vision Transformers in PyTorch"},"content":{"rendered":"<div class=\"takeaways card-glow p-4 my-4\"><h3 class=\"w-100 d-block\">Key takeaway<\/h3><br \/>\nPeak memory consumption is a common bottleneck when training deep learning models such as vision transformers and LLMs. This article provides a series of techniques that can lower memory consumption in PyTorch by approximately 20x without sacrificing modeling performance and prediction accuracy.<br \/>\n<\/div>\n<h2 id=\"toc1\" class=\"md-end-block md-heading md-focus\"><span class=\"md-plain md-expand\">Introduction<\/span><\/h2>\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">In this article, we will explore 10 easily accessible techniques to reduce memory usage in PyTorch. These techniques are cumulative, meaning we can apply them on top of one another. <\/span><\/p>\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">We will begin working with a vision transformer from PyTorch&#8217;s <\/span><span class=\"md-meta-i-c md-link\"><a href=\"https:\/\/pytorch.org\/vision\/stable\/index.html\"><span class=\"md-plain\">Torchvision<\/span><\/a><\/span><span class=\"md-plain\"> library to provide simple code examples that you can execute on your own machine without downloading and installing too many code and dataset dependencies. The self-contained baseline training script consists of ~100 lines of code (ignoring whitespace and code comments). 
All code examples are available <\/span><span class=\"md-meta-i-c md-link\"><a href=\"https:\/\/github.com\/rasbt\/pytorch-memory-optim\"><span class=\"md-plain\">here on GitHub<\/span><\/a><\/span><span class=\"md-plain\">.<\/span><\/p>\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">Here&#8217;s an outline of the sections and techniques we are going to cover (you can access these sections directly from the Table of Contents on the right):<\/span><\/p>\n<ol>\n<li class=\"md-end-block md-p\"><span class=\"md-plain\">Finetuning a Vision Transformer<\/span><\/li>\n<li class=\"md-end-block md-p\"><span class=\"md-plain\">Automatic Mixed-Precision Training<\/span><\/li>\n<li class=\"md-end-block md-p\"><span class=\"md-plain\">Lower-Precision Training<\/span><\/li>\n<li class=\"md-end-block md-p\"><span class=\"md-plain\">Training with Reduced Batch Size<\/span><\/li>\n<li class=\"md-end-block md-p\"><span class=\"md-plain\">Gradient Accumulation and Microbatches<\/span><\/li>\n<li class=\"md-end-block md-p\"><span class=\"md-plain\">Choosing Leaner Optimizers<\/span><\/li>\n<li class=\"md-end-block md-p\"><span class=\"md-plain\">Instantiating Models on the Target Device<\/span><\/li>\n<li class=\"md-end-block md-p\"><span class=\"md-plain\">Distributed Training and Tensor Sharding<\/span><\/li>\n<li>Activation Checkpointing<\/li>\n<li class=\"md-end-block md-p\"><span class=\"md-plain\">Parameter Offloading <\/span><\/li>\n<li class=\"md-end-block md-p\"><span class=\"md-plain\">Putting It All Together: Training an LLM<\/span><\/li>\n<\/ol>\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">While we are working with a vision transformer here (the ViT-L-16 model from the paper <\/span><span class=\"md-meta-i-c md-link\"><a href=\"https:\/\/arxiv.org\/abs\/2010.11929\"><span class=\"md-plain\">An Image is Worth 16&#215;16 Words: Transformers for Image Recognition at Scale<\/span><\/a><\/span><span class=\"md-plain\">), all the techniques used in this 
article transfer to other models as well: Convolutional networks, large language models (LLMs), and others.<\/span><\/p>\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">Furthermore, after introducing one technique at a time using the abovementioned vision transformer example, we will apply these to train a BigBird-Roberta LLM on a text classification task. It wouldn&#8217;t be possible to train such a model on consumer hardware without these techniques.<\/span><\/p>\n<p class=\"md-end-block md-p md-focus\"><span class=\"md-plain md-expand\">PS: Note that there are many sections in this article. To not bloat this article further, I will keep each section purposefully short but provide links to more detailed articles on the individual topics.<\/span><\/p>\n<p>&nbsp;<\/p>\n<h2 id=\"toc2\" class=\"md-end-block md-heading\"><span class=\"md-plain\">1) Finetuning a Vision Transformer<\/span><\/h2>\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">To simplify the PyTorch code for the experiments, we will be introducing the <\/span><span class=\"md-meta-i-c md-link\"><a href=\"https:\/\/lightning.ai\/docs\/fabric\/stable\/\"><span class=\"md-plain\">open-source Fabric library<\/span><\/a><\/span><span class=\"md-plain\">, which allows us to apply various advanced PyTorch techniques (automatic mixed-precision training, multi-GPU training, tensor sharding, etc.) 
with a handful of lines of code (instead of dozens).<\/span><\/p>\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">The difference between plain PyTorch code and the Fabric version is subtle, involving only minor modifications, as highlighted in the code below:<\/span><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-5648336 \" src=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/07\/4_pytorch_plus_fabric.png\" alt=\"\" width=\"1175\" height=\"794\" srcset=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/07\/4_pytorch_plus_fabric.png 2204w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/07\/4_pytorch_plus_fabric-300x203.png 300w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/07\/4_pytorch_plus_fabric-1024x691.png 1024w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/07\/4_pytorch_plus_fabric-1536x1037.png 1536w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/07\/4_pytorch_plus_fabric-2048x1383.png 2048w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/07\/4_pytorch_plus_fabric-300x203@2x.png 600w\" sizes=\"(max-width: 1175px) 100vw, 1175px\" \/><\/p>\n<p>&nbsp;<\/p>\n<p class=\"md-end-block md-p md-focus\"><span class=\"md-plain md-expand\">As mentioned above, these minor changes now provide a gateway to utilize advanced features in PyTorch, as we will see in a bit, without restructuring any more of the existing code.<\/span><\/p>\n<p class=\"md-end-block md-p md-focus\"><span class=\"md-plain md-expand\">To summarize the figure above, the three main steps for converting plain PyTorch code to PyTorch+Fabric are as follows:<\/span><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-5648337\" src=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/07\/5_steps.png\" alt=\"\" width=\"890\" height=\"577\" 
srcset=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/07\/5_steps.png 1882w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/07\/5_steps-300x194.png 300w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/07\/5_steps-1024x664.png 1024w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/07\/5_steps-1536x996.png 1536w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/07\/5_steps-300x194@2x.png 600w\" sizes=\"(max-width: 890px) 100vw, 890px\" \/><\/p>\n<p>&nbsp;<\/p>\n<ol class=\"ol-list\" start=\"\">\n<li class=\"md-list-item md-focus-container\">\n<p class=\"md-end-block md-p md-focus\"><span class=\"md-plain md-expand\">Import Fabric and instantiate a Fabric object.<\/span><\/p>\n<\/li>\n<li class=\"md-list-item\">\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">Use Fabric to set up the model, the optimizer, and the data loader.<\/span><\/p>\n<\/li>\n<li class=\"md-list-item\">\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">Call <\/span><span class=\"md-pair-s\" spellcheck=\"false\"><code>fabric.backward()<\/code><\/span><span class=\"md-plain\"> on the loss instead of the usual <\/span><span class=\"md-pair-s\" spellcheck=\"false\"><code>loss.backward()<\/code><\/span><\/p>\n<\/li>\n<\/ol>\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">The vision transformer is based on the <\/span><span class=\"md-meta-i-c md-link\"><a href=\"https:\/\/arxiv.org\/abs\/2010.11929\"><span class=\"md-plain\">original ViT architecture<\/span><\/a><\/span><span class=\"md-plain\">, and the code is available here for inspection. 
Note that we are finetuning the model for classification instead of training it from scratch, which yields much better predictive performance.<\/span><\/p>\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">As a quick sanity check, the predictive performance and memory consumption using plain PyTorch and PyTorch with Fabric remain the same (+\/- expected fluctuations due to randomness):<\/span><\/p>\n<p class=\"md-end-block md-p\"><span class=\"md-pair-s \"><strong><span class=\"md-plain\">Plain PyTorch (<\/span><span class=\"md-meta-i-c md-link\"><a href=\"https:\/\/github.com\/rasbt\/pytorch-memory-optim\"><span class=\"md-plain\">01_pytorch-vit.py<\/span><\/a><\/span><span class=\"md-plain\">):<\/span><\/strong><\/span><\/p>\n<pre class=\"md-fences md-end-block ty-contain-cm modeLoaded\" lang=\"\" spellcheck=\"false\"><span role=\"presentation\">Time elapsed 17.94 min<\/span>\r\n<span role=\"presentation\">Memory used: 26.79 GB<\/span>\r\n<span role=\"presentation\">Test accuracy 95.85%<\/span><\/pre>\n<p class=\"md-end-block md-p\"><span class=\"md-pair-s \"><strong><span class=\"md-plain\">PyTorch with Fabric (<\/span><span class=\"md-meta-i-c md-link\"><a href=\"https:\/\/github.com\/rasbt\/pytorch-memory-optim\/blob\/main\/01-2_pytorch-fabric.py\"><span class=\"md-plain\">01-2_pytorch-fabric.py<\/span><\/a><\/span><span class=\"md-plain\">)<\/span><\/strong><\/span><\/p>\n<pre class=\"md-fences md-end-block ty-contain-cm modeLoaded\" lang=\"\" spellcheck=\"false\"><span role=\"presentation\">Time elapsed 17.88 min<\/span>\r\n<span role=\"presentation\">Memory used: 26.84 GB<\/span>\r\n<span role=\"presentation\">Test accuracy 96.06%<\/span><\/pre>\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">As an optional exercise, you are welcome to experiment with the code and replace<\/span><\/p>\n<pre class=\"md-fences md-end-block ty-contain-cm modeLoaded\" lang=\"python\" spellcheck=\"false\"><span role=\"presentation\"><span 
class=\"cm-variable\">model<\/span> <span class=\"cm-operator\">=<\/span> <span class=\"cm-variable\">vit_l_16<\/span>(<span class=\"cm-variable\">weights<\/span><span class=\"cm-operator\">=<\/span><span class=\"cm-variable\">ViT_L_16_Weights<\/span>.<span class=\"cm-property\">IMAGENET1K_V1<\/span>)<\/span><\/pre>\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">with<\/span><\/p>\n<pre class=\"md-fences md-end-block ty-contain-cm modeLoaded\" lang=\"python\" spellcheck=\"false\"><span role=\"presentation\"><span class=\"cm-variable\">model<\/span> <span class=\"cm-operator\">=<\/span> <span class=\"cm-variable\">vit_l_16<\/span>(<span class=\"cm-variable\">weights<\/span><span class=\"cm-operator\">=<\/span><span class=\"cm-keyword\">None<\/span>)<\/span><\/pre>\n<p class=\"md-end-block md-p md-focus\"><span class=\"md-plain md-expand\">This will train the same vision transformer architecture from scratch instead of finetuning it. If you carry out this exercise, you&#8217;ll see that the prediction accuracy drops from &gt;96% down to ~60%:<\/span><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-5648338\" src=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/07\/02_finetuning.png\" alt=\"\" width=\"2064\" height=\"914\" srcset=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/07\/02_finetuning.png 2064w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/07\/02_finetuning-300x133.png 300w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/07\/02_finetuning-1024x453.png 1024w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/07\/02_finetuning-1536x680.png 1536w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/07\/02_finetuning-2048x907.png 2048w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/07\/02_finetuning-300x133@2x.png 600w\" sizes=\"(max-width: 2064px) 100vw, 2064px\" 
\/><\/p>\n<p>&nbsp;<\/p>\n<h2 id=\"toc3\" class=\"md-end-block md-heading md-focus\"><span class=\"md-plain md-expand\">2) Automatic Mixed-Precision <\/span><\/h2>\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">In the previous section, we modified our PyTorch code using Fabric. Why go through all this hassle? As we will see below, we can now try advanced techniques, like mixed-precision and distributed training, by only changing one line of code.<\/span><\/p>\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">We will start with mixed-precision training, which has become the recent norm for training deep neural networks. <\/span><\/p>\n<p class=\"md-end-block md-p\"><span class=\"md-pair-s \"><strong><span class=\"md-plain\">Applying Mixed-Precision Training<\/span><\/strong><\/span><\/p>\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">We can apply mixed-precision training with only one small modification, changing <\/span><\/p>\n<pre class=\"md-fences md-end-block ty-contain-cm modeLoaded\" lang=\"python\" spellcheck=\"false\"><span role=\"presentation\"><span class=\"cm-variable\">fabric<\/span> <span class=\"cm-operator\">=<\/span> <span class=\"cm-variable\">Fabric<\/span>(<span class=\"cm-variable\">accelerator<\/span><span class=\"cm-operator\">=<\/span><span class=\"cm-string\">\"cuda\"<\/span>, <span class=\"cm-variable\">devices<\/span><span class=\"cm-operator\">=<\/span><span class=\"cm-number\">1<\/span>)<\/span><\/pre>\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">to the following:<\/span><\/p>\n<pre class=\"md-fences md-end-block ty-contain-cm modeLoaded\" lang=\"python\" spellcheck=\"false\"><span role=\"presentation\"><span class=\"cm-variable\">fabric<\/span> <span class=\"cm-operator\">=<\/span> <span class=\"cm-variable\">Fabric<\/span>(<span class=\"cm-variable\">accelerator<\/span><span class=\"cm-operator\">=<\/span><span class=\"cm-string\">\"cuda\"<\/span>, <span class=\"cm-variable\">devices<\/span><span 
class=\"cm-operator\">=<\/span><span class=\"cm-number\">1<\/span>, <span class=\"cm-variable\">precision<\/span><span class=\"cm-operator\">=<\/span><span class=\"cm-string\">\"16-mixed\"<\/span>)<\/span><\/pre>\n<p class=\"md-end-block md-p md-focus\"><span class=\"md-plain md-expand\">As a result, our memory consumption is reduced from 26.84 GB to 18.21 GB without sacrificing prediction accuracy, as shown below.<\/span><\/p>\n<p>&nbsp;<\/p>\n<div id=\"attachment_5648390\" style=\"width: 1840px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-5648390\" class=\"wp-image-5648390 size-full\" src=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/07\/mixed-new.png\" alt=\"a\" width=\"1830\" height=\"502\" srcset=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/07\/mixed-new.png 1830w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/07\/mixed-new-300x82.png 300w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/07\/mixed-new-1024x281.png 1024w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/07\/mixed-new-1536x421.png 1536w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/07\/mixed-new-300x82@2x.png 600w\" sizes=\"(max-width: 1830px) 100vw, 1830px\" \/><p id=\"caption-attachment-5648390\" class=\"wp-caption-text\"><span class=\"md-plain\">Comparing <\/span><span class=\"md-meta-i-c md-link\"><a href=\"https:\/\/github.com\/rasbt\/pytorch-memory-optim\/blob\/main\/01-2_pytorch-fabric.py\"><span class=\"md-plain\">01-2_pytorch-fabric.py<\/span><\/a><\/span><span class=\"md-plain\"> and <\/span><span class=\"md-meta-i-c md-link\"><a href=\"https:\/\/github.com\/rasbt\/pytorch-memory-optim\/blob\/main\/02_mixed-precision.py\"><span class=\"md-plain\">02_mixed-precision.py<\/span><\/a><\/span><\/p><\/div>\n<p class=\"md-end-block md-p md-focus\"><span class=\"md-plain md-expand\">As a bonus, 
mixed-precision training doesn&#8217;t only reduce memory usage but also reduces the runtime 6-fold (from 17.88 min to 3.45 min), which is a nice, added benefit; however, the focus of this particular article is on memory consumption to not complicate it further.<\/span><\/p>\n<p class=\"md-end-block md-p md-focus\"><span class=\"md-pair-s md-expand\"><strong><span class=\"md-plain\">What Is Mixed-Precision Training?<\/span><\/strong><\/span><\/p>\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">Mixed precision training uses both 16-bit and 32-bit precision to ensure no loss in accuracy. The computation of gradients in the 16-bit representation is much faster than in the 32-bit format and saves a significant amount of memory. This strategy is beneficial, especially when we are memory or compute-constrained.<\/span><\/p>\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">It&#8217;s called &#8220;mixed-&#8220;rather than &#8220;low-&#8220;precision training because we don&#8217;t transfer <\/span><span class=\"md-pair-s \"><em><span class=\"md-plain\">all<\/span><\/em><\/span><span class=\"md-plain\"> parameters and operations to 16-bit floats. Instead, we switch between 32-bit and 16-bit operations during training, hence, the term &#8220;mixed&#8221; precision.<\/span><\/p>\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">As illustrated in the figure below, mixed-precision training involves converting weights to lower-precision (FP16) for faster computation, calculating gradients, converting gradients back to higher-precision (FP32) for numerical stability, and updating the original weights with the scaled gradients.<\/span><\/p>\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">This approach allows for efficient training while maintaining the accuracy and stability of the neural network. 
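Under the hood, this is roughly what the "16-mixed" setting automates. As a sketch (using a hypothetical toy model and plain PyTorch's autocast plus a gradient scaler, not the article's Fabric code), the loop looks like this; the CPU branch merely keeps the sketch runnable on machines without a GPU:

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
use_cuda = device == "cuda"

model = torch.nn.Linear(10, 2).to(device)  # weights stay in FP32 ("master" copy)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
# GradScaler multiplies the loss so small FP16 gradients don't underflow to zero
scaler = torch.cuda.amp.GradScaler(enabled=use_cuda)

features = torch.randn(8, 10, device=device)
targets = torch.randint(0, 2, (8,), device=device)

optimizer.zero_grad()
# selected forward-pass ops run in 16-bit precision inside this context
with torch.autocast(device_type=device,
                    dtype=torch.float16 if use_cuda else torch.bfloat16):
    loss = torch.nn.functional.cross_entropy(model(features), targets)

scaler.scale(loss).backward()  # backward pass on the scaled loss
scaler.step(optimizer)         # unscales gradients to FP32, then updates weights
scaler.update()
```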
<\/span><\/p>\n<p>&nbsp;<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-5648340\" src=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/07\/8_mixed-training-1024x339.png\" alt=\"\" width=\"622\" height=\"206\" srcset=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/07\/8_mixed-training-1024x339.png 1024w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/07\/8_mixed-training-300x99.png 300w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/07\/8_mixed-training-1536x509.png 1536w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/07\/8_mixed-training.png 1788w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/07\/8_mixed-training-300x99@2x.png 600w\" sizes=\"(max-width: 622px) 100vw, 622px\" \/><\/p>\n<p class=\"md-end-block md-p md-focus\"><span class=\"md-plain md-expand\">For additional details, I recommend my standalone article <\/span><span class=\"md-meta-i-c md-link\"><a href=\"https:\/\/lightning.ai\/pages\/community\/tutorial\/accelerating-large-language-models-with-mixed-precision-techniques\/\"><span class=\"md-plain\">Accelerating Large Language Models with Mixed-Precision Techniques<\/span><\/a><\/span><span class=\"md-plain\">, where I dive deeper into the underlying concepts.<\/span><\/p>\n<h2 id=\"toc4\" class=\"md-end-block md-heading md-focus\"><span class=\"md-plain md-expand\">3) Lower-Precision Training<\/span><\/h2>\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">We can also take it a step further and try running with &#8220;full&#8221; lower 16-bit precision (instead of mixed-precision, which converts intermediate results to a 32-bit representation).<\/span><\/p>\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">We can enable lower-precision training by changing<\/span><\/p>\n<pre class=\"md-fences md-end-block ty-contain-cm modeLoaded\" lang=\"python\" 
spellcheck=\"false\"><span role=\"presentation\"><span class=\"cm-variable\">fabric<\/span> <span class=\"cm-operator\">=<\/span> <span class=\"cm-variable\">Fabric<\/span>(<span class=\"cm-variable\">accelerator<\/span><span class=\"cm-operator\">=<\/span><span class=\"cm-string\">\"cuda\"<\/span>, <span class=\"cm-variable\">precision<\/span><span class=\"cm-operator\">=<\/span><span class=\"cm-string\">\"16-mixed\"<\/span>)<\/span><\/pre>\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">to the following: <\/span><\/p>\n<pre class=\"md-fences md-end-block ty-contain-cm modeLoaded\" lang=\"python\" spellcheck=\"false\"><span role=\"presentation\"><span class=\"cm-variable\">fabric<\/span> <span class=\"cm-operator\">=<\/span> <span class=\"cm-variable\">Fabric<\/span>(<span class=\"cm-variable\">accelerator<\/span><span class=\"cm-operator\">=<\/span><span class=\"cm-string\">\"cuda\"<\/span>, <span class=\"cm-variable\">precision<\/span><span class=\"cm-operator\">=<\/span><span class=\"cm-string\">\"16-true\"<\/span>)<\/span><\/pre>\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">However, you may notice that when running this code, you&#8217;ll encounter NaN values in the loss:<\/span><\/p>\n<pre class=\"md-fences md-end-block ty-contain-cm modeLoaded\" lang=\"\" spellcheck=\"false\"><span role=\"presentation\">Epoch: 0001\/0001 | Batch 0000\/0703 | Loss: 2.4105<\/span>\r\n<span role=\"presentation\">Epoch: 0001\/0001 | Batch 0300\/0703 | Loss: nan<\/span>\r\n<span role=\"presentation\">Epoch: 0001\/0001 | Batch 0600\/0703 | Loss: nan<\/span>\r\n<span role=\"presentation\">...<\/span><\/pre>\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">This is because regular 16-bit floats can only represent numbers between -65,504 and 65,504:<\/span><\/p>\n<pre class=\"md-fences md-end-block ty-contain-cm modeLoaded\" lang=\"python\" spellcheck=\"false\"><span role=\"presentation\"><span class=\"cm-variable\">In<\/span> [<span 
class=\"cm-number\">1<\/span>]: <span class=\"cm-keyword\">import<\/span> <span class=\"cm-variable\">torch<\/span><\/span>\r\n<span role=\"presentation\">\u200b<\/span>\r\n<span role=\"presentation\"><span class=\"cm-variable\">In<\/span> [<span class=\"cm-number\">2<\/span>]: <span class=\"cm-variable\">torch<\/span>.<span class=\"cm-property\">finfo<\/span>(<span class=\"cm-variable\">torch<\/span>.<span class=\"cm-property\">float16<\/span>)<\/span>\r\n<span role=\"presentation\"><span class=\"cm-variable\">Out<\/span>[<span class=\"cm-number\">2<\/span>]: <span class=\"cm-variable\">finfo<\/span>(<span class=\"cm-variable\">resolution<\/span><span class=\"cm-operator\">=<\/span><span class=\"cm-number\">0.001<\/span>, <span class=\"cm-builtin\">min<\/span><span class=\"cm-operator\">=-<\/span><span class=\"cm-number\">65504<\/span>, <span class=\"cm-builtin\">max<\/span><span class=\"cm-operator\">=<\/span><span class=\"cm-number\">65504<\/span>, <span class=\"cm-variable\">eps<\/span><span class=\"cm-operator\">=<\/span><span class=\"cm-number\">0.000976562<\/span>, <span class=\"cm-variable\">smallest_normal<\/span><span class=\"cm-operator\">=<\/span><span class=\"cm-number\">6.10352e-05<\/span>, <span class=\"cm-variable\">tiny<\/span><span class=\"cm-operator\">=<\/span><span class=\"cm-number\">6.10352e-05<\/span>, <span class=\"cm-variable\">dtype<\/span><span class=\"cm-operator\">=<\/span><span class=\"cm-variable\">float16<\/span>)<\/span><\/pre>\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">So, to avoid the NaN issue, we can use the &#8220;<\/span><span class=\"md-pair-s\" spellcheck=\"false\"><code>bf16-true<\/code><\/span><span class=\"md-plain\">&#8221; setting.<\/span><\/p>\n<pre class=\"md-fences md-end-block ty-contain-cm modeLoaded\" lang=\"python\" spellcheck=\"false\"><span role=\"presentation\"><span class=\"cm-variable\">fabric<\/span> <span class=\"cm-operator\">=<\/span> <span class=\"cm-variable\">Fabric<\/span>(<span 
class=\"cm-variable\">accelerator<\/span><span class=\"cm-operator\">=<\/span><span class=\"cm-string\">\"cuda\"<\/span>, <span class=\"cm-variable\">precision<\/span><span class=\"cm-operator\">=<\/span><span class=\"cm-string\">\"bf16-true\"<\/span>)<\/span><\/pre>\n<p class=\"md-end-block md-p md-focus\"><span class=\"md-plain md-expand\">As a result, we can reduce the memory consumption even further down to 13.82 GB (again, without sacrificing accuracy):<\/span><\/p>\n<div id=\"attachment_5648393\" style=\"width: 1820px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-5648393\" class=\"wp-image-5648393 size-full\" src=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/07\/bfloat-new-1.png\" alt=\"a\" width=\"1810\" height=\"562\" srcset=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/07\/bfloat-new-1.png 1810w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/07\/bfloat-new-1-300x93.png 300w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/07\/bfloat-new-1-1024x318.png 1024w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/07\/bfloat-new-1-1536x477.png 1536w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/07\/bfloat-new-1-300x93@2x.png 600w\" sizes=\"(max-width: 1810px) 100vw, 1810px\" \/><p id=\"caption-attachment-5648393\" class=\"wp-caption-text\"><span class=\"md-plain\">Comparing <\/span><span class=\"md-meta-i-c md-link\"><a href=\"https:\/\/github.com\/rasbt\/pytorch-memory-optim\/blob\/main\/03_bfloat16.py\"><span class=\"md-plain\">03_bfloat16.py<\/span><\/a><\/span><span class=\"md-plain\"> to the previous codes<\/span><\/p><\/div>\n<p class=\"md-end-block md-p\"><span class=\"md-pair-s \"><strong><span class=\"md-plain\">What Is Bfloat16?<\/span><\/strong><\/span><\/p>\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">The &#8220;bf16&#8221; in <\/span><span 
class=\"md-pair-s\" spellcheck=\"false\"><code>\"bf16-mixed\"<\/code><\/span><span class=\"md-plain\"> stands for <\/span><span class=\"md-meta-i-c md-link\"><a href=\"https:\/\/cloud.google.com\/tpu\/docs\/bfloat16\"><span class=\"md-plain\">Brain Floating Point<\/span><\/a><\/span><span class=\"md-plain\"> (bfloat16). Google developed this format for machine learning and deep learning applications, particularly in their Tensor Processing Units (TPUs). Bfloat16 extends the dynamic range compared to the conventional float16 format at the expense of decreased precision.<\/span><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-5648342\" src=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/07\/bfloat16.png\" alt=\"\" width=\"634\" height=\"493\" srcset=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/07\/bfloat16.png 1640w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/07\/bfloat16-300x233.png 300w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/07\/bfloat16-1024x795.png 1024w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/07\/bfloat16-1536x1193.png 1536w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/07\/bfloat16-300x233@2x.png 600w\" sizes=\"(max-width: 634px) 100vw, 634px\" \/><\/p>\n<p>&nbsp;<\/p>\n<p class=\"md-end-block md-p md-focus\"><span class=\"md-plain md-expand\">The extended dynamic range helps bfloat16 to represent very large and very small numbers, making it more suitable for deep learning applications where a wide range of values might be encountered. However, the lower precision may affect the accuracy of certain calculations or lead to rounding errors in some cases. 
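You can inspect this trade-off directly in PyTorch: bfloat16 spends its bits on a wider exponent (range) at the cost of a shorter mantissa (precision):

```python
import torch

fp16 = torch.finfo(torch.float16)
bf16 = torch.finfo(torch.bfloat16)

print(fp16.max)  # 65504.0 -- overflows easily
print(bf16.max)  # ~3.39e38 -- roughly the same range as float32
print(fp16.eps)  # 0.0009765625 (10 mantissa bits)
print(bf16.eps)  # 0.0078125    (7 mantissa bits, i.e., coarser resolution)

# A value that overflows to inf in float16 stays finite in bfloat16:
x = torch.tensor(70000.0)
print(x.to(torch.float16))   # inf
print(x.to(torch.bfloat16))  # finite (rounds to 70144)
```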
But in most deep learning applications, this reduced precision has minimal impact on modeling performance.<\/span><\/p>\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">While bfloat16 was originally developed for TPUs, this format is now supported by several NVIDIA GPUs as well, beginning with the A100 Tensor Core GPUs, which are part of the NVIDIA Ampere architecture.<\/span><\/p>\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">You can check whether your GPU supports <\/span><span class=\"md-pair-s\" spellcheck=\"false\"><code>bfloat16<\/code><\/span><span class=\"md-plain\"> via the following code:<\/span><\/p>\n<pre class=\"md-fences md-end-block ty-contain-cm modeLoaded\" lang=\"python\" spellcheck=\"false\"><span role=\"presentation\"><span class=\"cm-operator\">&gt;&gt;&gt;<\/span> <span class=\"cm-keyword\">import<\/span> <span class=\"cm-variable\">torch<\/span><\/span>\r\n<span role=\"presentation\"><span class=\"cm-operator\">&gt;&gt;&gt;<\/span> <span class=\"cm-variable\">torch<\/span>.<span class=\"cm-property\">cuda<\/span>.<span class=\"cm-property\">is_bf16_supported<\/span>()<\/span>\r\n<span role=\"presentation\"><span class=\"cm-keyword\">True<\/span><\/span><\/pre>\n<h2 id=\"toc5\" class=\"md-end-block md-heading\"><span class=\"md-plain\">4) Reducing the Batch Size<\/span><\/h2>\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">Let&#8217;s tackle the elephant in the room: why don&#8217;t we simply reduce the batch size? This is almost always an option to reduce memory consumption. However, it can sometimes result in worse predictive performance since it alters the training dynamics. 
(For more details, <\/span><span class=\"md-meta-i-c md-link\"><a href=\"https:\/\/lightning.ai\/pages\/courses\/deep-learning-fundamentals\/9.0-overview-techniques-for-speeding-up-model-training\/unit-9.5-increasing-batch-sizes-to-increase-throughput\/\"><span class=\"md-plain\">see Lecture 9.5 in my Deep Learning Fundamentals course<\/span><\/a><\/span><span class=\"md-plain\">.) <\/span><\/p>\n<p class=\"md-end-block md-p md-focus\"><span class=\"md-plain md-expand\">Either way, let&#8217;s reduce the batch size to see how that affects the results. It turns out we can lower the batch size to 16, which brings memory consumption down to 5.69 GB, without sacrificing performance:<\/span><\/p>\n<div id=\"attachment_5648394\" style=\"width: 1974px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-5648394\" class=\"wp-image-5648394 size-full\" src=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/07\/new-lower-batch.png\" alt=\"a\" width=\"1964\" height=\"514\" srcset=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/07\/new-lower-batch.png 1964w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/07\/new-lower-batch-300x79.png 300w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/07\/new-lower-batch-1024x268.png 1024w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/07\/new-lower-batch-1536x402.png 1536w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/07\/new-lower-batch-300x79@2x.png 600w\" sizes=\"(max-width: 1964px) 100vw, 1964px\" \/><p id=\"caption-attachment-5648394\" class=\"wp-caption-text\"><span class=\"md-plain\">Comparing <\/span><span class=\"md-meta-i-c md-link\"><a href=\"https:\/\/github.com\/rasbt\/pytorch-memory-optim\/blob\/main\/04_lower-batchsize.py\"><span class=\"md-plain\">04_lower-batchsize.py<\/span><\/a><\/span><span class=\"md-plain\"> to the previous 
codes.<\/span><\/p><\/div>\n<p>&nbsp;<\/p>\n<h2 id=\"toc6\" class=\"md-end-block md-heading\"><span class=\"md-plain\">5) Using Gradient Accumulation to Create Microbatches<\/span><\/h2>\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">Gradient accumulation is a way to virtually increase the batch size during training, which is very useful when the available GPU memory is insufficient to accommodate the desired batch size. Note that this only affects the runtime, not the modeling performance.<\/span><\/p>\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">In gradient accumulation, gradients are computed for smaller batches and accumulated (usually summed or averaged) over multiple iterations instead of updating the model weights after every batch. Once the accumulated gradients reach the target \u201cvirtual\u201d batch size, the model weights are updated with the accumulated gradients.<\/span><\/p>\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">To enable gradient accumulation, there are only two small modifications to the forward and backward pass required:<\/span><\/p>\n<div id=\"attachment_5648344\" style=\"width: 809px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-5648344\" class=\"wp-image-5648344\" src=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/07\/gradient-acc.png\" alt=\"\" width=\"799\" height=\"549\" srcset=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/07\/gradient-acc.png 1534w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/07\/gradient-acc-300x206.png 300w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/07\/gradient-acc-1024x704.png 1024w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/07\/gradient-acc-300x206@2x.png 600w\" sizes=\"(max-width: 799px) 100vw, 799px\" \/><p id=\"caption-attachment-5648344\" class=\"wp-caption-text\"><span class=\"md-plain\">Code 
modification in <\/span><span class=\"md-meta-i-c md-link\"><a href=\"05_gradient-accum.py\"><span class=\"md-plain\">05_gradient-accum.py<\/span><\/a><\/span><\/p><\/div>\n<p>&nbsp;<\/p>\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">I covered gradient accumulation in more detail in my article <\/span><span class=\"md-meta-i-c md-link\"><a href=\"https:\/\/lightning.ai\/pages\/blog\/gradient-accumulation\/\"><span class=\"md-plain\">Finetuning LLMs on a Single GPU Using Gradient Accumulation<\/span><\/a><\/span><span class=\"md-plain\">. <\/span><\/p>\n<p class=\"md-end-block md-p md-focus\"><span class=\"md-plain md-expand\">Using an effective batch size of 16 and 4 accumulation steps means we will use an actual batch size of 4 (since 16 \/ 4 = 4).<\/span><\/p>\n<div id=\"attachment_5648395\" style=\"width: 1984px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-5648395\" class=\"wp-image-5648395 size-full\" src=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/07\/gradient-accum.png\" alt=\"a\" width=\"1974\" height=\"542\" srcset=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/07\/gradient-accum.png 1974w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/07\/gradient-accum-300x82.png 300w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/07\/gradient-accum-1024x281.png 1024w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/07\/gradient-accum-1536x422.png 1536w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/07\/gradient-accum-300x82@2x.png 600w\" sizes=\"(max-width: 1974px) 100vw, 1974px\" \/><p id=\"caption-attachment-5648395\" class=\"wp-caption-text\"><span class=\"md-plain\">Result of <\/span><span class=\"md-meta-i-c md-link\"><a href=\"05_gradient-accum.py\"><span class=\"md-plain\">05_gradient-accum.py<\/span><\/a><\/span><\/p><\/div>\n<p class=\"md-end-block md-p\"><span 
class=\"md-plain md-expand\">A disadvantage of this technique is that it increases the runtime from 3.96 min to 12.91 min.<\/span><\/p>\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">Of course, we could even go smaller and use 16 accumulation steps. This would lead to a microbatch size of 1, reducing the memory requirements further (by about 75%), but I&#8217;ll leave this as an optional exercise.<\/span><\/p>\n<h2 id=\"toc7\" class=\"md-end-block md-heading\"><span class=\"md-plain\">6) Using a Leaner Optimizer<\/span><\/h2>\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">Did you know that the popular Adam optimizer comes with additional parameters? For instance, Adam has 2 additional optimizer parameters (a mean and a variance) for each model parameter. <\/span><\/p>\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">So, by swapping Adam with a stateless optimizer like SGD, we can reduce the number of values we need to store by 2\/3, which can be quite significant when working with vision transformers and LLMs.<\/span><\/p>\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">The downside of plain SGD is that it usually has worse convergence properties. 
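Before dealing with that downside, it is worth quantifying what Adam's extra state actually costs. A back-of-envelope sketch, not a measurement; the parameter count is a rough, assumed figure for ViT-L/16, and fp32 optimizer states are assumed:

```python
# Back-of-envelope sketch of optimizer state memory (assumed sizes, fp32 states).
# Adam keeps two extra values (a mean and a variance) per model parameter,
# while plain SGD without momentum keeps none.
def optimizer_state_bytes(num_params, states_per_param, bytes_per_value=4):
    # Memory the optimizer needs on top of the model weights themselves.
    return num_params * states_per_param * bytes_per_value

num_params = 304_000_000  # ViT-L/16 has roughly 304 M parameters

adam_state = optimizer_state_bytes(num_params, states_per_param=2)  # mean + variance
sgd_state = optimizer_state_bytes(num_params, states_per_param=0)   # stateless

print(f"Adam optimizer state: {adam_state / 1e9:.2f} GB")  # ~2.43 GB extra
print(f"SGD optimizer state:  {sgd_state / 1e9:.2f} GB")   # 0.00 GB
```

Eliminating this state is where the memory saving in this section comes from.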
So, let&#8217;s swap Adam with SGD and introduce a cosine decay learning rate scheduler to compensate for this and achieve better convergence.<\/span><\/p>\n<p class=\"md-end-block md-p\"><span class=\"md-plain md-expand\">In short, we will be swapping the previously used Adam optimizer:<\/span><\/p>\n<pre class=\"md-fences md-end-block ty-contain-cm modeLoaded\" lang=\"python\" spellcheck=\"false\"><span role=\"presentation\"><span class=\"cm-variable\">optimizer<\/span> <span class=\"cm-operator\">=<\/span> <span class=\"cm-variable\">torch<\/span>.<span class=\"cm-property\">optim<\/span>.<span class=\"cm-property\">Adam<\/span>(<span class=\"cm-variable\">model<\/span>.<span class=\"cm-property\">parameters<\/span>(), <span class=\"cm-variable\">lr<\/span><span class=\"cm-operator\">=<\/span><span class=\"cm-number\">5e-5<\/span>)<\/span><\/pre>\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">with an SGD optimizer plus scheduler:<\/span><\/p>\n<pre class=\"md-fences md-end-block ty-contain-cm modeLoaded\" lang=\"python\" spellcheck=\"false\"><span role=\"presentation\"><span class=\"cm-variable\">optimizer<\/span> <span class=\"cm-operator\">=<\/span> <span class=\"cm-variable\">torch<\/span>.<span class=\"cm-property\">optim<\/span>.<span class=\"cm-property\">SGD<\/span>(<span class=\"cm-variable\">model<\/span>.<span class=\"cm-property\">parameters<\/span>(), <span class=\"cm-variable\">lr<\/span><span class=\"cm-operator\">=<\/span><span class=\"cm-number\">0.01<\/span>)<\/span>\r\n<span role=\"presentation\">\u200b<\/span>\r\n<span role=\"presentation\"><span class=\"cm-variable\">num_steps<\/span> <span class=\"cm-operator\">=<\/span> <span class=\"cm-variable\">NUM_EPOCHS<\/span> <span class=\"cm-operator\">*<\/span> <span class=\"cm-builtin\">len<\/span>(<span class=\"cm-variable\">train_loader<\/span>)<\/span>\r\n<span role=\"presentation\"><span class=\"cm-variable\">scheduler<\/span> <span class=\"cm-operator\">=<\/span> <span 
class=\"cm-variable\">torch<\/span>.<span class=\"cm-property\">optim<\/span>.<span class=\"cm-property\">lr_scheduler<\/span>.<span class=\"cm-property\">CosineAnnealingLR<\/span>(<\/span>\r\n<span role=\"presentation\"> \u00a0 \u00a0<span class=\"cm-variable\">optimizer<\/span>, <span class=\"cm-variable\">T_max<\/span><span class=\"cm-operator\">=<\/span><span class=\"cm-variable\">num_steps<\/span>)<\/span><\/pre>\n<p class=\"md-end-block md-p md-focus\"><span class=\"md-plain\">With this change, we are able to reduce the peak memory consumption while maintaining ~97% classification accuracy:<\/span><\/p>\n<div id=\"attachment_5648396\" style=\"width: 1982px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-5648396\" class=\"wp-image-5648396 size-full\" src=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/07\/sgd-new.png\" alt=\"a\" width=\"1972\" height=\"494\" srcset=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/07\/sgd-new.png 1972w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/07\/sgd-new-300x75.png 300w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/07\/sgd-new-1024x257.png 1024w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/07\/sgd-new-1536x385.png 1536w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/07\/sgd-new-300x75@2x.png 600w\" sizes=\"(max-width: 1972px) 100vw, 1972px\" \/><p id=\"caption-attachment-5648396\" class=\"wp-caption-text\"><span class=\"md-plain\">Result of <\/span><span class=\"md-meta-i-c md-link\"><a href=\"https:\/\/github.com\/rasbt\/pytorch-memory-optim\/blob\/main\/06_sgd-with-scheduler.py\"><span class=\"md-plain\">06_sgd-with-scheduler.py<\/span><\/a><\/span><\/p><\/div>\n<p class=\"md-end-block md-p md-focus\"><span class=\"md-plain md-expand\">If you want to learn more, I discussed learning rate schedulers (including cosine decay with a 1-cycle 
schedule) in more detail in my <\/span><span class=\"md-meta-i-c md-link\"><a href=\"https:\/\/lightning.ai\/pages\/courses\/deep-learning-fundamentals\/unit-6-overview-essential-deep-learning-tips-tricks\/unit-6.2-learning-rates-and-learning-rate-schedulers\/\"><span class=\"md-plain\">Unit 6.2 of my Deep Learning Fundamentals class<\/span><\/a><\/span><span class=\"md-plain\">.<\/span><\/p>\n<h2 id=\"toc8\" class=\"md-end-block md-heading\"><span class=\"md-plain\">7) Creating the Model on the Target Device with Desired Precision<\/span><\/h2>\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">When we instantiate a model in PyTorch, we usually create it on the CPU device first, and then we transfer it onto the target device and convert it to the desired precision:<\/span><\/p>\n<pre class=\"md-fences md-end-block ty-contain-cm modeLoaded\" lang=\"python\" spellcheck=\"false\"><span role=\"presentation\"><span class=\"cm-variable\">model<\/span> <span class=\"cm-operator\">=<\/span> <span class=\"cm-variable\">vit_l_16<\/span>(<span class=\"cm-variable\">weights<\/span><span class=\"cm-operator\">=<\/span><span class=\"cm-variable\">ViT_L_16_Weights<\/span>.<span class=\"cm-property\">IMAGENET1K_V1<\/span>)<\/span>\r\n<span role=\"presentation\"><span class=\"cm-variable\">model<\/span>.<span class=\"cm-property\">cuda<\/span>().<span class=\"cm-property\">half<\/span>()<\/span><\/pre>\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">This can be inefficient considering the intermediate model representation in full precision on the CPU. 
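To get a feel for the overhead, here is a rough parameter-memory sketch (not a measurement; it assumes roughly 304 M parameters for ViT-L/16 and ignores activations):

```python
# Rough sketch of the weight memory involved in CPU-first instantiation.
# The parameter count is an assumed round figure for ViT-L/16.
num_params = 304_000_000

fp32_bytes = num_params * 4  # full-precision copy created on the CPU first
fp16_bytes = num_params * 2  # the representation we actually want

print(f"fp32 weights: {fp32_bytes / 1e9:.2f} GB")  # ~1.22 GB
print(f"fp16 weights: {fp16_bytes / 1e9:.2f} GB")  # ~0.61 GB
```

So the full-precision intermediate copy roughly doubles the footprint we actually need, which lines up with the 1.24 GB versus 0.65 GB measurements below.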
Instead, we can directly create the model in the desired precision on the target device (e.g., GPU) using the <\/span><span class=\"md-pair-s\" spellcheck=\"false\"><code>init_module<\/code><\/span><span class=\"md-plain\"> context in Fabric:<\/span><\/p>\n<pre class=\"md-fences md-end-block ty-contain-cm modeLoaded\" lang=\"python\" spellcheck=\"false\"><span role=\"presentation\"><span class=\"cm-keyword\">from<\/span> <span class=\"cm-variable\">lightning<\/span> <span class=\"cm-keyword\">import<\/span> <span class=\"cm-variable\">Fabric<\/span><\/span>\r\n<span role=\"presentation\">\u200b<\/span>\r\n<span role=\"presentation\"><span class=\"cm-variable\">fabric<\/span> <span class=\"cm-operator\">=<\/span> <span class=\"cm-variable\">Fabric<\/span>(<span class=\"cm-variable\">accelerator<\/span><span class=\"cm-operator\">=<\/span><span class=\"cm-string\">\"cuda\"<\/span>, <span class=\"cm-variable\">devices<\/span><span class=\"cm-operator\">=<\/span><span class=\"cm-number\">1<\/span>, <span class=\"cm-variable\">precision<\/span><span class=\"cm-operator\">=<\/span><span class=\"cm-string\">\"16-true\"<\/span>)<\/span>\r\n<span role=\"presentation\">\u200b<\/span>\r\n<span role=\"presentation\"><span class=\"cm-keyword\">with<\/span> <span class=\"cm-variable\">fabric<\/span>.<span class=\"cm-property\">init_module<\/span>():<\/span>\r\n<span role=\"presentation\"> \u00a0 \u00a0<span class=\"cm-variable\">model<\/span> <span class=\"cm-operator\">=<\/span> <span class=\"cm-variable\">vit_l_16<\/span>(<span class=\"cm-variable\">weights<\/span><span class=\"cm-operator\">=<\/span><span class=\"cm-variable\">ViT_L_16_Weights<\/span>.<span class=\"cm-property\">IMAGENET1K_V1<\/span>)<\/span><\/pre>\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">For this specific model, the peak memory during the forward pass is larger than the model size in its full precision representation. 
So, we will benchmark the <\/span><span class=\"md-pair-s\" spellcheck=\"false\"><code>fabric.init_module<\/code><\/span><span class=\"md-plain\"> approach just for the model loading itself.<\/span><\/p>\n<ul class=\"ul-list\" data-mark=\"-\">\n<li class=\"md-list-item\">\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">GPU Peak memory without <\/span><span class=\"md-pair-s\" spellcheck=\"false\"><code>init_module<\/code><\/span><span class=\"md-plain\">: 1.24 GB (07_01_init-module.py)<\/span><\/p>\n<\/li>\n<li class=\"md-list-item\">\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">GPU Peak memory with <\/span><span class=\"md-pair-s\" spellcheck=\"false\"><code>init_module<\/code><\/span><span class=\"md-plain\">: 0.65 GB (07_03_init-module.py)<\/span><\/p>\n<\/li>\n<\/ul>\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">As we can see based on the results above, in this case, <\/span><span class=\"md-pair-s\" spellcheck=\"false\"><code>init_module<\/code><\/span><span class=\"md-plain\"> reduces the peak memory requirements for model loading by 50%. We will be making use of this technique later in this article.<\/span><\/p>\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">For more details about <\/span><span class=\"md-pair-s\" spellcheck=\"false\"><code>init_module<\/code><\/span><span class=\"md-plain\">, please see the more detailed article on <\/span><span class=\"md-meta-i-c md-link\"><a href=\"https:\/\/lightning.ai\/pages\/community\/efficient-initialization-of-large-models\/\"><span class=\"md-plain\">Efficient Initialization of Large Models<\/span><\/a><\/span><span class=\"md-plain\">.<\/span><\/p>\n<h2 id=\"toc9\" class=\"md-end-block md-heading\"><span class=\"md-plain\">8) Distributed Training and Tensor Sharding<\/span><\/h2>\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">The next modification we are going to try is multi-GPU training. 
It becomes beneficial if we have multiple GPUs at our disposal since it allows us to train our models even faster. <\/span><\/p>\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">However, here, we are mainly interested in the memory savings. So, we are going to use a more advanced, distributed multi-GPU strategy called Fully Sharded Data Parallelism (FSDP), which utilizes both data parallelism and tensor parallelism for sharding large weight matrices across multiple devices.<\/span><\/p>\n<p class=\"md-end-block md-p\">Note that the model is already very small, which is why we wouldn&#8217;t see any major effect when adding this technique to the code from section 7 above. Hence, to focus on the pure effect of sharding, we are going to experiment with a larger vision transformer from TorchVision:<\/p>\n<pre>model = vit_h_14(weights=ViT_H_14_Weights.IMAGENET1K_SWAG_E2E_V1)<\/pre>\n<p>We are changing<\/p>\n<pre class=\"hljs collapse-false language-python\">fabric = Fabric(accelerator=\"cuda\", devices=1, precision=\"16-mixed\")<\/pre>\n<p>to<\/p>\n<pre class=\"hljs collapse-false language-python\">fabric = Fabric(accelerator=\"cuda\", \r\n    devices=4, strategy=\"fsdp\", precision=\"16-mixed\")<\/pre>\n<p>This reduces the memory from 22.63 GB to 19.83 GB.<\/p>\n<p>(Code example <a href=\"https:\/\/github.com\/rasbt\/pytorch-memory-optim\/blob\/main\/08-10-vit32\/08a_fsdp-defaults.py\">08a_fsdp-defaults.py<\/a>)<\/p>\n<p>Note that these are the default settings. However, the real benefit of FSDP shows when we work with LLMs and ViTs that have more than 100 million parameters, since parameters and computations are split across GPUs. 
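As a back-of-envelope sketch of that benefit, we can look at how evenly sharding the weights scales with device count. This is only the parameter-memory term (fp16 weights, activations and optimizer state ignored), and the parameter count is a rough, assumed figure for ViT-H/14:

```python
# Hedged sketch: per-GPU weight memory under even parameter sharding (fp16).
# Activations and optimizer state, which often dominate, are ignored here.
def sharded_param_gb(num_params, num_devices, bytes_per_value=2):
    return num_params * bytes_per_value / num_devices / 1e9

num_params = 632_000_000  # ViT-H/14 has roughly 632 M parameters

print(f"1 GPU:  {sharded_param_gb(num_params, 1):.2f} GB")  # 1.26 GB
print(f"4 GPUs: {sharded_param_gb(num_params, 4):.2f} GB")  # 0.32 GB
```

Because activations and gradients still take up most of the memory during training, the measured savings above (22.63 GB to 19.83 GB) are smaller than this parameter-only arithmetic suggests.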
To get more control, there are two modifications we can try by customizing the auto-wrap strategy:<\/p>\n<pre class=\"hljs collapse-false language-python\">from functools import partial\r\nfrom torch.distributed.fsdp.wrap import transformer_auto_wrap_policy\r\nfrom torchvision.models.vision_transformer import EncoderBlock\r\nfrom lightning.fabric.strategies import FSDPStrategy<\/pre>\n<pre class=\"hljs collapse-false language-python\">auto_wrap_policy = partial(transformer_auto_wrap_policy,\r\n    transformer_layer_cls={EncoderBlock}\r\n)\r\nstrategy = FSDPStrategy(auto_wrap_policy=auto_wrap_policy)\r\nfabric = Fabric(accelerator=\"cuda\", devices=4,\r\n    strategy=strategy, precision=\"16-mixed\"\r\n)<\/pre>\n<p>This reduces the memory further from 19.83 GB to 17.63 GB.<\/p>\n<p>(Code example <a href=\"https:\/\/github.com\/rasbt\/pytorch-memory-optim\/blob\/main\/08-10-vit32\/08b_fsdp-custom.py\">08b_fsdp-custom.py<\/a>)<\/p>\n<p>&nbsp;<\/p>\n<p>To get more control over the minimum size of the layers that are considered for sharding, we can also use a size-based auto-wrap policy. This lowers the default wrapping threshold from 100 million parameters to 2 million:<\/p>\n<pre class=\"hljs collapse-false language-python\">from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy\r\n\r\nauto_wrap_policy = partial(\r\n    size_based_auto_wrap_policy, min_num_params=2_000_000\r\n)\r\nstrategy = FSDPStrategy(auto_wrap_policy=auto_wrap_policy)\r\nfabric = Fabric(accelerator=\"cuda\",\r\n    devices=4, strategy=strategy, precision=\"16-mixed\"\r\n)<\/pre>\n<p>This reduces the memory consumption further to 17.23 GB.<\/p>\n<div id=\"attachment_5648398\" style=\"width: 5287px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-5648398\" class=\"size-full wp-image-5648398\" src=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/07\/sharding_1_new.png\" alt=\"\" width=\"5277\" height=\"728\" srcset=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/07\/sharding_1_new.png 5277w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/07\/sharding_1_new-300x41.png 300w, 
https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/07\/sharding_1_new-1024x141.png 1024w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/07\/sharding_1_new-1536x212.png 1536w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/07\/sharding_1_new-2048x283.png 2048w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/07\/sharding_1_new-300x41@2x.png 600w\" sizes=\"(max-width: 5277px) 100vw, 5277px\" \/><p id=\"caption-attachment-5648398\" class=\"wp-caption-text\">Via code example <a href=\"https:\/\/github.com\/rasbt\/pytorch-memory-optim\/blob\/main\/08-10-vit32\/08c_fsdp-size-wrap.py\">08c_fsdp-size-wrap.py<\/a><\/p><\/div>\n<p class=\"md-end-block md-p\"><span class=\"md-pair-s \"><strong><span class=\"md-plain\">Understanding Data Parallelism and Tensor Parallelism<\/span><\/strong><\/span><\/p>\n<p class=\"md-end-block md-p md-focus\"><span class=\"md-plain md-expand\">In data parallelism, the mini-batch is divided, and a copy of the model is available on each of the GPUs. This process speeds up model training as multiple GPUs work in parallel. 
<\/span><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-5648354\" src=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/07\/data-para-new.png\" alt=\"\" width=\"674\" height=\"239\" srcset=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/07\/data-para-new.png 1470w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/07\/data-para-new-300x107.png 300w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/07\/data-para-new-1024x364.png 1024w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/07\/data-para-new-300x107@2x.png 600w\" sizes=\"(max-width: 674px) 100vw, 674px\" \/><\/p>\n<p class=\"md-end-block md-p\"><span class=\"md-plain md-expand\">Here&#8217;s how it works in a nutshell:<\/span><\/p>\n<ol class=\"ol-list\" start=\"\">\n<li class=\"md-list-item\">\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">The same model is replicated across all the GPUs.<\/span><\/p>\n<\/li>\n<li class=\"md-list-item\">\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">Each GPU is then fed a different subset of the input data (a different mini-batch).<\/span><\/p>\n<\/li>\n<li class=\"md-list-item\">\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">All GPUs independently perform forward and backward passes of the model, computing their own local gradients.<\/span><\/p>\n<\/li>\n<li class=\"md-list-item\">\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">Then, the gradients are collected and averaged across all GPUs.<\/span><\/p>\n<\/li>\n<li class=\"md-list-item\">\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">The averaged gradients are then used to update the model&#8217;s parameters.<\/span><\/p>\n<\/li>\n<\/ol>\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">The primary advantage of this approach is speed. 
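The gradient averaging in steps 3 to 5 can be sketched without any GPUs at all. The following toy model (a single weight, made up purely for illustration) shows why the averaged local gradients match the full-batch gradient:

```python
# Toy illustration of data parallelism's gradient averaging (steps 3 to 5),
# using a hypothetical one-weight model y = w * x with squared-error loss.
def local_gradient(w, batch):
    # Mean of dL/dw = 2 * (w * x - y) * x over one worker's mini-batch.
    grads = [2 * (w * x - y) * x for x, y in batch]
    return sum(grads) / len(grads)

w = 0.5
full_batch = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]

# Split the mini-batch across two simulated GPUs and average their gradients.
shard_a, shard_b = full_batch[:2], full_batch[2:]
avg_grad = (local_gradient(w, shard_a) + local_gradient(w, shard_b)) / 2

# Averaging equal-sized local gradients recovers the full-batch gradient,
# so the parameter update is the same as on a single device.
assert avg_grad == local_gradient(w, full_batch)
```

Because the averaged gradient equals the single-device gradient, data parallelism changes where the computation happens, not the optimization trajectory.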
Since each GPU is processing a unique mini-batch of data concurrently with the others, the model can be trained on more data in less time. This can significantly reduce the time required to train our model, especially when working with large datasets.<\/span><\/p>\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">However, data parallelism has some limitations. Most importantly, each GPU must have a complete copy of the model and its parameters. This places a limit on the size of the model we can train, as the model must fit within a single GPU&#8217;s memory &#8212; this is not feasible for modern ViTs or LLMs.<\/span><\/p>\n<p class=\"md-end-block md-p md-focus\"><span class=\"md-plain\">Unlike data parallelism, which involves splitting a mini-batch across multiple devices, tensor parallelism divides the model itself across GPUs. In data parallelism, every GPU needs to fit the entire model, which can be a limitation when training larger models. Tensor parallelism, on the other hand, allows for training models that might be too large for a single GPU by breaking up the model and distributing it across multiple devices.<\/span><\/p>\n<p>&nbsp;<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-5648352\" src=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/07\/tensor-para-1-new.png\" alt=\"\" width=\"647\" height=\"262\" srcset=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/07\/tensor-para-1-new.png 1220w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/07\/tensor-para-1-new-300x121.png 300w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/07\/tensor-para-1-new-1024x415.png 1024w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/07\/tensor-para-1-new-300x121@2x.png 600w\" sizes=\"(max-width: 647px) 100vw, 647px\" \/><\/p>\n<p>&nbsp;<\/p>\n<p>How does it work? Think of matrix multiplication. 
There are two ways to distribute it &#8212; by row or by column. For simplicity, let&#8217;s consider distribution by column. For instance, we can break down a large matrix multiplication operation into separate computations, each of which can be carried out on a different GPU, as shown in the figure below. The results are then concatenated to get the original result, effectively distributing the computational load.<\/p>\n<p>&nbsp;<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-5648353\" src=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/07\/tensor-para-2-new.png\" alt=\"\" width=\"687\" height=\"295\" srcset=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/07\/tensor-para-2-new.png 1216w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/07\/tensor-para-2-new-300x129.png 300w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/07\/tensor-para-2-new-1024x440.png 1024w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/07\/tensor-para-2-new-300x129@2x.png 600w\" sizes=\"(max-width: 687px) 100vw, 687px\" \/><\/p>\n<h2 id=\"toc10\" class=\"md-end-block md-heading md-focus\"><span class=\"md-plain md-expand\">9) Activation Checkpointing<\/span><\/h2>\n<p>To further minimize memory usage during neural network computations, we can add gradient checkpointing (also known as activation checkpointing). This method selectively eliminates certain layer activations during the forward pass and later recalculates them in the backward pass. This approach essentially trades some computation time for memory savings.<br \/>\nIn other words, a layer&#8217;s inputs and outputs are retained in memory after the forward pass, but any intermediate tensors that were involved in the computation within the module are released. 
When the backward pass is computed for these checkpointed modules, the previously cleared tensors are recalculated.<\/p>\n<p>We can use activation checkpointing by adding `activation_checkpointing=EncoderBlock` to our previously used FSDP strategy:<\/p>\n<pre class=\"hljs collapse-false language-python\">auto_wrap_policy = partial(\r\n    transformer_auto_wrap_policy, transformer_layer_cls={EncoderBlock}\r\n)\r\nstrategy = FSDPStrategy(auto_wrap_policy=auto_wrap_policy,\r\n    activation_checkpointing=EncoderBlock\r\n)<\/pre>\n<pre class=\"hljs collapse-false language-python\">fabric = Fabric(accelerator=\"cuda\", \r\n    devices=4, strategy=strategy\r\n)<\/pre>\n<p>This lowers the memory consumption from 17.23 GB to 9.03 GB. However, this slightly increased the runtime from 18.95 min to 22.58 min.<\/p>\n<div id=\"attachment_5648399\" style=\"width: 5312px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-5648399\" class=\"wp-image-5648399 size-full\" src=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/07\/sharding_2_new.png\" alt=\"Via code example 09_fsdp-with-act-checkpointing\" width=\"5302\" height=\"975\" srcset=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/07\/sharding_2_new.png 5302w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/07\/sharding_2_new-300x55.png 300w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/07\/sharding_2_new-1024x188.png 1024w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/07\/sharding_2_new-1536x282.png 1536w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/07\/sharding_2_new-2048x377.png 2048w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/07\/sharding_2_new-300x55@2x.png 600w\" sizes=\"(max-width: 5302px) 100vw, 5302px\" \/><p id=\"caption-attachment-5648399\" class=\"wp-caption-text\">Via code example <a 
href=\"https:\/\/github.com\/rasbt\/pytorch-memory-optim\/blob\/main\/08-10-vit32\/09_fsdp-act-checkp.py\">09_fsdp-with-act-checkpointing<\/a><\/p><\/div>\n<p>&nbsp;<\/p>\n<h2 id=\"toc11\" class=\"md-end-block md-heading md-focus\"><span class=\"md-plain md-expand\">10) Parameter Offloading<\/span><\/h2>\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">In addition to the FSDP strategy explained in the previous section, we can also offload optimizer parameters to the CPU, which we can enable by changing<\/span><\/p>\n<pre class=\"hljs collapse-false language-python\">auto_wrap_policy = partial(transformer_auto_wrap_policy,\r\n    transformer_layer_cls={EncoderBlock}\r\n)\r\nstrategy = FSDPStrategy(\r\n    auto_wrap_policy=auto_wrap_policy,\r\n    activation_checkpointing=EncoderBlock,\r\n)\r\nfabric = Fabric(accelerator=\"cuda\", devices=4,\r\n    strategy=strategy, precision=\"16-mixed\"\r\n)<\/pre>\n<p>to<\/p>\n<pre class=\"hljs collapse-false language-python\">auto_wrap_policy = partial(transformer_auto_wrap_policy,\r\n    transformer_layer_cls={EncoderBlock}\r\n)\r\nstrategy = FSDPStrategy(\r\n    auto_wrap_policy=auto_wrap_policy,\r\n    activation_checkpointing=EncoderBlock,\r\n    cpu_offload=True\r\n)\r\nfabric = Fabric(accelerator=\"cuda\",\r\n    devices=4, strategy=strategy, precision=\"16-mixed\"\r\n)<\/pre>\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">This reduces the memory consumption from 9.03 GB with activation checkpointing to 6.68 GB with additional CPU offloading. 
This substantially increased the runtime, however, from 22.58 min to 101.53 min.<br \/>\n<\/span><\/p>\n<div id=\"attachment_5649545\" style=\"width: 1125px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-5649545\" class=\"wp-image-5649545\" src=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/07\/multi-gpu-with-caption.png\" alt=\"\" width=\"1115\" height=\"313\" srcset=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/07\/multi-gpu-with-caption.png 2586w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/07\/multi-gpu-with-caption-300x84.png 300w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/07\/multi-gpu-with-caption-1024x287.png 1024w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/07\/multi-gpu-with-caption-1536x431.png 1536w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/07\/multi-gpu-with-caption-2048x575.png 2048w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/07\/multi-gpu-with-caption-300x84@2x.png 600w\" sizes=\"(max-width: 1115px) 100vw, 1115px\" \/><p id=\"caption-attachment-5649545\" class=\"wp-caption-text\">Via code example <a href=\"https:\/\/github.com\/rasbt\/pytorch-memory-optim\/blob\/main\/08-10-vit32\/10_fsdp-with-cpu-offload.py\">10_fsdp-cpu-offload.py<\/a><\/p><\/div>\n<p>&nbsp;<\/p>\n<h2 class=\"md-end-block md-heading md-focus\"><span id=\"toc12\" class=\"md-plain md-expand\">11) Putting it All Together &amp; Training an LLM<\/span><\/h2>\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">In the previous sections, we covered a lot of ground by optimizing a vision transformer. Of course, some of you may also want to know whether these techniques apply to LLMs. 
Of course, they do!<\/span><\/p>\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">We use many of these tricks in our <\/span><span class=\"md-meta-i-c md-link\"><a href=\"https:\/\/github.com\/Lightning-AI\/lit-llama\"><span class=\"md-plain\">Lit-LLaMA<\/span><\/a><\/span><span class=\"md-plain\"> and <\/span><span class=\"md-meta-i-c md-link\"><a href=\"https:\/\/github.com\/Lightning-AI\/lit-gpt\"><span class=\"md-plain\">Lit-GPT<\/span><\/a><\/span><span class=\"md-plain\"> repositories, which support LLaMA, Falcon, Pythia, and other popular models. Still, to create a more general example, we will be finetuning an LLM from the popular HF <\/span><span class=\"md-meta-i-c md-link\"><a href=\"https:\/\/github.com\/huggingface\/transformers\"><span class=\"md-pair-s\" spellcheck=\"false\"><code>transformers<\/code><\/span><\/a><\/span><span class=\"md-plain\"> library for classifying the sentiment of IMDb movie reviews.<\/span><\/p>\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">For example, if you use the above-mentioned techniques, you can train a <\/span><span class=\"md-meta-i-c md-link\"><a href=\"https:\/\/arxiv.org\/abs\/1910.01108\"><span class=\"md-plain\">DistilBERT<\/span><\/a><\/span><span class=\"md-plain\"> classifier using only 1.15 GB memory (<\/span><span class=\"md-meta-i-c md-link\"><a href=\"https:\/\/github.com\/rasbt\/pytorch-memory-optim\/blob\/main\/bonus_distilbert-after.py\"><span class=\"md-plain\">bonus_distilbert-after.py<\/span><\/a><\/span><span class=\"md-plain\">) instead of 3.99 GB (<\/span><span class=\"md-meta-i-c md-link\"><a href=\"https:\/\/github.com\/rasbt\/pytorch-memory-optim\/blob\/main\/bonus_bigbird-before.py\"><span class=\"md-plain\">bonus_bigbird-before.py<\/span><\/a><\/span><span class=\"md-plain\">).<\/span><\/p>\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">Or, more impressively, by applying the techniques to a <\/span><span class=\"md-meta-i-c md-link\"><a 
href=\"https:\/\/arxiv.org\/abs\/2007.14062\"><span class=\"md-plain\">BigBird<\/span><\/a><\/span><span class=\"md-plain\"> model from the transformers library, it consumes only 4.03 GB (<\/span><span class=\"md-meta-i-c md-link\"><a href=\"https:\/\/github.com\/rasbt\/pytorch-memory-optim\/blob\/main\/bonus_bigbird-after.py\"><span class=\"md-plain\">bonus_bigbird-after.py<\/span><\/a><\/span><span class=\"md-plain\">)!<\/span><\/p>\n<pre class=\"md-fences md-end-block ty-contain-cm modeLoaded\" lang=\"python\" spellcheck=\"false\"><span role=\"presentation\"> \u00a0<span class=\"cm-variable\">strategy<\/span> <span class=\"cm-operator\">=<\/span> <span class=\"cm-variable\">FSDPStrategy<\/span>(<\/span>\r\n<span role=\"presentation\"> \u00a0 \u00a0 \u00a0 \u00a0<span class=\"cm-variable\">cpu_offload<\/span><span class=\"cm-operator\">=<\/span><span class=\"cm-keyword\">True<\/span><\/span>\r\n<span role=\"presentation\"> \u00a0 )<\/span>\r\n<span role=\"presentation\">\u200b<\/span>\r\n<span role=\"presentation\"> \u00a0 <span class=\"cm-variable\">fabric<\/span> <span class=\"cm-operator\">=<\/span> <span class=\"cm-variable\">Fabric<\/span>(<\/span>\r\n<span role=\"presentation\"> \u00a0 \u00a0 \u00a0 \u00a0<span class=\"cm-variable\">accelerator<\/span><span class=\"cm-operator\">=<\/span><span class=\"cm-string\">\"cuda\"<\/span>,<\/span>\r\n<span role=\"presentation\"> \u00a0 \u00a0 \u00a0 \u00a0<span class=\"cm-variable\">devices<\/span><span class=\"cm-operator\">=<\/span><span class=\"cm-number\">4<\/span>,<\/span>\r\n<span role=\"presentation\"> \u00a0 \u00a0 \u00a0 \u00a0<span class=\"cm-variable\">strategy<\/span><span class=\"cm-operator\">=<\/span><span class=\"cm-variable\">strategy<\/span>,<\/span>\r\n<span role=\"presentation\"> \u00a0 \u00a0 \u00a0 \u00a0 <span class=\"cm-variable\">precision<\/span><span class=\"cm-operator\">=<\/span><span class=\"cm-string\">\"bf16-true\"<\/span><\/span>\r\n<span role=\"presentation\"> \u00a0 
)<\/span>\r\n<span role=\"presentation\">\u200b<\/span>\r\n<span role=\"presentation\"> \u00a0 <span class=\"cm-keyword cm-error\">with<\/span> <span class=\"cm-variable\">fabric<\/span>.<span class=\"cm-property\">init_module<\/span>():<\/span>\r\n<span role=\"presentation\"> \u00a0 \u00a0 \u00a0 <span class=\"cm-variable\">model<\/span> <span class=\"cm-operator\">=<\/span> <span class=\"cm-variable\">AutoModelForSequenceClassification<\/span>.<span class=\"cm-property\">from_pretrained<\/span>(<\/span>\r\n<span role=\"presentation\"> \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0<span class=\"cm-string\">\"google\/bigbird-roberta-base\"<\/span>, <span class=\"cm-variable\">num_labels<\/span><span class=\"cm-operator\">=<\/span><span class=\"cm-number\">2<\/span>)<\/span><\/pre>\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">(I would have included the performance without these techniques as a reference, but it&#8217;s not possible to run this model without the abovementioned optimizations.)<\/span><\/p>\n<h2 id=\"toc13\" class=\"md-end-block md-heading\"><span class=\"md-plain\">Conclusion<\/span><\/h2>\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">This article showcased 10 techniques to reduce the memory consumption of PyTorch models. When applying these techniques to a vision transformer, we reduced the memory consumption 20x on a single GPU. And we saw that tensor sharding across GPUs could even lower memory consumption. The same optimizations also enabled training a BigBird LLM using only 4 GB of peak GPU RAM.<\/span><\/p>\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">None of these techniques are model-specific and can be used with practically any PyTorch training script. And using the open-source Fabric library, most of these optimizations can be enabled with a single line of code.<\/span><\/p>\n<p class=\"md-end-block md-p md-focus\"><span class=\"md-plain\">If you found this article useful, please share it with your colleagues. 
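<\/span><\/p>\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">As a closing back-of-the-envelope sketch, the snippet below estimates how tensor sharding and lower precision compound to shrink the per-GPU memory needed just for a model&#8217;s parameters. The 128M parameter count is an assumed, illustrative model size (not a figure measured in this article), and the estimate ignores activations, gradients, and optimizer state, which is why real peak usage is much higher.<\/span><\/p>

```python
# Back-of-the-envelope sketch: memory for model *parameters* only.
# Activations, gradients, and optimizer state are ignored, so real peak
# usage is considerably higher. The 128M parameter count is assumed.

def per_gpu_param_gb(num_params: int, bytes_per_param: int, num_devices: int) -> float:
    """Parameter bytes per GPU when weights are sharded evenly across devices."""
    return num_params * bytes_per_param / num_devices / 1e9

NUM_PARAMS = 128_000_000  # hypothetical model size

fp32_single = per_gpu_param_gb(NUM_PARAMS, 4, 1)   # float32 (4 bytes), one GPU
bf16_sharded = per_gpu_param_gb(NUM_PARAMS, 2, 4)  # bfloat16 (2 bytes), sharded over 4 GPUs

print(f"fp32, 1 GPU:  {fp32_single:.3f} GB per GPU")   # 0.512 GB
print(f"bf16, 4 GPUs: {bf16_sharded:.3f} GB per GPU")  # 0.064 GB
```

\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">Halving the precision and quadrupling the device count cuts the per-GPU parameter footprint 8x in this sketch; CPU offloading lowers the GPU-resident share further still.<\/span><\/p>\n<p class=\"md-end-block md-p md-focus\"><span class=\"md-plain\">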
Also, if you have additional techniques I haven&#8217;t covered here, please feel free to reach out via social media (<\/span><span class=\"md-meta-i-c md-link\"><a href=\"https:\/\/twitter.com\/LightningAI\"><span class=\"md-plain\">@LightningAI<\/span><\/a><\/span><span class=\"md-plain\"> or <\/span><span class=\"md-meta-i-c md-link\"><a href=\"https:\/\/twitter.com\/rasbt\"><span class=\"md-plain\">@rasbt<\/span><\/a><\/span><span class=\"md-plain\">) or discuss this further in our <\/span><span class=\"md-meta-i-c md-link\"><a href=\"https:\/\/discord.com\/invite\/XncpTy7DSt\"><span class=\"md-plain\">Discord channel<\/span><\/a><\/span><span class=\"md-plain md-expand\">.<\/span><\/p>\n","protected":false},"author":16,"featured_media":5648355,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"inline_featured_image":false,"footnotes":"","_links_to":"","_links_to_target":""},"categories":[29,106,41],"tags":[179,188,235,51,236,234],"glossary":[232],"acf":{"additional_authors":false,"mathjax":false,"default_editor":true,"show_table_of_contents":true,"hide_from_archive":false,"content_type":"Blog Post","sticky":false,"custom_styles":"","code_embed":false,"tabs":false,"table_of_contents":"<h4>Table of Contents<\/h4>\n<ul>\n<li><a href=\"#toc1\">Introduction<\/a><\/li>\n<li><a 
href=\"#toc2\">Finetuning a Vision Transformer<\/a><\/li>\n<li><a href=\"#toc3\">Automatic Mixed-Precision<\/a><\/li>\n<li><a href=\"#toc4\">Lower-Precision Training<\/a><\/li>\n<li><a href=\"#toc5\">Reducing the Batchsize<\/a><\/li>\n<li><a href=\"#toc6\">Using Gradient Accumulation to Create Microbatches<\/a><\/li>\n<li><a href=\"#toc7\">Using a Leaner Optimizer<\/a><\/li>\n<li><a href=\"#toc8\">Creating the Model on the Target Device with Desired Precision<\/a><\/li>\n<li><a href=\"#toc9\">Distributed Training and Tensor Sharding<\/a><\/li>\n<li><a href=\"#toc10\">Activation Checkpointing<\/a><\/li>\n<li><a href=\"#toc11\">Parameter Offloading<\/a><\/li>\n<li><a href=\"#toc12\">Putting it All Together &amp; Training an LLM<\/a><\/li>\n<li style=\"border-bottom: 0px solid;\"><a href=\"#toc13\">Conclusion<\/a><\/li>\n<\/ul>\n<style>h2{scroll-margin-top:100px; scroll-padding-top:100px;}<\/style>\n"},"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v24.5 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Optimizing Memory Usage for Training LLMs and Vision Transformers in PyTorch - Lightning AI<\/title>\n<meta name=\"description\" content=\"This article provides a series of techniques that can lower memory consumption in PyTorch (when training vision transformers and LLMs) by approximately 20x without sacrificing modeling performance and prediction accuracy.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/lightning.ai\/pages\/community\/tutorial\/pytorch-memory-vit-llm\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Optimizing Memory Usage for Training LLMs and Vision Transformers in PyTorch - Lightning AI\" \/>\n<meta property=\"og:description\" content=\"This article provides a series of techniques that can lower memory 
consumption in PyTorch (when training vision transformers and LLMs) by approximately 20x without sacrificing modeling performance and prediction accuracy.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/lightning.ai\/pages\/community\/tutorial\/pytorch-memory-vit-llm\/\" \/>\n<meta property=\"og:site_name\" content=\"Lightning AI\" \/>\n<meta property=\"article:published_time\" content=\"2023-07-02T10:12:52+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2024-02-01T14:50:34+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/07\/pytorch-memory-hero.png\" \/>\n\t<meta property=\"og:image:width\" content=\"2572\" \/>\n\t<meta property=\"og:image:height\" content=\"1370\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"JP Hennessy\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@LightningAI\" \/>\n<meta name=\"twitter:site\" content=\"@LightningAI\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"JP Hennessy\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"17 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/lightning.ai\/pages\/community\/tutorial\/pytorch-memory-vit-llm\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/lightning.ai\/pages\/community\/tutorial\/pytorch-memory-vit-llm\/\"},\"author\":{\"name\":\"JP Hennessy\",\"@id\":\"https:\/\/lightning.ai\/pages\/#\/schema\/person\/2518f4d5541f8e98016f6289169141a6\"},\"headline\":\"Optimizing Memory Usage for Training LLMs and Vision Transformers in PyTorch\",\"datePublished\":\"2023-07-02T10:12:52+00:00\",\"dateModified\":\"2024-02-01T14:50:34+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/lightning.ai\/pages\/community\/tutorial\/pytorch-memory-vit-llm\/\"},\"wordCount\":3010,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\/\/lightning.ai\/pages\/#organization\"},\"image\":{\"@id\":\"https:\/\/lightning.ai\/pages\/community\/tutorial\/pytorch-memory-vit-llm\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/07\/pytorch-memory-hero.png\",\"keywords\":[\"fabric\",\"LLMs\",\"Memory-efficiency\",\"pytorch\",\"scaling\",\"vision transformers\"],\"articleSection\":[\"Blog\",\"Community\",\"Tutorials\"],\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\/\/lightning.ai\/pages\/community\/tutorial\/pytorch-memory-vit-llm\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/lightning.ai\/pages\/community\/tutorial\/pytorch-memory-vit-llm\/\",\"url\":\"https:\/\/lightning.ai\/pages\/community\/tutorial\/pytorch-memory-vit-llm\/\",\"name\":\"Optimizing Memory Usage for Training LLMs and Vision Transformers in PyTorch - Lightning 
AI\",\"isPartOf\":{\"@id\":\"https:\/\/lightning.ai\/pages\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/lightning.ai\/pages\/community\/tutorial\/pytorch-memory-vit-llm\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/lightning.ai\/pages\/community\/tutorial\/pytorch-memory-vit-llm\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/07\/pytorch-memory-hero.png\",\"datePublished\":\"2023-07-02T10:12:52+00:00\",\"dateModified\":\"2024-02-01T14:50:34+00:00\",\"description\":\"This article provides a series of techniques that can lower memory consumption in PyTorch (when training vision transformers and LLMs) by approximately 20x without sacrificing modeling performance and prediction accuracy.\",\"breadcrumb\":{\"@id\":\"https:\/\/lightning.ai\/pages\/community\/tutorial\/pytorch-memory-vit-llm\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/lightning.ai\/pages\/community\/tutorial\/pytorch-memory-vit-llm\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/lightning.ai\/pages\/community\/tutorial\/pytorch-memory-vit-llm\/#primaryimage\",\"url\":\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/07\/pytorch-memory-hero.png\",\"contentUrl\":\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/07\/pytorch-memory-hero.png\",\"width\":2572,\"height\":1370},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/lightning.ai\/pages\/community\/tutorial\/pytorch-memory-vit-llm\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/lightning.ai\/pages\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Optimizing Memory Usage for Training LLMs and Vision Transformers in PyTorch\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/lightning.ai\/pages\/#website\",\"url\":\"https:\/\/lightning.ai\/pages\/\",\"name\":\"Lightning 
AI\",\"description\":\"The platform for teams to build AI.\",\"publisher\":{\"@id\":\"https:\/\/lightning.ai\/pages\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/lightning.ai\/pages\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/lightning.ai\/pages\/#organization\",\"name\":\"Lightning AI\",\"url\":\"https:\/\/lightning.ai\/pages\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/lightning.ai\/pages\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/02\/image-17.png\",\"contentUrl\":\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/02\/image-17.png\",\"width\":1744,\"height\":856,\"caption\":\"Lightning AI\"},\"image\":{\"@id\":\"https:\/\/lightning.ai\/pages\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/x.com\/LightningAI\"]},{\"@type\":\"Person\",\"@id\":\"https:\/\/lightning.ai\/pages\/#\/schema\/person\/2518f4d5541f8e98016f6289169141a6\",\"name\":\"JP Hennessy\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/lightning.ai\/pages\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/28ade268218ae45f723b0b62499f527a?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/28ade268218ae45f723b0b62499f527a?s=96&d=mm&r=g\",\"caption\":\"JP Hennessy\"},\"url\":\"https:\/\/lightning.ai\/pages\/author\/jplightning-ai\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
-->","_links":{"self":[{"href":"https:\/\/lightning.ai\/pages\/wp-json\/wp\/v2\/posts\/5648335"}],"collection":[{"href":"https:\/\/lightning.ai\/pages\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/lightning.ai\/pages\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/lightning.ai\/pages\/wp-json\/wp\/v2\/users\/16"}],"replies":[{"embeddable":true,"href":"https:\/\/lightning.ai\/pages\/wp-json\/wp\/v2\/comments?post=5648335"}],"version-history":[{"count":0,"href":"https:\/\/lightning.ai\/pages\/wp-json\/wp\/v2\/posts\/5648335\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/lightning.ai\/pages\/wp-json\/wp\/v2\/media\/5648355"}],"wp:attachment":[{"href":"https:\/\/lightning.ai\/pages\/wp-json\/wp\/v2\/media?parent=5648335"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/lightning.ai\/pages\/wp-json\/wp\/v2\/categories?post=5648335"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/lightning.ai\/pages\/wp-json\/wp\/v2\/tags?post=5648335"},{"taxonomy":"glossary","embeddable":true,"href":"https:\/\/lightning.ai\/pages\/wp-json\/wp\/v2\/glossary?post=5648335"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}