{"id":5647630,"date":"2023-03-28T13:47:39","date_gmt":"2023-03-28T17:47:39","guid":{"rendered":"https:\/\/lightning.ai\/pages\/?p=5647630"},"modified":"2023-04-13T07:01:16","modified_gmt":"2023-04-13T11:01:16","slug":"gradient-accumulation","status":"publish","type":"post","link":"https:\/\/lightning.ai\/pages\/blog\/gradient-accumulation\/","title":{"rendered":"Finetuning LLMs on a Single GPU Using Gradient Accumulation"},"content":{"rendered":"<div class=\"takeaways card-glow p-4 my-4\"><h3 class=\"w-100 d-block\">Key takeaway<\/h3> Learn how to use gradient accumulation to train models with large batch sizes in order to work around hardware limitations when GPU memory is a concern. <\/div>\n<p>Previously, <a href=\"https:\/\/sebastianraschka.com\/blog\/2023\/pytorch-faster.html\">I shared an article using multi-GPU training strategies to speed up the finetuning of large language models<\/a>. Several of these strategies include mechanisms such as model or tensor sharding that distributes the model weights and computations across different devices to work around GPU memory limitations.<\/p>\n<p>However, many of us don&#8217;t have access to multi-GPU resources. This article therefore demonstrates a great workaround to train models with larger batch sizes when GPU memory is a concern: gradient accumulation.<\/p>\n<p>&nbsp;<\/p>\n<h2><strong>Let&#8217;s Finetune BLOOM for Classification<\/strong><\/h2>\n<p>Let&#8217;s suppose we are interested in adopting a recent pretrained large language model for a downstream task such as text classification. We are going to work with <a href=\"https:\/\/arxiv.org\/abs\/2211.05100\">BLOOM<\/a>, which is an open-source alternative to GPT-3. 
In particular, we are going to use a version of BLOOM that &#8220;only&#8221; has 560 million parameters &#8212; it should fit into the RAM of conventional GPUs without problems (for reference, the free tier of Google Colab has a GPU with 15 GB of RAM.)<\/p>\n<p>Once we start, however, we bump into problems: our memory explodes during training or finetuning; we find that the only way to train this model is using a batch size of 1.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-5647632\" src=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/03\/bloom-image-1.png\" alt=\"\" width=\"528\" height=\"352\" srcset=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/03\/bloom-image-1.png 1650w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/03\/bloom-image-1-300x200.png 300w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/03\/bloom-image-1-1024x683.png 1024w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/03\/bloom-image-1-1536x1024.png 1536w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/03\/bloom-image-1-300x200@2x.png 600w\" sizes=\"(max-width: 528px) 100vw, 528px\" \/><\/p>\n<p>&nbsp;<\/p>\n<div class=\"takeaways card-glow p-4 my-4\"><h3 class=\"w-100 d-block\">Note<\/h3> The code for finetuning BLOOM for a target classification task using a batch size of 1 is shown below. (You can also <a href=\"https:\/\/github.com\/rasbt\/gradient-accumulation-blog\/blob\/main\/src\/1_batchsize-1.py\">download the complete code from GitHub here<\/a>.)<\/p>\n<p>You can copy &amp; paste this code directly into Google Colab. 
However, you also have to drag and drop <a href=\"https:\/\/github.com\/rasbt\/gradient-accumulation-blog\/blob\/main\/src\/local_dataset_utilities.py\">the accompanying local_dataset_utilities.py<\/a> file into the same folder as we import some dataset utilities from this file.<\/p>\n<p><\/div>\n<pre class=\"code-shortcode dark-theme window- collapse-600 \" style=\"--height:600px\"><code class=\"language-python\">\n\n# pip install torch lightning matplotlib pandas torchmetrics watermark transformers datasets -U\n\nimport os<br \/>\nimport os.path as op<br \/>\nimport time\n\nfrom datasets import load_dataset<br \/>\nfrom lightning import Fabric<br \/>\nimport torch<br \/>\nfrom torch.utils.data import DataLoader<br \/>\nimport torchmetrics<br \/>\nfrom transformers import AutoTokenizer<br \/>\nfrom transformers import AutoModelForSequenceClassification<br \/>\nfrom watermark import watermark\n\nfrom local_dataset_utilities import download_dataset, load_dataset_into_to_dataframe, partition_dataset<br \/>\nfrom local_dataset_utilities import IMDBDataset\n\ndef tokenize_text(batch):<br \/>\n    return tokenizer(batch[\"text\"], truncation=True, padding=True, max_length=1024)\n\ndef train(num_epochs, model, optimizer, train_loader, val_loader, fabric):\n\n    for epoch in range(num_epochs):<br \/>\n        train_acc = torchmetrics.Accuracy(<br \/>\n            task=\"multiclass\", num_classes=2).to(fabric.device)\n\n        for batch_idx, batch in enumerate(train_loader):<br \/>\n            model.train()\n\n            ### FORWARD AND BACK PROP<br \/>\n            outputs = model(<br \/>\n                batch[\"input_ids\"],<br \/>\n                attention_mask=batch[\"attention_mask\"],<br \/>\n                labels=batch[\"label\"]<br \/>\n            ) \n\n            fabric.backward(outputs[\"loss\"])\n\n            ### UPDATE MODEL PARAMETERS<br \/>\n            optimizer.step()<br \/>\n            optimizer.zero_grad()\n\n            ### LOGGING<br \/>\n    
        if not batch_idx % 300:<br \/>\n                print(f\"Epoch: {epoch+1:04d}\/{num_epochs:04d} \"<br \/>\n                      f\"| Batch {batch_idx:04d}\/{len(train_loader):04d} \"<br \/>\n                      f\"| Loss: {outputs['loss']:.4f}\")\n\n            model.eval()<br \/>\n            with torch.no_grad():<br \/>\n                predicted_labels = torch.argmax(outputs[\"logits\"], 1)<br \/>\n                train_acc.update(predicted_labels, batch[\"label\"])\n\n        ### MORE LOGGING<br \/>\n        model.eval()<br \/>\n        with torch.no_grad():<br \/>\n            val_acc = torchmetrics.Accuracy(task=\"multiclass\", num_classes=2).to(fabric.device)<br \/>\n            for batch in val_loader:<br \/>\n                outputs = model(<br \/>\n                    batch[\"input_ids\"],<br \/>\n                    attention_mask=batch[\"attention_mask\"],<br \/>\n                    labels=batch[\"label\"]<br \/>\n                )<br \/>\n                predicted_labels = torch.argmax(outputs[\"logits\"], 1)<br \/>\n                val_acc.update(predicted_labels, batch[\"label\"])\n\n            print(f\"Epoch: {epoch+1:04d}\/{num_epochs:04d} \"<br \/>\n                  f\"| Train acc.: {train_acc.compute()*100:.2f}% \"<br \/>\n                  f\"| Val acc.: {val_acc.compute()*100:.2f}%\"<br \/>\n                  )<br \/>\n            train_acc.reset(), val_acc.reset()\n\nif __name__ == \"__main__\":\n\n    print(watermark(packages=\"torch,lightning,transformers\", python=True))<br \/>\n    print(\"Torch CUDA available?\", torch.cuda.is_available())<br \/>\n    device = \"cuda\" if torch.cuda.is_available() else \"cpu\"\n\n    torch.manual_seed(123)<br \/>\n    # torch.use_deterministic_algorithms(True)\n\n    ##########################<br \/>\n    ### 1 Loading the Dataset<br \/>\n    ##########################<br \/>\n    download_dataset()<br \/>\n    df = load_dataset_into_to_dataframe()<br \/>\n    if not 
(op.exists(\"train.csv\") and op.exists(\"val.csv\") and op.exists(\"test.csv\")):<br \/>\n        partition_dataset(df)\n\n    imdb_dataset = load_dataset(<br \/>\n        \"csv\",<br \/>\n        data_files={<br \/>\n            \"train\": \"train.csv\",<br \/>\n            \"validation\": \"val.csv\",<br \/>\n            \"test\": \"test.csv\",<br \/>\n        },<br \/>\n    )\n\n    #########################################<br \/>\n    ### 2 Tokenization and Numericalization<br \/>\n    #########################################\n\n    tokenizer = AutoTokenizer.from_pretrained(\"bigscience\/bloom-560m\", max_length=1024)<br \/>\n    print(\"Tokenizer input max length:\", tokenizer.model_max_length, flush=True)<br \/>\n    print(\"Tokenizer vocabulary size:\", tokenizer.vocab_size, flush=True)\n\n    print(\"Tokenizing ...\", flush=True)<br \/>\n    imdb_tokenized = imdb_dataset.map(tokenize_text, batched=True, batch_size=None)<br \/>\n    del imdb_dataset<br \/>\n    imdb_tokenized.set_format(\"torch\", columns=[\"input_ids\", \"attention_mask\", \"label\"])<br \/>\n    os.environ[\"TOKENIZERS_PARALLELISM\"] = \"false\"\n\n    #########################################<br \/>\n    ### 3 Set Up DataLoaders<br \/>\n    #########################################\n\n    train_dataset = IMDBDataset(imdb_tokenized, partition_key=\"train\")<br \/>\n    val_dataset = IMDBDataset(imdb_tokenized, partition_key=\"validation\")<br \/>\n    test_dataset = IMDBDataset(imdb_tokenized, partition_key=\"test\")\n\n    train_loader = DataLoader(<br \/>\n        dataset=train_dataset,<br \/>\n        batch_size=1,<br \/>\n        shuffle=True,<br \/>\n        num_workers=4,<br \/>\n        drop_last=True,<br \/>\n    )\n\n    val_loader = DataLoader(<br \/>\n        dataset=val_dataset,<br \/>\n        batch_size=1,<br \/>\n        num_workers=4,<br \/>\n        drop_last=True,<br \/>\n    )\n\n    test_loader = DataLoader(<br \/>\n        dataset=test_dataset,<br \/>\n        
batch_size=1,<br \/>\n        num_workers=2,<br \/>\n        drop_last=True,<br \/>\n    )\n\n    #########################################<br \/>\n    ### 4 Initializing the Model<br \/>\n    #########################################\n\n    fabric = Fabric(accelerator=\"cuda\", devices=1, precision=\"16-mixed\")<br \/>\n    fabric.launch()\n\n    model = AutoModelForSequenceClassification.from_pretrained(<br \/>\n        \"bigscience\/bloom-560m\", num_labels=2)\n\n    optimizer = torch.optim.Adam(model.parameters(), lr=5e-5)\n\n    model, optimizer = fabric.setup(model, optimizer)<br \/>\n    train_loader, val_loader, test_loader = fabric.setup_dataloaders(<br \/>\n        train_loader, val_loader, test_loader)\n\n    #########################################<br \/>\n    ### 5 Finetuning<br \/>\n    #########################################\n\n    start = time.time()<br \/>\n    train(<br \/>\n        num_epochs=1,<br \/>\n        model=model,<br \/>\n        optimizer=optimizer,<br \/>\n        train_loader=train_loader,<br \/>\n        val_loader=val_loader,<br \/>\n        fabric=fabric,<br \/>\n    )\n\n    end = time.time()<br \/>\n    elapsed = end-start<br \/>\n    print(f\"Time elapsed {elapsed\/60:.2f} min\")\n\n    with torch.no_grad():<br \/>\n        model.eval()<br \/>\n        test_acc = torchmetrics.Accuracy(task=\"multiclass\", num_classes=2).to(fabric.device)<br \/>\n        for batch in test_loader:<br \/>\n            outputs = model(<br \/>\n                batch[\"input_ids\"],<br \/>\n                attention_mask=batch[\"attention_mask\"],<br \/>\n                labels=batch[\"label\"]<br \/>\n            )<br \/>\n            predicted_labels = torch.argmax(outputs[\"logits\"], 1)<br \/>\n            test_acc.update(predicted_labels, batch[\"label\"])\n\n    print(f\"Test accuracy {test_acc.compute()*100:.2f}%\")<br \/>\n<\/code><div class=\"copy-button\"><button class=\"expand-button\">Expand<\/button><button 
class=\"copy\">Copy<\/button><\/div><\/pre>\n<p>&nbsp;<\/p>\n<p>I am using <a href=\"https:\/\/lightning.ai\/fabric\">Lightning Fabric<\/a> because it allows me to flexibly change the number of GPUs and the multi-GPU training strategy when running this code on different hardware. It also lets me enable mixed-precision training by adjusting only the precision flag. In this case, mixed-precision training can triple the training speed and reduce memory requirements by roughly 25%.<\/p>\n<p>The main code shown above is executed in the <code>if __name__ == \"__main__\"<\/code> context, which is recommended when running Python scripts for multi-GPU training with PyTorch &#8212; even though we are only using a single GPU, it&#8217;s a best practice that we adopt. The following three code sections within the <code>if __name__ == \"__main__\"<\/code> block take care of the data loading:<\/p>\n<p><code># 1 Loading the Dataset<\/code><\/p>\n<p><code># 2 Tokenization and Numericalization<\/code><\/p>\n<p><code># 3 Setting Up DataLoaders<\/code><\/p>\n<p>In section <code># 4 Initializing the Model<\/code>, we initialize the model. Then, in section <code># 5 Finetuning<\/code>, we call the train function, which is where things get interesting. In the <code>train(...)<\/code> function, we implement our standard PyTorch loop. 
An annotated version of the core training loop is shown below.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-5647633\" src=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/03\/Untitled-5.png\" alt=\"\" width=\"636\" height=\"589\" srcset=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/03\/Untitled-5.png 1356w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/03\/Untitled-5-300x278.png 300w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/03\/Untitled-5-1024x948.png 1024w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/03\/Untitled-5-300x278@2x.png 600w\" sizes=\"(max-width: 636px) 100vw, 636px\" \/><\/p>\n<p>The problem with batch sizes of 1 is that the gradient updates will be extremely noisy, as we can see based on the fluctuating training loss and poor test set performance below when we train the model:<\/p>\n<pre class=\"code-shortcode dark-theme window- collapse-false \" style=\"--height:falsepx\"><code class=\"language-python\">\n\n...<br \/>\ntorch : 2.0.0<br \/>\nlightning : 2.0.0<br \/>\ntransformers: 4.27.2\n\nTorch CUDA available? 
True<br \/>\n...<br \/>\nEpoch: 0001\/0001 | Batch 23700\/35000 | Loss: 0.0969<br \/>\nEpoch: 0001\/0001 | Batch 24000\/35000 | Loss: 1.9902<br \/>\nEpoch: 0001\/0001 | Batch 24300\/35000 | Loss: 0.0395<br \/>\nEpoch: 0001\/0001 | Batch 24600\/35000 | Loss: 0.2546<br \/>\nEpoch: 0001\/0001 | Batch 24900\/35000 | Loss: 0.1128<br \/>\nEpoch: 0001\/0001 | Batch 25200\/35000 | Loss: 0.2661<br \/>\nEpoch: 0001\/0001 | Batch 25500\/35000 | Loss: 0.0044<br \/>\nEpoch: 0001\/0001 | Batch 25800\/35000 | Loss: 0.0067<br \/>\nEpoch: 0001\/0001 | Batch 26100\/35000 | Loss: 0.0468<br \/>\nEpoch: 0001\/0001 | Batch 26400\/35000 | Loss: 1.7139<br \/>\nEpoch: 0001\/0001 | Batch 26700\/35000 | Loss: 0.9570<br \/>\nEpoch: 0001\/0001 | Batch 27000\/35000 | Loss: 0.1857<br \/>\nEpoch: 0001\/0001 | Batch 27300\/35000 | Loss: 0.0090<br \/>\nEpoch: 0001\/0001 | Batch 27600\/35000 | Loss: 0.9790<br \/>\nEpoch: 0001\/0001 | Batch 27900\/35000 | Loss: 0.0503<br \/>\nEpoch: 0001\/0001 | Batch 28200\/35000 | Loss: 0.2625<br \/>\nEpoch: 0001\/0001 | Batch 28500\/35000 | Loss: 0.1010<br \/>\nEpoch: 0001\/0001 | Batch 28800\/35000 | Loss: 0.0035<br \/>\nEpoch: 0001\/0001 | Batch 29100\/35000 | Loss: 0.0009<br \/>\nEpoch: 0001\/0001 | Batch 29400\/35000 | Loss: 0.0234<br \/>\nEpoch: 0001\/0001 | Batch 29700\/35000 | Loss: 0.8394<br \/>\nEpoch: 0001\/0001 | Batch 30000\/35000 | Loss: 0.9497<br \/>\nEpoch: 0001\/0001 | Batch 30300\/35000 | Loss: 0.1437<br \/>\nEpoch: 0001\/0001 | Batch 30600\/35000 | Loss: 0.1317<br \/>\nEpoch: 0001\/0001 | Batch 30900\/35000 | Loss: 0.0112<br \/>\nEpoch: 0001\/0001 | Batch 31200\/35000 | Loss: 0.0073<br \/>\nEpoch: 0001\/0001 | Batch 31500\/35000 | Loss: 0.7393<br \/>\nEpoch: 0001\/0001 | Batch 31800\/35000 | Loss: 0.0512<br \/>\nEpoch: 0001\/0001 | Batch 32100\/35000 | Loss: 0.1337<br \/>\nEpoch: 0001\/0001 | Batch 32400\/35000 | Loss: 1.1875<br \/>\nEpoch: 0001\/0001 | Batch 32700\/35000 | Loss: 0.2727<br \/>\nEpoch: 0001\/0001 | Batch 33000\/35000 | Loss: 
0.1545<br \/>\nEpoch: 0001\/0001 | Batch 33300\/35000 | Loss: 0.0022<br \/>\nEpoch: 0001\/0001 | Batch 33600\/35000 | Loss: 0.2681<br \/>\nEpoch: 0001\/0001 | Batch 33900\/35000 | Loss: 0.2467<br \/>\nEpoch: 0001\/0001 | Batch 34200\/35000 | Loss: 0.0620<br \/>\nEpoch: 0001\/0001 | Batch 34500\/35000 | Loss: 2.5039<br \/>\nEpoch: 0001\/0001 | Batch 34800\/35000 | Loss: 0.0131<br \/>\nEpoch: 0001\/0001 | Train acc.: 75.11% | Val acc.: 78.62%<br \/>\nTime elapsed 69.97 min<br \/>\nTest accuracy 78.53%\n\n<\/code><div class=\"copy-button\"><button class=\"expand-button\">Expand<\/button><button class=\"copy\">Copy<\/button><\/div><\/pre>\n<p>Since we don&#8217;t have multiple GPUs available for tensor sharding, what can we do to train the model with larger batch sizes?<\/p>\n<p>One workaround is gradient accumulation, where we modify the aforementioned training loop.<\/p>\n<p>&nbsp;<\/p>\n<div class=\"takeaways card-glow p-4 my-4\"><h3 class=\"w-100 d-block\">What is gradient accumulation?<\/h3> Gradient accumulation is a way to virtually increase the batch size during training, which is very useful when the available GPU memory is insufficient to accommodate the desired batch size. In gradient accumulation, gradients are computed for smaller batches and accumulated (usually summed or averaged) over multiple iterations instead of updating the model weights after every batch. Once the accumulated gradients reach the target &#8220;virtual&#8221; batch size, the model weights are updated with the accumulated gradients.<\/p>\n<p>To illustrate this, consider the updated PyTorch training loop below. 
(<a href=\"https:\/\/github.com\/rasbt\/gradient-accumulation-blog\/blob\/main\/src\/2_batchsize-16.py\">The full script is available here on GitHub.<\/a>) <\/div>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-5647634\" src=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/03\/Screenshot-2023-03-27-at-11.35.47-AM.png\" alt=\"\" width=\"593\" height=\"577\" srcset=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/03\/Screenshot-2023-03-27-at-11.35.47-AM.png 1356w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/03\/Screenshot-2023-03-27-at-11.35.47-AM-300x292.png 300w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/03\/Screenshot-2023-03-27-at-11.35.47-AM-1024x997.png 1024w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/03\/Screenshot-2023-03-27-at-11.35.47-AM-300x292@2x.png 600w\" sizes=\"(max-width: 593px) 100vw, 593px\" \/><\/p>\n<p>&nbsp;<\/p>\n<p>If we set <code>accumulation_steps<\/code> to 2, then <code>zero_grad()<\/code> and <code>optimizer.step()<\/code> will only be called every second batch. Consequently, running the modified training loop with <code>accumulation_steps=2<\/code> will have the same effect as doubling the batch size.<\/p>\n<p>For example, if we want to use a batch size of 256 but can only fit a batch size of 64 into GPU memory, we can perform gradient accumulation over four batches of size 64. (After processing all four batches, we will have the accumulated gradients equivalent to a single batch of size 256.) This allows us to effectively emulate a larger batch size without requiring larger GPU memory or tensor sharding across different devices.<\/p>\n<p>While gradient accumulation can help us train models with larger batch sizes, it does not reduce the total computation required. In fact, it can sometimes slow down convergence, as the weight updates are performed less frequently. 
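Stripped of the dataset, model, and Fabric details, the update schedule can be sketched in plain PyTorch. This is a self-contained toy example with a random model and random data (all names here are placeholders, not the script's), assuming summed gradients divided by `accumulation_steps` to match the average gradient of one large batch:

```python
import torch

# Toy model and data; only the update schedule matters here.
torch.manual_seed(123)
model = torch.nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = torch.nn.CrossEntropyLoss()

accumulation_steps = 4  # virtual batch size = 4 * actual batch size
batches = [(torch.randn(8, 10), torch.randint(0, 2, (8,))) for _ in range(12)]

num_updates = 0
for batch_idx, (features, labels) in enumerate(batches):
    loss = loss_fn(model(features), labels)
    # backward() sums gradients into .grad; dividing by accumulation_steps
    # makes the accumulated sum equal the average over the virtual batch.
    (loss / accumulation_steps).backward()
    if (batch_idx + 1) % accumulation_steps == 0:
        optimizer.step()       # one parameter update per virtual batch
        optimizer.zero_grad()  # reset the accumulated gradients
        num_updates += 1

print(f"{len(batches)} small batches -> {num_updates} parameter updates")
# prints: 12 small batches -> 3 parameter updates
```

Note that averaging rather than summing keeps the effective learning rate comparable to a single large-batch update.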
Nevertheless, it allows us to work around limitations where we have very small batch sizes that lead to noisy updates.<\/p>\n<p>For example, let&#8217;s now run the code from above, where we have a batch size of 1, with 16 accumulation steps to simulate a batch size of 16. You can download the code here.<\/p>\n<p>The output is as follows:<\/p>\n<pre class=\"code-shortcode dark-theme window- collapse-false \" style=\"--height:falsepx\"><code class=\"language-python\">\n\n...<br \/>\ntorch : 2.0.0<br \/>\nlightning : 2.0.0<br \/>\ntransformers: 4.27.2\n\nTorch CUDA available? True<br \/>\n...<br \/>\nEpoch: 0001\/0001 | Batch 23700\/35000 | Loss: 0.0168<br \/>\nEpoch: 0001\/0001 | Batch 24000\/35000 | Loss: 0.0006<br \/>\nEpoch: 0001\/0001 | Batch 24300\/35000 | Loss: 0.0152<br \/>\nEpoch: 0001\/0001 | Batch 24600\/35000 | Loss: 0.0003<br \/>\nEpoch: 0001\/0001 | Batch 24900\/35000 | Loss: 0.0623<br \/>\nEpoch: 0001\/0001 | Batch 25200\/35000 | Loss: 0.0010<br \/>\nEpoch: 0001\/0001 | Batch 25500\/35000 | Loss: 0.0001<br \/>\nEpoch: 0001\/0001 | Batch 25800\/35000 | Loss: 0.0047<br \/>\nEpoch: 0001\/0001 | Batch 26100\/35000 | Loss: 0.0004<br \/>\nEpoch: 0001\/0001 | Batch 26400\/35000 | Loss: 0.1016<br \/>\nEpoch: 0001\/0001 | Batch 26700\/35000 | Loss: 0.0021<br \/>\nEpoch: 0001\/0001 | Batch 27000\/35000 | Loss: 0.0015<br \/>\nEpoch: 0001\/0001 | Batch 27300\/35000 | Loss: 0.0008<br \/>\nEpoch: 0001\/0001 | Batch 27600\/35000 | Loss: 0.0060<br \/>\nEpoch: 0001\/0001 | Batch 27900\/35000 | Loss: 0.0001<br \/>\nEpoch: 0001\/0001 | Batch 28200\/35000 | Loss: 0.0426<br \/>\nEpoch: 0001\/0001 | Batch 28500\/35000 | Loss: 0.0012<br \/>\nEpoch: 0001\/0001 | Batch 28800\/35000 | Loss: 0.0025<br \/>\nEpoch: 0001\/0001 | Batch 29100\/35000 | Loss: 0.0025<br \/>\nEpoch: 0001\/0001 | Batch 29400\/35000 | Loss: 0.0000<br \/>\nEpoch: 0001\/0001 | Batch 29700\/35000 | Loss: 0.0495<br \/>\nEpoch: 0001\/0001 | Batch 30000\/35000 | Loss: 0.0164<br \/>\nEpoch: 0001\/0001 | Batch 
30300\/35000 | Loss: 0.0067<br \/>\nEpoch: 0001\/0001 | Batch 30600\/35000 | Loss: 0.0037<br \/>\nEpoch: 0001\/0001 | Batch 30900\/35000 | Loss: 0.0005<br \/>\nEpoch: 0001\/0001 | Batch 31200\/35000 | Loss: 0.0013<br \/>\nEpoch: 0001\/0001 | Batch 31500\/35000 | Loss: 0.0112<br \/>\nEpoch: 0001\/0001 | Batch 31800\/35000 | Loss: 0.0053<br \/>\nEpoch: 0001\/0001 | Batch 32100\/35000 | Loss: 0.0012<br \/>\nEpoch: 0001\/0001 | Batch 32400\/35000 | Loss: 0.1365<br \/>\nEpoch: 0001\/0001 | Batch 32700\/35000 | Loss: 0.0210<br \/>\nEpoch: 0001\/0001 | Batch 33000\/35000 | Loss: 0.0374<br \/>\nEpoch: 0001\/0001 | Batch 33300\/35000 | Loss: 0.0007<br \/>\nEpoch: 0001\/0001 | Batch 33600\/35000 | Loss: 0.0341<br \/>\nEpoch: 0001\/0001 | Batch 33900\/35000 | Loss: 0.0259<br \/>\nEpoch: 0001\/0001 | Batch 34200\/35000 | Loss: 0.0005<br \/>\nEpoch: 0001\/0001 | Batch 34500\/35000 | Loss: 0.4792<br \/>\nEpoch: 0001\/0001 | Batch 34800\/35000 | Loss: 0.0003<br \/>\nEpoch: 0001\/0001 | Train acc.: 78.67% | Val acc.: 87.28%<br \/>\nTime elapsed 51.37 min<br \/>\nTest accuracy 87.37%\n\n<\/code><div class=\"copy-button\"><button class=\"expand-button\">Expand<\/button><button class=\"copy\">Copy<\/button><\/div><\/pre>\n<p>As we can see, based on the results above, the loss fluctuates less than before. In addition, the test set performance increased by almost 9 percentage points (from 78.53% to 87.37%)! We are only iterating through the training set once, so each training example is only encountered a single time. Training the model for multiple epochs can further improve the predictive performance, but I&#8217;ll leave this as an exercise for you to try out (and let me know how it goes on <a href=\"https:\/\/discord.gg\/tfXFetEZxv\">Discord<\/a>!).<\/p>\n<p>You may have also noticed that this code executed faster than the code we used previously with a batch size of 1. With a virtual batch size of 16 via gradient accumulation, we still have the same number of forward passes. 
However, since we update the model only every sixteenth batch, we perform far fewer weight updates, which lets us iterate through the examples in a single epoch faster.<\/p>\n<p>&nbsp;<\/p>\n<h2><strong>Conclusion<\/strong><\/h2>\n<p>Gradient accumulation is a technique that simulates a larger batch size by accumulating gradients from multiple small batches before performing a weight update. This technique can be helpful in scenarios where the available memory is limited, and the batch size that can fit in memory is small.<\/p>\n<p>However, consider a scenario in which you can run the desired batch size in the first place, meaning the available memory is large enough to accommodate it. In that case, gradient accumulation may not be necessary. In fact, running a larger batch size can be more efficient because it allows for more parallelism and reduces the number of weight updates required to train the model.<\/p>\n<p>In summary, gradient accumulation can be a useful technique for reducing the noise that small batch sizes introduce into the gradient updates. It&#8217;s a simple yet effective technique that lets us work around hardware limitations.<\/p>\n<p>For reference, all code accompanying this blog post is available <a href=\"https:\/\/github.com\/rasbt\/gradient-accumulation-blog\/tree\/main\/src\">here<\/a> on GitHub.<\/p>\n<p>&nbsp;<\/p>\n<h2><strong>PS: Can we make this run even faster?<\/strong><\/h2>\n<p>Yes! We can make it run even faster using <code>torch.compile<\/code>, introduced in PyTorch 2.0. 
All it takes is adding one line, <code>model = torch.compile(model)<\/code>, as shown in the figure below.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-5647655\" src=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/03\/Figure1-1024x712.png\" alt=\"\" width=\"586\" height=\"408\" srcset=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/03\/Figure1-1024x712.png 1024w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/03\/Figure1-300x208.png 300w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/03\/Figure1-1536x1067.png 1536w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/03\/Figure1.png 1586w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/03\/Figure1-300x208@2x.png 600w\" sizes=\"(max-width: 586px) 100vw, 586px\" \/><a href=\"https:\/\/github.com\/rasbt\/gradient-accumulation-blog\/blob\/main\/src\/3_batchsize-8.py\">The full script is available on GitHub.<\/a><\/p>\n<p>In this case, <code>torch.compile<\/code> shaves roughly another eight minutes off the training time without impacting the modeling performance:<\/p>\n<pre class=\"code-shortcode dark-theme window- collapse-false \" style=\"--height:falsepx\"><code class=\"language-python\"><br \/>\nEpoch: 0001\/0001 | Batch 26400\/35000 | Loss: 0.0320<br \/>\nEpoch: 0001\/0001 | Batch 26700\/35000 | Loss: 0.0010<br \/>\nEpoch: 0001\/0001 | Batch 27000\/35000 | Loss: 0.0006<br \/>\nEpoch: 0001\/0001 | Batch 27300\/35000 | Loss: 0.0015<br \/>\nEpoch: 0001\/0001 | Batch 27600\/35000 | Loss: 0.0157<br \/>\nEpoch: 0001\/0001 | Batch 27900\/35000 | Loss: 0.0015<br \/>\nEpoch: 0001\/0001 | Batch 28200\/35000 | Loss: 0.0540<br \/>\nEpoch: 0001\/0001 | Batch 28500\/35000 | Loss: 0.0035<br \/>\nEpoch: 0001\/0001 | Batch 28800\/35000 | Loss: 0.0016<br \/>\nEpoch: 0001\/0001 | Batch 29100\/35000 | Loss: 0.0015<br \/>\nEpoch: 0001\/0001 | Batch 29400\/35000 | Loss: 0.0008<br \/>\nEpoch: 0001\/0001 | Batch 
29700\/35000 | Loss: 0.0877<br \/>\nEpoch: 0001\/0001 | Batch 30000\/35000 | Loss: 0.0232<br \/>\nEpoch: 0001\/0001 | Batch 30300\/35000 | Loss: 0.0014<br \/>\nEpoch: 0001\/0001 | Batch 30600\/35000 | Loss: 0.0032<br \/>\nEpoch: 0001\/0001 | Batch 30900\/35000 | Loss: 0.0004<br \/>\nEpoch: 0001\/0001 | Batch 31200\/35000 | Loss: 0.0062<br \/>\nEpoch: 0001\/0001 | Batch 31500\/35000 | Loss: 0.0032<br \/>\nEpoch: 0001\/0001 | Batch 31800\/35000 | Loss: 0.0066<br \/>\nEpoch: 0001\/0001 | Batch 32100\/35000 | Loss: 0.0017<br \/>\nEpoch: 0001\/0001 | Batch 32400\/35000 | Loss: 0.1485<br \/>\nEpoch: 0001\/0001 | Batch 32700\/35000 | Loss: 0.0324<br \/>\nEpoch: 0001\/0001 | Batch 33000\/35000 | Loss: 0.0155<br \/>\nEpoch: 0001\/0001 | Batch 33300\/35000 | Loss: 0.0007<br \/>\nEpoch: 0001\/0001 | Batch 33600\/35000 | Loss: 0.0049<br \/>\nEpoch: 0001\/0001 | Batch 33900\/35000 | Loss: 0.1170<br \/>\nEpoch: 0001\/0001 | Batch 34200\/35000 | Loss: 0.0002<br \/>\nEpoch: 0001\/0001 | Batch 34500\/35000 | Loss: 0.4201<br \/>\nEpoch: 0001\/0001 | Batch 34800\/35000 | Loss: 0.0018<br \/>\nEpoch: 0001\/0001 | Train acc.: 78.39% | Val acc.: 86.84%<br \/>\nTime elapsed 43.33 min<br \/>\nTest accuracy 87.91%<br \/>\n<\/code><div class=\"copy-button\"><button class=\"expand-button\">Expand<\/button><button class=\"copy\">Copy<\/button><\/div><\/pre>\n<p>Note that the slight accuracy improvement compared to before is likely due to randomness.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-5647656\" src=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/03\/training-time-1-1024x683.png\" alt=\"\" width=\"609\" height=\"406\" srcset=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/03\/training-time-1-1024x683.png 1024w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/03\/training-time-1-300x200.png 300w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/03\/training-time-1.png 1200w, 
https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/03\/training-time-1-300x200@2x.png 600w\" sizes=\"(max-width: 609px) 100vw, 609px\" \/><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Previously, I shared an article using multi-GPU training strategies to speed up the finetuning of large language models. Several of these strategies include mechanisms such as model or tensor sharding that distributes the model weights and computations across different devices to work around GPU memory limitations. However, many of us don&#8217;t have access to multi-GPU<a class=\"excerpt-read-more\" href=\"https:\/\/lightning.ai\/pages\/blog\/gradient-accumulation\/\" title=\"ReadFinetuning LLMs on a Single GPU Using Gradient Accumulation\">&#8230; Read more &raquo;<\/a><\/p>\n","protected":false},"author":16,"featured_media":5647635,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"inline_featured_image":false,"footnotes":"","_links_to":"","_links_to_target":""},"categories":[29,41],"tags":[96,186,185,184,97],"glossary":[],"acf":{"additional_authors":false,"hide_from_archive":false,"content_type":"Blog Post","sticky":false,"default_editor":true,"show_table_of_contents":false,"custom_styles":""},"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v24.5 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Finetuning LLMs on a Single GPU Using Gradient Accumulation<\/title>\n<meta name=\"description\" content=\"Learn how to leverage gradient accumulation in order to train large neural networks while working around hardware limitations.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/lightning.ai\/pages\/blog\/gradient-accumulation\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" 
content=\"Finetuning LLMs on a Single GPU Using Gradient Accumulation\" \/>\n<meta property=\"og:description\" content=\"Learn how to leverage gradient accumulation in order to train large neural networks while working around hardware limitations.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/lightning.ai\/pages\/blog\/gradient-accumulation\/\" \/>\n<meta property=\"og:site_name\" content=\"Lightning AI\" \/>\n<meta property=\"article:published_time\" content=\"2023-03-28T17:47:39+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2023-04-13T11:01:16+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/03\/grad-accumulation.png\" \/>\n\t<meta property=\"og:image:width\" content=\"1450\" \/>\n\t<meta property=\"og:image:height\" content=\"750\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"JP Hennessy\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@LightningAI\" \/>\n<meta name=\"twitter:site\" content=\"@LightningAI\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"JP Hennessy\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"13 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/lightning.ai\/pages\/blog\/gradient-accumulation\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/lightning.ai\/pages\/blog\/gradient-accumulation\/\"},\"author\":{\"name\":\"JP Hennessy\",\"@id\":\"https:\/\/lightning.ai\/pages\/#\/schema\/person\/2518f4d5541f8e98016f6289169141a6\"},\"headline\":\"Finetuning LLMs on a Single GPU Using Gradient Accumulation\",\"datePublished\":\"2023-03-28T17:47:39+00:00\",\"dateModified\":\"2023-04-13T11:01:16+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/lightning.ai\/pages\/blog\/gradient-accumulation\/\"},\"wordCount\":2215,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\/\/lightning.ai\/pages\/#organization\"},\"image\":{\"@id\":\"https:\/\/lightning.ai\/pages\/blog\/gradient-accumulation\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/03\/grad-accumulation.png\",\"keywords\":[\"ai\",\"finetuning\",\"gradient accumulation\",\"llm\",\"ml\"],\"articleSection\":[\"Blog\",\"Tutorials\"],\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\/\/lightning.ai\/pages\/blog\/gradient-accumulation\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/lightning.ai\/pages\/blog\/gradient-accumulation\/\",\"url\":\"https:\/\/lightning.ai\/pages\/blog\/gradient-accumulation\/\",\"name\":\"Finetuning LLMs on a Single GPU Using Gradient 
Accumulation\",\"isPartOf\":{\"@id\":\"https:\/\/lightning.ai\/pages\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/lightning.ai\/pages\/blog\/gradient-accumulation\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/lightning.ai\/pages\/blog\/gradient-accumulation\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/03\/grad-accumulation.png\",\"datePublished\":\"2023-03-28T17:47:39+00:00\",\"dateModified\":\"2023-04-13T11:01:16+00:00\",\"description\":\"Learn how to leverage gradient accumulation in order to train large neural networks while working around hardware limitations.\",\"breadcrumb\":{\"@id\":\"https:\/\/lightning.ai\/pages\/blog\/gradient-accumulation\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/lightning.ai\/pages\/blog\/gradient-accumulation\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/lightning.ai\/pages\/blog\/gradient-accumulation\/#primaryimage\",\"url\":\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/03\/grad-accumulation.png\",\"contentUrl\":\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/03\/grad-accumulation.png\",\"width\":1450,\"height\":750},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/lightning.ai\/pages\/blog\/gradient-accumulation\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/lightning.ai\/pages\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Finetuning LLMs on a Single GPU Using Gradient Accumulation\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/lightning.ai\/pages\/#website\",\"url\":\"https:\/\/lightning.ai\/pages\/\",\"name\":\"Lightning AI\",\"description\":\"The platform for teams to build 
AI.\",\"publisher\":{\"@id\":\"https:\/\/lightning.ai\/pages\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/lightning.ai\/pages\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/lightning.ai\/pages\/#organization\",\"name\":\"Lightning AI\",\"url\":\"https:\/\/lightning.ai\/pages\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/lightning.ai\/pages\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/02\/image-17.png\",\"contentUrl\":\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/02\/image-17.png\",\"width\":1744,\"height\":856,\"caption\":\"Lightning AI\"},\"image\":{\"@id\":\"https:\/\/lightning.ai\/pages\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/x.com\/LightningAI\"]},{\"@type\":\"Person\",\"@id\":\"https:\/\/lightning.ai\/pages\/#\/schema\/person\/2518f4d5541f8e98016f6289169141a6\",\"name\":\"JP Hennessy\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/lightning.ai\/pages\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/28ade268218ae45f723b0b62499f527a?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/28ade268218ae45f723b0b62499f527a?s=96&d=mm&r=g\",\"caption\":\"JP Hennessy\"},\"url\":\"https:\/\/lightning.ai\/pages\/author\/jplightning-ai\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
-->","_links":{"self":[{"href":"https:\/\/lightning.ai\/pages\/wp-json\/wp\/v2\/posts\/5647630"}],"collection":[{"href":"https:\/\/lightning.ai\/pages\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/lightning.ai\/pages\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/lightning.ai\/pages\/wp-json\/wp\/v2\/users\/16"}],"replies":[{"embeddable":true,"href":"https:\/\/lightning.ai\/pages\/wp-json\/wp\/v2\/comments?post=5647630"}],"version-history":[{"count":0,"href":"https:\/\/lightning.ai\/pages\/wp-json\/wp\/v2\/posts\/5647630\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/lightning.ai\/pages\/wp-json\/wp\/v2\/media\/5647635"}],"wp:attachment":[{"href":"https:\/\/lightning.ai\/pages\/wp-json\/wp\/v2\/media?parent=5647630"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/lightning.ai\/pages\/wp-json\/wp\/v2\/categories?post=5647630"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/lightning.ai\/pages\/wp-json\/wp\/v2\/tags?post=5647630"},{"taxonomy":"glossary","embeddable":true,"href":"https:\/\/lightning.ai\/pages\/wp-json\/wp\/v2\/glossary?post=5647630"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}