{"id":5649055,"date":"2023-10-12T21:58:40","date_gmt":"2023-10-13T01:58:40","guid":{"rendered":"https:\/\/lightning.ai\/pages\/?p=5649055"},"modified":"2023-10-16T15:07:34","modified_gmt":"2023-10-16T19:07:34","slug":"lora-insights","status":"publish","type":"post","link":"https:\/\/lightning.ai\/pages\/community\/lora-insights\/","title":{"rendered":"Finetuning LLMs with LoRA and QLoRA: Insights from Hundreds of Experiments"},"content":{"rendered":"<div class=\"takeaways card-glow p-4 my-4\"><h3 class=\"w-100 d-block\">Takeaways<\/h3>LoRA is one of the most widely used, parameter-efficient finetuning techniques for training custom LLMs. From saving memory with QLoRA to selecting the optimal LoRA settings, this article provides practical insights for those interested in applying it.<\/div>\n<p>&nbsp;<\/p>\n<h2 id=\"toc1\"><span style=\"font-weight: 400;\">Introduction: Getting the Most out of LoRA<\/span><\/h2>\n<p><span style=\"font-weight: 400;\">I&#8217;ve run hundreds, if not thousands, of experiments involving LoRA over the past few months. A few weeks ago, I took the time to delve deeper into some of the hyperparameter choices.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">This is more of an experimental diary presented in sequential order. I hope it proves useful to some. 
Specifically, I aim to address questions about the value of QLoRA, whether to replace AdamW with SGD, the potential use of a scheduler, and how to adjust the LoRA hyperparameters.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">There&#8217;s a lot to discuss on the experimental side, so I&#8217;ll keep the introduction to LoRA brief.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">LoRA, short for Low-Rank Adaptation (<\/span><a href=\"https:\/\/arxiv.org\/abs\/2106.09685\"><span style=\"font-weight: 400;\">Hu et al 2021<\/span><\/a><span style=\"font-weight: 400;\">), adds a small number of trainable parameters to the model while the original model parameters remain frozen.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">LoRA decomposes a weight matrix into two smaller weight matrices, as illustrated below, to approximate full supervised finetuning in a more parameter-efficient manner.<\/span><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-5649062 aligncenter\" src=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/10\/lora-expimage7.png\" alt=\"\" width=\"684\" height=\"404\" srcset=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/10\/lora-expimage7.png 1002w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/10\/lora-expimage7-300x177.png 300w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/10\/lora-expimage7-300x177@2x.png 600w\" sizes=\"(max-width: 684px) 100vw, 684px\" \/><\/p>\n<p><span style=\"font-weight: 400;\">For more details about LoRA, please see my in-depth article <\/span><a href=\"https:\/\/lightning.ai\/pages\/community\/tutorial\/lora-llm\/\"><span style=\"font-weight: 400;\">Parameter-Efficient LLM Finetuning With Low-Rank Adaptation (LoRA)<\/span><\/a><span style=\"font-weight: 400;\">.<\/span><\/p>\n<p>The topics we are going to cover in this article are organized as follows:<\/p>\n<p>1. Evaluation Tasks and Dataset<br \/>\n2. 
Code Framework<br \/>\n3. Choosing a Good Base Model<br \/>\n4. Evaluating the LoRA Defaults<br \/>\n5. Memory Savings with QLoRA<br \/>\n6. Learning Rate Schedulers and SGD<br \/>\n7. Iterating Over the Dataset Multiple Times<br \/>\n8. LoRA Hyperparameter Tuning Part 1: LoRA for All Layers<br \/>\n9. LoRA Hyperparameter Tuning Part 2: Increasing R<br \/>\n10. LoRA Hyperparameter Tuning Part 3: Changing Alpha<br \/>\n11. LoRA Hyperparameter Tuning Part 4: Very Large R<br \/>\n12. Leaderboard Submission<br \/>\n13. Conclusion<\/p>\n<p>&nbsp;<\/p>\n<h2 id=\"toc2\">Evaluation Tasks and Dataset<\/h2>\n<p><span style=\"font-weight: 400;\">The focus of this article is on selecting the optimal settings. To stay within a reasonable scope, I&#8217;m keeping the dataset fixed and focusing solely on supervised instruction-finetuning of LLMs. (Modifications to the dataset or finetuning for classification might be addressed in future articles.)<\/span><\/p>\n<p><span style=\"font-weight: 400;\">For the model evaluation, I selected a small subset of tasks from Eleuther AI&#8217;s <\/span><a href=\"https:\/\/github.com\/EleutherAI\/lm-evaluation-harness\/tree\/master\"><span style=\"font-weight: 400;\">Evaluation Harness<\/span><\/a><span style=\"font-weight: 400;\">, including <\/span><a href=\"https:\/\/github.com\/sylinrl\/TruthfulQA\"><span style=\"font-weight: 400;\">TruthfulQA<\/span><\/a><span style=\"font-weight: 400;\">, <\/span><a href=\"https:\/\/github.com\/alexwarstadt\/blimp\"><span style=\"font-weight: 400;\">BLiMP Causative,<\/span><\/a> <a href=\"https:\/\/github.com\/hendrycks\/test\"><span style=\"font-weight: 400;\">MMLU Global Facts<\/span><\/a><span style=\"font-weight: 400;\">, and simple arithmetic tasks with two digits (arithmetic 2ds) and four digits (arithmetic 4ds).<\/span><\/p>\n<p><span style=\"font-weight: 400;\">In each benchmark, the model performance score is normalized between 0 and 1, where 1 is a perfect score. 
TruthfulQA reports two scores, which are defined as follows:<\/span><\/p>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">MC1 (Single-true): Given a question and 4-5 answer choices, select the only correct answer. The model&#8217;s selection is the answer choice to which it assigns the highest log-probability of completion following the question, independent of the other answer choices. The score is the simple accuracy across all questions.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">MC2 (Multi-true): Given a question and multiple true \/ false reference answers, the score is the normalized total probability assigned to the set of true answers.<\/span><\/li>\n<\/ul>\n<p><span style=\"font-weight: 400;\">For reference, the 175B GPT-3 model has TruthfulQA MC1 and MC2 values of 0.21 and 0.33, respectively.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Below are two examples to illustrate the difference between arithmetic 2ds and arithmetic 4ds:<\/span><\/p>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Arithmetic 2ds: &#8220;What is 59 minus 38&#8221;. &#8220;21&#8221;.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Arithmetic 4ds: &#8220;What is 2762 plus 2751&#8221;. &#8220;5513&#8221;.<\/span><\/li>\n<\/ul>\n<p><span style=\"font-weight: 400;\">As mentioned above, I kept the dataset fixed, using the well-studied or rather commonly used <\/span><a href=\"https:\/\/github.com\/gururise\/AlpacaDataCleaned\"><span style=\"font-weight: 400;\">Alpaca dataset<\/span><\/a><span style=\"font-weight: 400;\"> for supervised instruction finetuning. Of course, many other datasets are available for instruction finetuning, including LIMA, Dolly, LongForm, FLAN, and more. 
However, exploring training on multiple datasets and dataset mixes will be an interesting topic for future studies.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">The Alpaca dataset consists of approximately 50k instruction-response pairs for training with a median length of 110 tokens for the input size (using the Llama 2 <\/span><a href=\"https:\/\/github.com\/google\/sentencepiece\"><span style=\"font-weight: 400;\">SentencePiece<\/span><\/a><span style=\"font-weight: 400;\"> tokenizer), as shown in the histogram below.<\/span><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-5649063 aligncenter\" src=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/10\/lora-expimage1.jpg\" alt=\"\" width=\"643\" height=\"482\" srcset=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/10\/lora-expimage1.jpg 1280w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/10\/lora-expimage1-300x225.jpg 300w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/10\/lora-expimage1-1024x768.jpg 1024w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/10\/lora-expimage1-300x225@2x.jpg 600w\" sizes=\"(max-width: 643px) 100vw, 643px\" \/><\/p>\n<p>The dataset tasks themselves can be structured as shown in the figure below.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-5649064\" src=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/10\/lora-expimage5.jpg\" alt=\"\" width=\"570\" height=\"425\" srcset=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/10\/lora-expimage5.jpg 1140w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/10\/lora-expimage5-300x224.jpg 300w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/10\/lora-expimage5-1024x764.jpg 1024w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/10\/lora-expimage5-300x224@2x.jpg 600w\" sizes=\"(max-width: 570px) 100vw, 
570px\" \/><\/p>\n<p>&nbsp;<\/p>\n<h2 id=\"toc3\">Code Framework<\/h2>\n<p><span style=\"font-weight: 400;\">The custom LLM finetuning code I used for this article is based on the open-source <\/span><a href=\"https:\/\/github.com\/Lightning-AI\/lit-gpt\"><span style=\"font-weight: 400;\">Lit-GPT repository<\/span><\/a><span style=\"font-weight: 400;\">. To keep the preamble of this article brief, I won&#8217;t go into the usage details, but you can find a more detailed guide in the Lit-GPT tutorials section <\/span><a href=\"https:\/\/github.com\/Lightning-AI\/lit-gpt\/tree\/main\/tutorials\"><span style=\"font-weight: 400;\">here<\/span><\/a><span style=\"font-weight: 400;\">.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">In brief, the usage is as follows:<\/span><\/p>\n<p><strong>1) Clone the repository and install the requirements<\/strong><\/p>\n<pre><span style=\"font-weight: 400;\">git clone https:\/\/github.com\/Lightning-AI\/lit-gpt<\/span>\r\n\r\n<span style=\"font-weight: 400;\">cd lit-gpt<\/span>\r\n\r\n<span style=\"font-weight: 400;\">pip install -r requirements.txt<\/span><\/pre>\n<p><strong>2) Download and prepare a model checkpoint<\/strong><\/p>\n<pre><span style=\"font-weight: 400;\">python scripts\/download.py \\<\/span>\r\n<span style=\"font-weight: 400;\">  --repo_id mistralai\/Mistral-7B-Instruct-v0.1<\/span>\r\n<span style=\"font-weight: 400;\"># there are many other supported models<\/span><\/pre>\n<pre><span style=\"font-weight: 400;\">python scripts\/convert_hf_checkpoint.py \\<\/span>\r\n<span style=\"font-weight: 400;\">  --checkpoint_dir checkpoints\/mistralai\/Mistral-7B-Instruct-v0.1<\/span><\/pre>\n<p><strong>3) Prepare a dataset<\/strong><\/p>\n<pre><span style=\"font-weight: 400;\">python scripts\/prepare_alpaca.py \\<\/span>\r\n<span style=\"font-weight: 400;\">\u00a0\u00a0--checkpoint_dir checkpoints\/mistralai\/Mistral-7B-Instruct-v0.1<\/span><\/pre>\n<pre><span style=\"font-weight: 400;\"># or from a custom CSV 
file<\/span>\r\n<span style=\"font-weight: 400;\">python scripts\/prepare_csv.py \\<\/span>\r\n<span style=\"font-weight: 400;\"> \u00a0--csv_dir MyDataset.csv \\<\/span>\r\n<span style=\"font-weight: 400;\"> \u00a0--checkpoint_dir checkpoints\/mistralai\/Mistral-7B-Instruct-v0.1<\/span>\r\n<\/pre>\n<p><strong>4) Finetune<\/strong><\/p>\n<pre><span style=\"font-weight: 400;\">python finetune\/lora.py \\<\/span>\r\n<span style=\"font-weight: 400;\"> \u00a0--checkpoint_dir checkpoints\/mistralai\/Mistral-7B-Instruct-v0.1\/ \\<\/span>\r\n<span style=\"font-weight: 400;\">\u00a0\u00a0--precision bf16-true<\/span><\/pre>\n<p><strong>5) Merge LoRA weights<\/strong><\/p>\n<pre><span style=\"font-weight: 400;\">python scripts\/merge_lora.py \\<\/span>\r\n<span style=\"font-weight: 400;\"> \u00a0--checkpoint_dir \"checkpoints\/mistralai\/Mistral-7B-Instruct-v0.1\" \\<\/span>\r\n<span style=\"font-weight: 400;\"> \u00a0--lora_path \"out\/lora\/alpaca\/Mistral-7B-Instruct-v0.1\/lit_model_lora_finetuned.pth\" \\<\/span>\r\n<span style=\"font-weight: 400;\"> \u00a0--out_dir \"out\/lora_merged\/Mistral-7B-Instruct-v0.1\/\"<\/span>\r\n<span style=\"font-weight: 400;\">\r\ncp checkpoints\/mistralai\/Mistral-7B-Instruct-v0.1\/*.json \\<\/span>\r\n<span style=\"font-weight: 400;\">\u00a0\u00a0out\/lora_merged\/Mistral-7B-Instruct-v0.1\/<\/span><\/pre>\n<p><strong>6) Evaluate<\/strong><\/p>\n<pre><span style=\"font-weight: 400;\">python eval\/lm_eval_harness.py \\<\/span>\r\n<span style=\"font-weight: 400;\"> \u00a0--checkpoint_dir \"out\/lora_merged\/Mistral-7B-Instruct-v0.1\/\" \\<\/span>\r\n<span style=\"font-weight: 400;\"> \u00a0--eval_tasks \"[arithmetic_2ds, ..., truthfulqa_mc]\" \\<\/span>\r\n<span style=\"font-weight: 400;\"> \u00a0--precision \"bf16-true\" \\<\/span>\r\n<span style=\"font-weight: 400;\"> \u00a0--batch_size 4 \\<\/span>\r\n<span style=\"font-weight: 400;\"> \u00a0--num_fewshot 0 \\<\/span>\r\n<span style=\"font-weight: 400;\">\u00a0\u00a0--save_filepath 
\"results.json\"<\/span><\/pre>\n<p><strong>7) Use<\/strong><\/p>\n<pre><span style=\"font-weight: 400;\">python chat\/base.py \\ <\/span>\r\n<span style=\"font-weight: 400;\">\u00a0\u00a0--checkpoint_dir \"out\/lora_merged\/Mistral-7B-Instruct-v0.1\/\"<\/span><\/pre>\n<p>&nbsp;<\/p>\n<h2 id=\"toc4\">Choosing a Good Base Model<\/h2>\n<p><span style=\"font-weight: 400;\">The first task was to select a competent base model for the LoRA experiments. For this, I focused on models that were not already instruction-finetuned: <\/span><a href=\"https:\/\/arxiv.org\/abs\/2309.05463\"><span style=\"font-weight: 400;\">phi-1.5 1.3B<\/span><\/a><span style=\"font-weight: 400;\">, <\/span><a href=\"https:\/\/arxiv.org\/abs\/2310.06825\"><span style=\"font-weight: 400;\">Mistral 7B<\/span><\/a><span style=\"font-weight: 400;\">, <\/span><a href=\"https:\/\/arxiv.org\/abs\/2307.09288\"><span style=\"font-weight: 400;\">Llama 2 7B<\/span><\/a><span style=\"font-weight: 400;\">, Llama 2 13B, and <\/span><a href=\"https:\/\/falconllm.tii.ae\/\"><span style=\"font-weight: 400;\">Falcon 40B<\/span><\/a><span style=\"font-weight: 400;\">. 
Note that all experiments were run on a single A100 GPU.<\/span><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-5649065\" src=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/10\/lora-expimage2.jpg\" alt=\"\" width=\"1764\" height=\"322\" srcset=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/10\/lora-expimage2.jpg 1764w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/10\/lora-expimage2-300x55.jpg 300w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/10\/lora-expimage2-1024x187.jpg 1024w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/10\/lora-expimage2-1536x280.jpg 1536w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/10\/lora-expimage2-300x55@2x.jpg 600w\" sizes=\"(max-width: 1764px) 100vw, 1764px\" \/><\/p>\n<p><span style=\"font-weight: 400;\">As we can see from the table above, the Mistral 7B model performs extraordinarily well on the math benchmarks. Meanwhile, the phi-1.5 1.3B model showcases impressive TruthfulQA MC2 performance given its relatively small size. For some reason, Llama 2 13B struggles with the arithmetic benchmarks, whereas the smaller Llama 2 7B outperforms it significantly in that area.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Since researchers and practitioners are currently speculating that phi-1.5 1.3B and Mistral 7B might have been trained on benchmark test data, I chose not to use them in my experiments. Moreover, I believed that selecting the smallest of the remaining models would provide the most room for improvement while maintaining lower hardware requirements. 
<\/span><b>Therefore, the remainder of this article will focus on Llama 2 7B.<\/b><\/p>\n<p>&nbsp;<\/p>\n<h2 id=\"toc5\">Evaluating the LoRA Defaults<\/h2>\n<p><span style=\"font-weight: 400;\">First, I evaluated LoRA finetuning with the following default settings (these can be changed in the <\/span><a href=\"https:\/\/github.com\/Lightning-AI\/lit-gpt\/blob\/bf60124fa72a56436c7d4fecc093c7fc48e84433\/finetune\/lora.py#L38\"><span style=\"font-weight: 400;\">finetune\/lora.py<\/span><\/a><span style=\"font-weight: 400;\"> script):<\/span><\/p>\n<pre class=\"hljs collapse-false language-python\"><span style=\"font-weight: 400;\"># Hyperparameters<\/span>\r\n<span style=\"font-weight: 400;\">learning_rate = 3e-4<\/span>\r\n<span style=\"font-weight: 400;\">batch_size = 128<\/span>\r\n<span style=\"font-weight: 400;\">micro_batch_size = 1<\/span>\r\n<span style=\"font-weight: 400;\">max_iters = 50000\u00a0 # train dataset size<\/span>\r\n<span style=\"font-weight: 400;\">weight_decay = 0.01<\/span>\r\n<span style=\"font-weight: 400;\">lora_r = 8<\/span>\r\n<span style=\"font-weight: 400;\">lora_alpha = 16<\/span>\r\n<span style=\"font-weight: 400;\">lora_dropout = 0.05<\/span>\r\n<span style=\"font-weight: 400;\">lora_query = True<\/span>\r\n<span style=\"font-weight: 400;\">lora_key = False<\/span>\r\n<span style=\"font-weight: 400;\">lora_value = True<\/span>\r\n<span style=\"font-weight: 400;\">lora_projection = False<\/span>\r\n<span style=\"font-weight: 400;\">lora_mlp = False<\/span>\r\n<span style=\"font-weight: 400;\">lora_head = False<\/span>\r\n<span style=\"font-weight: 400;\">warmup_steps = 100<\/span><\/pre>\n<p><span style=\"font-weight: 400;\">(Note that the batch size is 128, but we are using gradient accumulation with a microbatch size of 1 to save memory; it results in the equivalent training trajectory as regular training with batch size 128. 
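<\/span><\/p>
<p>The equivalence between gradient accumulation and full-batch training can be illustrated with a small, self-contained sketch. This is generic illustrative Python with a toy quadratic loss, not the actual Lit-GPT training loop:<\/p>

```python
# Sketch: accumulating gradients over micro-batches of size 1 and stepping
# once per "effective" batch yields the same update as one full-batch step,
# as long as per-example gradients are averaged the same way.

def grad(w, x):
    # gradient of the per-example loss 0.5 * (w - x)**2 with respect to w
    return w - x

def train_full_batch(w, batch, lr):
    g = sum(grad(w, x) for x in batch) / len(batch)
    return w - lr * g

def train_accumulated(w, batch, lr, accumulation_steps):
    acc = 0.0
    for i, x in enumerate(batch, start=1):
        acc += grad(w, x) / accumulation_steps  # scale each micro-batch gradient
        if i % accumulation_steps == 0:
            w -= lr * acc  # optimizer step once per effective batch
            acc = 0.0
    return w

batch = [0.5, 1.5, 2.0, 4.0]
w_full = train_full_batch(0.0, batch, lr=0.1)
w_acc = train_accumulated(0.0, batch, lr=0.1, accumulation_steps=4)
assert abs(w_full - w_acc) < 1e-12  # identical resulting weight
```

<p><span style=\"font-weight: 400;\">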
If you are curious about how gradient accumulation works, please see my article <\/span><a href=\"https:\/\/lightning.ai\/blog\/gradient-accumulation\/\"><span style=\"font-weight: 400;\">Finetuning LLMs on a Single GPU Using Gradient Accumulation<\/span><\/a><span style=\"font-weight: 400;\">).<\/span><\/p>\n<p><span style=\"font-weight: 400;\">This configuration trained 4,194,304 LoRA parameters out of a total of 6,738,415,616 trainable parameters and took approximately 1.8 hours on my machine using a single A100. The maximum memory usage was 21.33 GB.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">To gauge the variance, I repeated the experiment three times to observe the fluctuation in performance between runs.<\/span><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-5649066\" src=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/10\/lora-expimage12.jpg\" alt=\"\" width=\"899\" height=\"577\" srcset=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/10\/lora-expimage12.jpg 1999w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/10\/lora-expimage12-300x193.jpg 300w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/10\/lora-expimage12-1024x657.jpg 1024w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/10\/lora-expimage12-1536x986.jpg 1536w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/10\/lora-expimage12-300x193@2x.jpg 600w\" sizes=\"(max-width: 899px) 100vw, 899px\" \/><\/p>\n<p><span style=\"font-weight: 400;\">As we can see in the table above, the performance between runs is very consistent and stable. 
It&#8217;s also worth noting that the LoRA default model became really bad at arithmetic, but this is probably to be expected as Alpaca does not contain (m)any arithmetic tasks to the best of my knowledge.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">In addition, I looked at the 7B Llama 2 version that has been instruction-finetuned by Meta using RLHF. As we can see in the table below, the arithmetic performance is also worse for Meta&#8217;s Llama 2 Chat model. However, the Chat model is much improved on the other benchmarks (except BLiMP), which gives us a reference level to aim for with LoRA finetuning.<\/span><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-5649067\" src=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/10\/lora-expimage10.jpg\" alt=\"\" width=\"1424\" height=\"260\" srcset=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/10\/lora-expimage10.jpg 1424w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/10\/lora-expimage10-300x55.jpg 300w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/10\/lora-expimage10-1024x187.jpg 1024w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/10\/lora-expimage10-300x55@2x.jpg 600w\" sizes=\"(max-width: 1424px) 100vw, 1424px\" \/><\/p>\n<p>&nbsp;<\/p>\n<h2 id=\"toc6\">Memory Savings with QLoRA<\/h2>\n<p><span style=\"font-weight: 400;\">Before we start tuning the LoRA hyperparameters, I wanted to explore the trade-off between modeling performance and memory savings provided by QLoRA (the popular quantized LoRA technique by <\/span><a href=\"https:\/\/arxiv.org\/abs\/2305.14314\"><span style=\"font-weight: 400;\">Dettmers et al<\/span><\/a><span style=\"font-weight: 400;\">).\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">We can enable QLoRA via the <\/span><span style=\"font-weight: 400;\">--quantize<\/span><span style=\"font-weight: 
 flag (here">
400;\"> flag (here with 4-bit Normal Float type) in Lit-GPT as follows:<\/span><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-5649068\" src=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/10\/lora-expimage9.jpg\" alt=\"\" width=\"618\" height=\"218\" srcset=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/10\/lora-expimage9.jpg 1088w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/10\/lora-expimage9-300x106.jpg 300w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/10\/lora-expimage9-1024x361.jpg 1024w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/10\/lora-expimage9-300x106@2x.jpg 600w\" sizes=\"(max-width: 618px) 100vw, 618px\" \/><\/p>\n<p><span style=\"font-weight: 400;\">In addition, I tried 4-bit floating point precision as a control. Below is the impact on the training time and maximum memory usage:<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Default LoRA (with bfloat-16):<\/span><\/p>\n<ul>\n<li><span style=\"font-weight: 400;\">Training time: 6685.75s<\/span><\/li>\n<li><span style=\"font-weight: 400;\">Memory used: 21.33 GB<\/span><\/li>\n<\/ul>\n<p><span style=\"font-weight: 400;\">QLoRA via <\/span><span style=\"font-weight: 400;\">--quantize &#8220;bnb.nf4&#8221;<\/span><span style=\"font-weight: 400;\">:<\/span><\/p>\n<ul>\n<li><span style=\"font-weight: 400;\">Training time: 10059.53s<\/span><\/li>\n<li><span style=\"font-weight: 400;\">Memory used: 14.18 GB<\/span><\/li>\n<\/ul>\n<p><span style=\"font-weight: 400;\">QLoRA via <\/span><span style=\"font-weight: 400;\">--quantize &#8220;bnb.fp4&#8221;<\/span><span style=\"font-weight: 400;\">:<\/span><\/p>\n<ul>\n<li><span style=\"font-weight: 400;\">Training time: 9334.45s<\/span><\/li>\n<li><span style=\"font-weight: 400;\">Memory used: 14.19 GB<\/span><\/li>\n<\/ul>\n<p><span style=\"font-weight: 400;\">We can see that QLoRA decreases the memory 
requirements by roughly 7 GB (from 21.33 GB down to 14.18 GB). However, the tradeoff is a roughly 50% longer training time, which is to be expected due to the additional quantization and dequantization steps.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Next, let&#8217;s take a look at how QLoRA training affects the model performance:<\/span><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-5649069\" src=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/10\/lora-expimage3.jpg\" alt=\"\" width=\"1728\" height=\"356\" srcset=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/10\/lora-expimage3.jpg 1728w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/10\/lora-expimage3-300x62.jpg 300w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/10\/lora-expimage3-1024x211.jpg 1024w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/10\/lora-expimage3-1536x316.jpg 1536w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/10\/lora-expimage3-300x62@2x.jpg 600w\" sizes=\"(max-width: 1728px) 100vw, 1728px\" \/><\/p>\n<p><span style=\"font-weight: 400;\">As we can see in the table above, QLoRA does have a small impact on the model performance compared to regular LoRA. The model improves on the arithmetic benchmarks but declines on the MMLU Global Facts benchmark.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Since the memory savings are quite substantial (which usually outweighs the longer training time because it allows users to run the models on smaller GPUs), I will use QLoRA for the remainder of the article. 
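<\/span><\/p>
<p>To build some intuition for where the savings come from, here is a schematic sketch of blockwise 4-bit quantization with one shared scale per block. It is a simplified illustration of the general idea (using a symmetric 15-level grid), not the actual NF4 data type or the bitsandbytes implementation:<\/p>

```python
# Schematic illustration of blockwise 4-bit quantization, the idea behind
# QLoRA's compact weight storage. NOT the actual bitsandbytes/NF4 code.

def quantize_block(values, levels=16):
    # store one float scale (absmax) per block plus a 4-bit code per value
    scale = max(abs(v) for v in values) or 1.0
    codes = [round((v / scale) * (levels // 2 - 1)) for v in values]
    return scale, codes

def dequantize_block(scale, codes, levels=16):
    # reconstruct approximate weights from codes and the shared scale
    return [c * scale / (levels // 2 - 1) for c in codes]

weights = [0.31, -0.02, 0.75, -0.44]
scale, codes = quantize_block(weights)
restored = dequantize_block(scale, codes)
errors = [abs(w - r) for w, r in zip(weights, restored)]
assert max(errors) < 0.11  # coarse approximation, large memory savings
```

<p><span style=\"font-weight: 400;\">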
<\/span><\/p>\n<p>&nbsp;<\/p>\n<h2 id=\"toc7\">Learning Rate Schedulers and SGD<\/h2>\n<p><span style=\"font-weight: 400;\">I used the <\/span><a href=\"https:\/\/arxiv.org\/abs\/1711.05101\"><span style=\"font-weight: 400;\">AdamW<\/span><\/a><span style=\"font-weight: 400;\"> optimizer for all the previous experiments since it&#8217;s a common choice for LLM training. However, it&#8217;s well known that the Adam optimizer can be quite memory-intensive. This is because it introduces and tracks two additional parameters (the moments <\/span><i><span style=\"font-weight: 400;\">m<\/span><\/i><span style=\"font-weight: 400;\"> and <\/span><i><span style=\"font-weight: 400;\">v<\/span><\/i><span style=\"font-weight: 400;\">) for each model parameter. Large language models (LLMs) have many model parameters; for instance, our Llama 2 model has 7 billion model parameters.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">This section explores whether it&#8217;s worthwhile swapping AdamW with an SGD optimizer. However, for SGD optimizers it&#8217;s especially important to also introduce a learning rate scheduler. 
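<\/span><\/p>
<p>One popular choice is cosine annealing, which smoothly decays the learning rate over training. Below is a minimal sketch of the schedule formula; the function name and arguments are illustrative, not the Lit-GPT implementation:<\/p>

```python
import math

# Cosine annealing: decay the learning rate from lr_max toward lr_min over
# total_steps, following half a cosine wave.
def cosine_annealing_lr(step, total_steps, lr_max, lr_min=0.0):
    progress = step / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))

# starts at lr_max, reaches half the range midway, ends near lr_min
assert abs(cosine_annealing_lr(0, 1000, 3e-4) - 3e-4) < 1e-12
assert abs(cosine_annealing_lr(500, 1000, 3e-4) - 1.5e-4) < 1e-9
assert abs(cosine_annealing_lr(1000, 1000, 3e-4)) < 1e-12
```

<p><span style=\"font-weight: 400;\">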
I opted for a cosine annealing schedule that lowers the learning rate after each batch update.<\/span><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-5649070\" src=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/10\/lora-expimage11.jpg\" alt=\"\" width=\"603\" height=\"452\" srcset=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/10\/lora-expimage11.jpg 1920w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/10\/lora-expimage11-300x225.jpg 300w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/10\/lora-expimage11-1024x768.jpg 1024w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/10\/lora-expimage11-1536x1152.jpg 1536w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/10\/lora-expimage11-300x225@2x.jpg 600w\" sizes=\"(max-width: 603px) 100vw, 603px\" \/><\/p>\n<p><span style=\"font-weight: 400;\">If you are interested in more details on using learning rate schedulers in PyTorch, I have a lecture on it <\/span><a href=\"https:\/\/lightning.ai\/courses\/deep-learning-fundamentals\/unit-6-overview-essential-deep-learning-tips-tricks\/unit-6.2-learning-rates-and-learning-rate-schedulers\/\"><span style=\"font-weight: 400;\">here<\/span><\/a><span style=\"font-weight: 400;\">.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Unfortunately, swapping AdamW with SGD resulted in only minor memory savings.<\/span><\/p>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">AdamW: 14.18 GB<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">SGD: 14.15 GB<\/span><\/li>\n<\/ul>\n<p><span style=\"font-weight: 400;\">This is likely due to the fact that most of the memory is spent on large matrix multiplications rather than keeping additional parameters in memory.<\/span><\/p>\n<p>But this small difference is perhaps expected. 
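<\/p>
<p>Adam keeps two extra moment buffers per trainable parameter, so the expected overhead can be estimated directly. A quick back-of-the-envelope sketch, assuming the two moments are stored in 16-bit floats and using the trainable-parameter counts from the experiments in this article:<\/p>

```python
# Estimate Adam's extra optimizer-state memory: two extra values (the moments
# m and v) per trainable parameter, each assumed stored as a 16-bit float.
def adam_state_bytes(n_trainable_params, bytes_per_value=2, n_moments=2):
    return n_trainable_params * n_moments * bytes_per_value

lora_r8 = adam_state_bytes(4_194_304)      # LoRA with r=8
lora_r256 = adam_state_bytes(648_871_936)  # LoRA with r=256, used later

print(round(lora_r8 / 1e6, 2), 'MB')   # ~16.78 MB
print(round(lora_r256 / 1e9, 2), 'GB') # ~2.6 GB
```

<p>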
With the currently chosen LoRA configuration (r=8), we have 4,194,304 trainable parameters. If Adam adds 2 additional values for each model parameter, which are stored in 16-bit floats, we have 4,194,304 * 2 * 16 bit = 134.22 megabits = 16.78 megabytes.<\/p>\n<p>We can observe a larger difference when we increase LoRA&#8217;s r to 256, which we will do later. In the case of r=256, we have 648,871,936 trainable parameters, which equals 2.6 GB using the same calculation as above. The actual measurement resulted in a 3.4 GB difference, perhaps due to some additional overhead in storing and copying optimizer states.<\/p>\n<p>The bottom line is that for small numbers of trainable parameters, such as in the case with LoRA and low r (rank) values, the memory gain from swapping AdamW with SGD can be very small, in contrast to pretraining, where we train a larger number of parameters.<\/p>\n<p><span style=\"font-weight: 400;\">Even though SGD does not provide us with notable memory savings here, let&#8217;s still have a quick look at the resulting model performance:<\/span><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-5649071\" src=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/10\/lora-expimage13.jpg\" alt=\"\" width=\"1754\" height=\"264\" srcset=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/10\/lora-expimage13.jpg 1754w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/10\/lora-expimage13-300x45.jpg 300w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/10\/lora-expimage13-1024x154.jpg 1024w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/10\/lora-expimage13-1536x231.jpg 1536w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/10\/lora-expimage13-300x45@2x.jpg 600w\" sizes=\"(max-width: 1754px) 100vw, 1754px\" \/><\/p>\n<p><span style=\"font-weight: 400;\">It seems that the performance of the SGD optimizer is 
comparable to that of AdamW. Interestingly, when a scheduler is added to AdamW, there&#8217;s an improvement in the TruthfulQA MC2 and MMLU Global Facts performances, but a decrease in arithmetic performance. (Note: TruthfulQA MC2 is a widely recognized benchmark featured in other public leaderboards.) For the time being, we won&#8217;t place too much emphasis on the arithmetic performance and will proceed with the remaining experiments in this article using AdamW with a scheduler.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">If you want to reproduce these experiments, I found that the best AdamW learning rate was 3e-4 with a decay rate of 0.01. The best SGD learning rate was 0.1, with a momentum of 0.9. I used an additional 100 steps of learning rate warmup in both cases.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">(Based on these experiments, the cosine scheduler <\/span><a href=\"https:\/\/github.com\/Lightning-AI\/lit-gpt\/pull\/626\"><span style=\"font-weight: 400;\">has been added to Lit-GPT<\/span><\/a><span style=\"font-weight: 400;\"> and is now enabled by default.)<\/span><\/p>\n<p>&nbsp;<\/p>\n<h2 id=\"toc8\">Iterating Over the Dataset Multiple Times<\/h2>\n<p><span style=\"font-weight: 400;\">So far, I have trained all models with 50k iterations &#8212; the Alpaca dataset has 50k training examples. 
The obvious question is whether we can improve the model performance by iterating over the training set multiple times, so I ran the previous experiment with 100k iterations, which is a 2-fold increase:<\/span><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-5649072\" src=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/10\/lora-expimage6.jpg\" alt=\"\" width=\"1756\" height=\"266\" srcset=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/10\/lora-expimage6.jpg 1756w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/10\/lora-expimage6-300x45.jpg 300w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/10\/lora-expimage6-1024x155.jpg 1024w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/10\/lora-expimage6-1536x233.jpg 1536w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/10\/lora-expimage6-300x45@2x.jpg 600w\" sizes=\"(max-width: 1756px) 100vw, 1756px\" \/><\/p>\n<p><span style=\"font-weight: 400;\">Interestingly, the increased iterations result in worse performance across the board. The decline is most significant for the arithmetic benchmarks. My hypothesis is that the Alpaca dataset does not contain any related arithmetic tasks, and the model actively unlearns basic arithmetic when it focuses more on other tasks.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Anyway, I would be lying if I said this outcome wasn&#8217;t welcome. This way, I can continue with the shorter 50k iteration experiments for the remainder of this article.<\/span><\/p>\n<p>&nbsp;<\/p>\n<h2 id=\"toc9\">LoRA Hyperparameter Tuning Part 1: LoRA for All Layers<\/h2>\n<p><span style=\"font-weight: 400;\">Now that we have explored the basic settings surrounding the LoRA finetuning scripts, let&#8217;s turn our attention to the LoRA hyperparameters themselves. 
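<\/span><\/p>
<p>As a quick refresher before tuning: a LoRA layer computes the frozen projection W x plus a low-rank update B A x scaled by alpha \/ r. Below is a minimal, framework-free sketch with illustrative shapes and values; real implementations such as Lit-GPT use PyTorch, and B is initialized to zero so that training starts from the frozen model:<\/p>

```python
# Minimal sketch of a LoRA-augmented linear layer: frozen weight W combined
# with a trainable low-rank update scaled by alpha / r. Illustrative only.

def matvec(M, x):
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]

def lora_linear(x, W, A, B, alpha, r):
    frozen = matvec(W, x)             # original (frozen) projection
    update = matvec(B, matvec(A, x))  # low-rank update: B @ (A @ x)
    scale = alpha / r
    return [f + scale * u for f, u in zip(frozen, update)]

W = [[1.0, 0.0], [0.0, 1.0]]  # 2x2 frozen weight
A = [[1.0, 1.0]]              # r x in_features with r=1 (trainable)
B = [[0.5], [0.5]]            # out_features x r (trainable; zero at init)
x = [2.0, 3.0]

out = lora_linear(x, W, A, B, alpha=2, r=1)
print(out)  # [7.0, 8.0]
```

<p><span style=\"font-weight: 400;\">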
By default, LoRA was only enabled for the Key and Query matrices in the multi-head self-attention blocks. Now, we are also enabling it for the Value matrix, the projection layers, and the linear layers:<\/span><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-5649073\" src=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/10\/lora-expimage18.jpg\" alt=\"\" width=\"395\" height=\"335\" srcset=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/10\/lora-expimage18.jpg 1046w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/10\/lora-expimage18-300x255.jpg 300w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/10\/lora-expimage18-1024x869.jpg 1024w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/10\/lora-expimage18-300x255@2x.jpg 600w\" sizes=\"(max-width: 395px) 100vw, 395px\" \/><\/p>\n<p>&nbsp;<\/p>\n<h2 id=\"toc10\">LoRA Hyperparameter Tuning Part 2: Increasing R<\/h2>\n<p><span style=\"font-weight: 400;\">One of the most important LoRA parameters is &#8220;r&#8221;, which determines the rank or dimension of the LoRA matrices, directly influencing the complexity and capacity of the model. A higher &#8220;r&#8221; means more expressive power but can lead to overfitting, while a lower &#8220;r&#8221; can reduce overfitting at the expense of expressiveness. 
Keeping LoRA enabled for all layers, let&#8217;s increase r from 8 to 16 and see what impact this has on the performance:<\/span><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-5649074\" src=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/10\/lora-expimage17.jpg\" alt=\"\" width=\"1756\" height=\"322\" srcset=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/10\/lora-expimage17.jpg 1756w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/10\/lora-expimage17-300x55.jpg 300w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/10\/lora-expimage17-1024x188.jpg 1024w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/10\/lora-expimage17-1536x282.jpg 1536w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/10\/lora-expimage17-300x55@2x.jpg 600w\" sizes=\"(max-width: 1756px) 100vw, 1756px\" \/><\/p>\n<p><span style=\"font-weight: 400;\">We can see that increasing r by itself made the results worse, so what happened? Let&#8217;s find out in the next section.<\/span><\/p>\n<p>&nbsp;<\/p>\n<h2 id=\"toc11\">LoRA Hyperparameter Tuning Part 3: Changing Alpha<\/h2>\n<p><span style=\"font-weight: 400;\">In the previous section, we increased the matrix rank r while leaving LoRA&#8217;s alpha parameter unchanged. A higher &#8220;alpha&#8221; places more emphasis on the low-rank structure or regularization, while a lower &#8220;alpha&#8221; reduces its influence, making the model rely more on the original parameters. Adjusting &#8220;alpha&#8221; helps strike a balance between fitting the data and preventing overfitting by regularizing the model.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">As a rule of thumb, it&#8217;s common to choose an alpha that is twice as large as the rank when finetuning LLMs (note that this is different when working with diffusion models). 
Let&#8217;s try this out and see what happens when we increase alpha two-fold:<\/span><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-5649075\" src=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/10\/lora-expimage8.jpg\" alt=\"\" width=\"1768\" height=\"380\" srcset=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/10\/lora-expimage8.jpg 1768w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/10\/lora-expimage8-300x64.jpg 300w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/10\/lora-expimage8-1024x220.jpg 1024w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/10\/lora-expimage8-1536x330.jpg 1536w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/10\/lora-expimage8-300x64@2x.jpg 600w\" sizes=\"(max-width: 1768px) 100vw, 1768px\" \/><\/p>\n<p><span style=\"font-weight: 400;\">As we can see, increasing alpha to 32 now yields our best model thus far! 
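To see how r and alpha interact, here is a minimal, framework-free sketch of a LoRA layer (the class and variable names are illustrative assumptions, not Lit-GPT&#8217;s): the low-rank update B&middot;A&middot;x is scaled by alpha / r, so doubling r while keeping alpha fixed halves the weight given to the adapter.

```python
import random

def matvec(M, x):
    """Multiply matrix M (list of rows) by vector x."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

class LoRALinear:
    """Minimal LoRA layer sketch: y = W x + (alpha / r) * B (A x)."""
    def __init__(self, d_in, d_out, r=8, alpha=16, seed=0):
        rng = random.Random(seed)
        # frozen pretrained weight, shape (d_out, d_in)
        self.W = [[rng.gauss(0, 0.02) for _ in range(d_in)] for _ in range(d_out)]
        # trainable factors: A (r, d_in) is randomly initialized, while
        # B (d_out, r) starts at zero so the adapter is a no-op before training
        self.A = [[rng.gauss(0, 0.02) for _ in range(d_in)] for _ in range(r)]
        self.B = [[0.0] * r for _ in range(d_out)]
        self.scaling = alpha / r  # the low-rank update is scaled by alpha / r

    def forward(self, x):
        base = matvec(self.W, x)
        update = matvec(self.B, matvec(self.A, x))
        return [b + self.scaling * u for b, u in zip(base, update)]
```

With alpha fixed at 16, going from r=8 (scaling 2.0) to r=16 (scaling 1.0) halves the adapter&#8217;s contribution, which is one reason r and alpha need to be adjusted together.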
But once again, this improvement comes at the cost of a larger number of trainable parameters:<\/span><\/p>\n<p><span style=\"font-weight: 400;\">r=8:<\/span><\/p>\n<ul>\n<li><span style=\"font-weight: 400;\">Number of trainable parameters: 20,277,248<\/span><\/li>\n<li><span style=\"font-weight: 400;\">Number of non-trainable parameters: 6,738,415,616<\/span><\/li>\n<li><span style=\"font-weight: 400;\">Memory used: 16.42 GB<\/span><\/li>\n<\/ul>\n<p><span style=\"font-weight: 400;\">r=16:<\/span><\/p>\n<ul>\n<li><span style=\"font-weight: 400;\">Number of trainable parameters: 40,554,496<\/span><\/li>\n<li><span style=\"font-weight: 400;\">Number of non-trainable parameters: 6,738,415,616<\/span><\/li>\n<li><span style=\"font-weight: 400;\">Memory used: 16.47 GB<\/span><\/li>\n<\/ul>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400;\">However, the number of trainable parameters is still small enough that it doesn&#8217;t noticeably impact the peak memory requirements.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Anyway, we are now finally starting to make some gains and improve the model performance by more noticeable margins. 
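The counts above can be sanity-checked from the LoRA factorization itself: each adapted weight matrix of shape (d_out, d_in) adds r * (d_out + d_in) trainable parameters, so doubling r exactly doubles the count. A sketch using Llama 2 7B&#8217;s layer shapes (32 blocks, 4096-dim attention matrices, 11008-dim MLP, 32000-token output head; the exact list of adapted modules is my assumption):

```python
def lora_trainable_params(layer_shapes, r):
    # each adapted (d_out, d_in) matrix is factored into B (d_out, r) and
    # A (r, d_in), contributing r * (d_out + d_in) trainable parameters
    return sum(r * (d_out + d_in) for d_out, d_in in layer_shapes)

# Llama 2 7B shapes with LoRA on all layers: Q, K, V, and output projection
# in attention, the three SwiGLU MLP matrices, plus the LM head
block = [(4096, 4096)] * 4 + [(11008, 4096), (11008, 4096), (4096, 11008)]
shapes = block * 32 + [(32000, 4096)]  # 32 transformer blocks + output head

print(lora_trainable_params(shapes, 8))   # r=8
print(lora_trainable_params(shapes, 16))  # exactly twice as many for r=16
```

With these assumed shapes, the formula reproduces the r=8 and r=16 counts listed above.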
So, let&#8217;s keep going and see how far we can push this by increasing the rank and alpha:<\/span><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-5649076\" src=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/10\/lora-expimage4.jpg\" alt=\"\" width=\"1770\" height=\"574\" srcset=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/10\/lora-expimage4.jpg 1770w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/10\/lora-expimage4-300x97.jpg 300w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/10\/lora-expimage4-1024x332.jpg 1024w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/10\/lora-expimage4-1536x498.jpg 1536w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/10\/lora-expimage4-300x97@2x.jpg 600w\" sizes=\"(max-width: 1770px) 100vw, 1770px\" \/><\/p>\n<p><span style=\"font-weight: 400;\">I also ran additional experiments with exceptionally large ranks (512, 1024, and 2048), but these resulted in poorer outcomes. Some of the runs didn&#8217;t even converge to a near-zero loss during training, which is why I didn&#8217;t add them to the table.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">We can note that the r=256 and alpha=512 model in the last row results in the best overall performance so far. 
As an additional control experiment, I repeated the runs with an alpha of 1 and noticed that a large alpha value was indeed necessary for good performance:<\/span><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-5649077\" src=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/10\/lora-expimage14.jpg\" alt=\"\" width=\"1764\" height=\"774\" srcset=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/10\/lora-expimage14.jpg 1764w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/10\/lora-expimage14-300x132.jpg 300w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/10\/lora-expimage14-1024x449.jpg 1024w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/10\/lora-expimage14-1536x674.jpg 1536w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/10\/lora-expimage14-300x132@2x.jpg 600w\" sizes=\"(max-width: 1764px) 100vw, 1764px\" \/><\/p>\n<p><span style=\"font-weight: 400;\">I also repeated the experiments with alpha values of 16 and 32, and observed similarly worse performance compared to choosing alpha as two times the rank.<\/span><\/p>\n<p>&nbsp;<\/p>\n<h2 id=\"toc12\">LoRA Hyperparameter Tuning Part 4: Very Large R<\/h2>\n<p><span style=\"font-weight: 400;\">For the final tuning experiment of this article, I wanted to further optimize the alpha value of the best model from the previous section (r=256, last row), suspecting that it might be a bit too large.<\/span><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-5649078\" src=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/10\/lora-expimage15.jpg\" alt=\"\" width=\"1758\" height=\"472\" srcset=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/10\/lora-expimage15.jpg 1758w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/10\/lora-expimage15-300x81.jpg 300w, 
https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/10\/lora-expimage15-1024x275.jpg 1024w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/10\/lora-expimage15-1536x412.jpg 1536w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/10\/lora-expimage15-300x81@2x.jpg 600w\" sizes=\"(max-width: 1758px) 100vw, 1758px\" \/><\/p>\n<p><span style=\"font-weight: 400;\">As seen in the table above, choosing a large alpha value appears to be crucial when increasing the rank.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">For the QLoRA model with r=256 and a=512, it&#8217;s evident that our model has made significant improvements over the base model. The only area where the finetuned model underperforms compared to the base model is in 4-digit arithmetic. However, this is understandable, considering the Alpaca dataset probably did not contain such training examples.<\/span><\/p>\n<p>Above, we&#8217;ve seen that the common recommendation of choosing alpha as two-times the rank (e.g., r=256 and alpha=512) indeed yielded the best results, and smaller alpha values resulted in worse outcomes. 
But how about increasing alpha past the &#8220;two-fold the rank&#8221; recommendation?<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-5649096\" src=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/10\/loraexp-2fold-rank.png\" alt=\"\" width=\"1760\" height=\"374\" srcset=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/10\/loraexp-2fold-rank.png 1760w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/10\/loraexp-2fold-rank-300x64.png 300w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/10\/loraexp-2fold-rank-1024x218.png 1024w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/10\/loraexp-2fold-rank-1536x326.png 1536w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/10\/loraexp-2fold-rank-300x64@2x.png 600w\" sizes=\"(max-width: 1760px) 100vw, 1760px\" \/><\/p>\n<p>Based on the results provided in the table above, choosing alpha such that it exceeds the &#8220;two-fold the rank&#8221; recommendation also makes the benchmark outcomes worse.<\/p>\n<h2 id=\"toc13\">Leaderboard Submission<\/h2>\n<p>We know that in machine learning, we should not use the test set multiple times. Otherwise, we risk over-optimizing to a specific task. Hence, it&#8217;s recommended to validate a model on a final independent dataset.<\/p>\n<p>Coincidentally, there&#8217;s currently the <a href=\"https:\/\/llm-efficiency-challenge.github.io\/\">NeurIPS LLM Efficiency challenge<\/a> under way, which is focused on finetuning an LLM on a single GPU. 
Since I was curious to see how the Llama-2 7B base model compares to our best LoRA model finetuned on Alpaca, I submitted both the base and the finetuned model to their leaderboard.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-5649094\" src=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/10\/updated-lora-exp.png\" alt=\"\" width=\"686\" height=\"193\" srcset=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/10\/updated-lora-exp.png 952w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/10\/updated-lora-exp-300x84.png 300w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/10\/updated-lora-exp-300x84@2x.png 600w\" sizes=\"(max-width: 686px) 100vw, 686px\" \/><\/p>\n<p>We can see that the (Q)LoRA finetuning, which took 10522.77s (~3h) to train and required 19.24 GB GPU memory with the r=256 setting, improved the performance on several but not all benchmarks. The performance could potentially be improved further by using finetuning datasets other than Alpaca and by considering alignment techniques such as RLHF, which I explained in more detail <a href=\"https:\/\/magazine.sebastianraschka.com\/p\/llm-training-rlhf-and-its-alternatives\">here<\/a>.<\/p>\n<p>&nbsp;<\/p>\n<h2 id=\"toc14\">Conclusion<\/h2>\n<p><span style=\"font-weight: 400;\">This article explored the various knobs we can tune when training custom LLMs using LoRA. We found that QLoRA is a great memory-saver even though it comes at an increased runtime cost. Moreover, while learning rate schedulers can be beneficial, choosing between AdamW and SGD optimizers makes little difference. And iterating over the dataset more than once can make the results even worse. The best bang for the buck can be achieved by optimizing the LoRA settings, including the rank. Increasing the rank will result in more trainable parameters, which could lead to higher degrees of overfitting and runtime costs. 
However, when increasing the rank, choosing the appropriate alpha value is important.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">This article was by no means exhaustive, in the sense that I did not have the time and resources to explore all possible configurations. Also, future improvements could be achieved by considering other datasets and models.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">I hope that you can gain an insight or two that you can apply to your projects. I kept the background information and explanations on various concepts like LoRA, learning rate schedulers, gradient accumulation, and so on to a minimum so that this article doesn&#8217;t become unreasonably long. However, I am more than happy to chat if you have any questions or concerns. You can reach me on <\/span><a href=\"https:\/\/twitter.com\/rasbt\"><span style=\"font-weight: 400;\">X\/Twitter<\/span><\/a><span style=\"font-weight: 400;\"> or <\/span><a href=\"https:\/\/linkedin.com\/in\/sebastianraschka\"><span style=\"font-weight: 400;\">LinkedIn<\/span><\/a> or reach out to <a href=\"https:\/\/twitter.com\/LightningAI\">@LightningAI<\/a><span style=\"font-weight: 400;\">.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">If you found this article useful, I would appreciate it if you could share it with your colleagues.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">For general feedback, suggestions, or improvements to Lit-GPT, please feel free to use the <\/span><a href=\"https:\/\/github.com\/Lightning-AI\/lit-gpt\"><span style=\"font-weight: 400;\">GitHub issue tracker<\/span><\/a><span style=\"font-weight: 400;\">.<\/span><\/p>\n","protected":false},"excerpt":{"rendered":"<p>&nbsp; Introduction: Getting the Most out of LoRA I&#8217;ve run hundreds, if not thousands, of experiments involving LoRA over the past few months. A few weeks ago, I took the time to delve deeper into some of the hyperparameter choices. 
This is more of an experimental diary presented in sequential order. I hope it proves<a class=\"excerpt-read-more\" href=\"https:\/\/lightning.ai\/pages\/community\/lora-insights\/\" title=\"ReadFinetuning LLMs with LoRA and QLoRA: Insights from Hundreds of Experiments\">&#8230; Read more &raquo;<\/a><\/p>\n","protected":false},"author":16,"featured_media":5649062,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"inline_featured_image":false,"footnotes":"","_links_to":"","_links_to_target":""},"categories":[29,106,41],"tags":[186,242,188,240,241],"glossary":[216,217,218,224],"acf":{"additional_authors":false,"hide_from_archive":false,"content_type":"Blog Post","sticky":false,"code_embed":false,"custom_styles":"div#table-of-contents ~ .container.pt-0 {\r\n    display: none;\r\n}\r\n\r\nmain h2 {\r\n    scroll-padding-top: 100px;\r\n    scroll-margin-top:100px;\r\n}\r\n\r\nmain img{\r\n  display:block;\r\n}","mathjax":false,"default_editor":true,"show_table_of_contents":true,"tabs":false,"table_of_contents":"<h4>Table of Contents<\/h4>\n<ul>\n<li><a href=\"#toc1\">Introduction: Getting the Most out of LoRA<\/a><\/li>\n<li><a href=\"#toc2\">Evaluation Tasks and Dataset<\/a><\/li>\n<li><a href=\"#toc3\">Code Framework<\/a><\/li>\n<li><a href=\"#toc4\">Choosing a Good Base Model<\/a><\/li>\n<li><a href=\"#toc5\">Evaluating the LoRA Defaults<\/a><\/li>\n<li><a href=\"#toc6\">Memory Savings with QLoRA<\/a><\/li>\n<li><a href=\"#toc7\">Learning Rate Schedulers and SGD<\/a><\/li>\n<li><a href=\"#toc8\">Iterating Over the Dataset Multiple Times<\/a><\/li>\n<li><a href=\"#toc9\">LoRA Hyperparameter Tuning Part 1: LoRA for All Layers<\/a><\/li>\n<li><a href=\"#toc10\">LoRA Hyperparameter Tuning Part 2: Increasing R<\/a><\/li>\n<li><a href=\"#toc11\">LoRA Hyperparameter Tuning Part 3: Changing Alpha<\/a><\/li>\n<li><a href=\"#toc12\">LoRA Hyperparameter Tuning Part 3: Very Large R<\/a><\/li>\n<li><a 
href=\"#toc13\">Leaderboard Submission<\/a><\/li>\n<li><a href=\"#toc14\">Conclusion<\/a><\/li>\n<\/ul>\n"},"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v24.5 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Finetuning LLMs with LoRA and QLoRA: Insights from Hundreds of Experiments - Lightning AI<\/title>\n<meta name=\"description\" content=\"LoRA is one of the most widely used, parameter-efficient finetuning techniques for training custom LLMs. From saving memory with QLoRA to selecting the optimal LoRA settings, this article provides practical insights for those interested in applying it.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/lightning.ai\/pages\/community\/lora-insights\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Finetuning LLMs with LoRA and QLoRA: Insights from Hundreds of Experiments - Lightning AI\" \/>\n<meta property=\"og:description\" content=\"LoRA is one of the most widely used, parameter-efficient finetuning techniques for training custom LLMs. 
From saving memory with QLoRA to selecting the optimal LoRA settings, this article provides practical insights for those interested in applying it.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/lightning.ai\/pages\/community\/lora-insights\/\" \/>\n<meta property=\"og:site_name\" content=\"Lightning AI\" \/>\n<meta property=\"article:published_time\" content=\"2023-10-13T01:58:40+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2023-10-16T19:07:34+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/10\/lora-expimage7.png\" \/>\n\t<meta property=\"og:image:width\" content=\"1002\" \/>\n\t<meta property=\"og:image:height\" content=\"592\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"JP Hennessy\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@LightningAI\" \/>\n<meta name=\"twitter:site\" content=\"@LightningAI\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"JP Hennessy\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"16 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/lightning.ai\/pages\/community\/lora-insights\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/lightning.ai\/pages\/community\/lora-insights\/\"},\"author\":{\"name\":\"JP Hennessy\",\"@id\":\"https:\/\/lightning.ai\/pages\/#\/schema\/person\/2518f4d5541f8e98016f6289169141a6\"},\"headline\":\"Finetuning LLMs with LoRA and QLoRA: Insights from Hundreds of Experiments\",\"datePublished\":\"2023-10-13T01:58:40+00:00\",\"dateModified\":\"2023-10-16T19:07:34+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/lightning.ai\/pages\/community\/lora-insights\/\"},\"wordCount\":3110,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\/\/lightning.ai\/pages\/#organization\"},\"image\":{\"@id\":\"https:\/\/lightning.ai\/pages\/community\/lora-insights\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/10\/lora-expimage7.png\",\"keywords\":[\"finetuning\",\"Lit-GPT\",\"LLMs\",\"LoRA\",\"QLoRA\"],\"articleSection\":[\"Blog\",\"Community\",\"Tutorials\"],\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\/\/lightning.ai\/pages\/community\/lora-insights\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/lightning.ai\/pages\/community\/lora-insights\/\",\"url\":\"https:\/\/lightning.ai\/pages\/community\/lora-insights\/\",\"name\":\"Finetuning LLMs with LoRA and QLoRA: Insights from Hundreds of Experiments - Lightning 
AI\",\"isPartOf\":{\"@id\":\"https:\/\/lightning.ai\/pages\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/lightning.ai\/pages\/community\/lora-insights\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/lightning.ai\/pages\/community\/lora-insights\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/10\/lora-expimage7.png\",\"datePublished\":\"2023-10-13T01:58:40+00:00\",\"dateModified\":\"2023-10-16T19:07:34+00:00\",\"description\":\"LoRA is one of the most widely used, parameter-efficient finetuning techniques for training custom LLMs. From saving memory with QLoRA to selecting the optimal LoRA settings, this article provides practical insights for those interested in applying it.\",\"breadcrumb\":{\"@id\":\"https:\/\/lightning.ai\/pages\/community\/lora-insights\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/lightning.ai\/pages\/community\/lora-insights\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/lightning.ai\/pages\/community\/lora-insights\/#primaryimage\",\"url\":\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/10\/lora-expimage7.png\",\"contentUrl\":\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/10\/lora-expimage7.png\",\"width\":1002,\"height\":592},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/lightning.ai\/pages\/community\/lora-insights\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/lightning.ai\/pages\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Finetuning LLMs with LoRA and QLoRA: Insights from Hundreds of Experiments\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/lightning.ai\/pages\/#website\",\"url\":\"https:\/\/lightning.ai\/pages\/\",\"name\":\"Lightning AI\",\"description\":\"The platform for teams to build 
AI.\",\"publisher\":{\"@id\":\"https:\/\/lightning.ai\/pages\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/lightning.ai\/pages\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/lightning.ai\/pages\/#organization\",\"name\":\"Lightning AI\",\"url\":\"https:\/\/lightning.ai\/pages\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/lightning.ai\/pages\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/02\/image-17.png\",\"contentUrl\":\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/02\/image-17.png\",\"width\":1744,\"height\":856,\"caption\":\"Lightning AI\"},\"image\":{\"@id\":\"https:\/\/lightning.ai\/pages\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/x.com\/LightningAI\"]},{\"@type\":\"Person\",\"@id\":\"https:\/\/lightning.ai\/pages\/#\/schema\/person\/2518f4d5541f8e98016f6289169141a6\",\"name\":\"JP Hennessy\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/lightning.ai\/pages\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/28ade268218ae45f723b0b62499f527a?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/28ade268218ae45f723b0b62499f527a?s=96&d=mm&r=g\",\"caption\":\"JP Hennessy\"},\"url\":\"https:\/\/lightning.ai\/pages\/author\/jplightning-ai\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"Finetuning LLMs with LoRA and QLoRA: Insights from Hundreds of Experiments - Lightning AI","description":"LoRA is one of the most widely used, parameter-efficient finetuning techniques for training custom LLMs. 
From saving memory with QLoRA to selecting the optimal LoRA settings, this article provides practical insights for those interested in applying it.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/lightning.ai\/pages\/community\/lora-insights\/","og_locale":"en_US","og_type":"article","og_title":"Finetuning LLMs with LoRA and QLoRA: Insights from Hundreds of Experiments - Lightning AI","og_description":"LoRA is one of the most widely used, parameter-efficient finetuning techniques for training custom LLMs. From saving memory with QLoRA to selecting the optimal LoRA settings, this article provides practical insights for those interested in applying it.","og_url":"https:\/\/lightning.ai\/pages\/community\/lora-insights\/","og_site_name":"Lightning AI","article_published_time":"2023-10-13T01:58:40+00:00","article_modified_time":"2023-10-16T19:07:34+00:00","og_image":[{"width":1002,"height":592,"url":"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/10\/lora-expimage7.png","type":"image\/png"}],"author":"JP Hennessy","twitter_card":"summary_large_image","twitter_creator":"@LightningAI","twitter_site":"@LightningAI","twitter_misc":{"Written by":"JP Hennessy","Est. 
reading time":"16 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/lightning.ai\/pages\/community\/lora-insights\/#article","isPartOf":{"@id":"https:\/\/lightning.ai\/pages\/community\/lora-insights\/"},"author":{"name":"JP Hennessy","@id":"https:\/\/lightning.ai\/pages\/#\/schema\/person\/2518f4d5541f8e98016f6289169141a6"},"headline":"Finetuning LLMs with LoRA and QLoRA: Insights from Hundreds of Experiments","datePublished":"2023-10-13T01:58:40+00:00","dateModified":"2023-10-16T19:07:34+00:00","mainEntityOfPage":{"@id":"https:\/\/lightning.ai\/pages\/community\/lora-insights\/"},"wordCount":3110,"commentCount":0,"publisher":{"@id":"https:\/\/lightning.ai\/pages\/#organization"},"image":{"@id":"https:\/\/lightning.ai\/pages\/community\/lora-insights\/#primaryimage"},"thumbnailUrl":"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/10\/lora-expimage7.png","keywords":["finetuning","Lit-GPT","LLMs","LoRA","QLoRA"],"articleSection":["Blog","Community","Tutorials"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/lightning.ai\/pages\/community\/lora-insights\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/lightning.ai\/pages\/community\/lora-insights\/","url":"https:\/\/lightning.ai\/pages\/community\/lora-insights\/","name":"Finetuning LLMs with LoRA and QLoRA: Insights from Hundreds of Experiments - Lightning AI","isPartOf":{"@id":"https:\/\/lightning.ai\/pages\/#website"},"primaryImageOfPage":{"@id":"https:\/\/lightning.ai\/pages\/community\/lora-insights\/#primaryimage"},"image":{"@id":"https:\/\/lightning.ai\/pages\/community\/lora-insights\/#primaryimage"},"thumbnailUrl":"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/10\/lora-expimage7.png","datePublished":"2023-10-13T01:58:40+00:00","dateModified":"2023-10-16T19:07:34+00:00","description":"LoRA is one of the most widely used, parameter-efficient finetuning techniques for 
training custom LLMs. From saving memory with QLoRA to selecting the optimal LoRA settings, this article provides practical insights for those interested in applying it.","breadcrumb":{"@id":"https:\/\/lightning.ai\/pages\/community\/lora-insights\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/lightning.ai\/pages\/community\/lora-insights\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/lightning.ai\/pages\/community\/lora-insights\/#primaryimage","url":"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/10\/lora-expimage7.png","contentUrl":"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/10\/lora-expimage7.png","width":1002,"height":592},{"@type":"BreadcrumbList","@id":"https:\/\/lightning.ai\/pages\/community\/lora-insights\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/lightning.ai\/pages\/"},{"@type":"ListItem","position":2,"name":"Finetuning LLMs with LoRA and QLoRA: Insights from Hundreds of Experiments"}]},{"@type":"WebSite","@id":"https:\/\/lightning.ai\/pages\/#website","url":"https:\/\/lightning.ai\/pages\/","name":"Lightning AI","description":"The platform for teams to build AI.","publisher":{"@id":"https:\/\/lightning.ai\/pages\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/lightning.ai\/pages\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/lightning.ai\/pages\/#organization","name":"Lightning 
AI","url":"https:\/\/lightning.ai\/pages\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/lightning.ai\/pages\/#\/schema\/logo\/image\/","url":"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/02\/image-17.png","contentUrl":"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/02\/image-17.png","width":1744,"height":856,"caption":"Lightning AI"},"image":{"@id":"https:\/\/lightning.ai\/pages\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/x.com\/LightningAI"]},{"@type":"Person","@id":"https:\/\/lightning.ai\/pages\/#\/schema\/person\/2518f4d5541f8e98016f6289169141a6","name":"JP Hennessy","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/lightning.ai\/pages\/#\/schema\/person\/image\/","url":"https:\/\/secure.gravatar.com\/avatar\/28ade268218ae45f723b0b62499f527a?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/28ade268218ae45f723b0b62499f527a?s=96&d=mm&r=g","caption":"JP Hennessy"},"url":"https:\/\/lightning.ai\/pages\/author\/jplightning-ai\/"}]}},"_links":{"self":[{"href":"https:\/\/lightning.ai\/pages\/wp-json\/wp\/v2\/posts\/5649055"}],"collection":[{"href":"https:\/\/lightning.ai\/pages\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/lightning.ai\/pages\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/lightning.ai\/pages\/wp-json\/wp\/v2\/users\/16"}],"replies":[{"embeddable":true,"href":"https:\/\/lightning.ai\/pages\/wp-json\/wp\/v2\/comments?post=5649055"}],"version-history":[{"count":0,"href":"https:\/\/lightning.ai\/pages\/wp-json\/wp\/v2\/posts\/5649055\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/lightning.ai\/pages\/wp-json\/wp\/v2\/media\/5649062"}],"wp:attachment":[{"href":"https:\/\/lightning.ai\/pages\/wp-json\/wp\/v2\/media?parent=5649055"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/lightning.ai\/pages\/wp-json\/wp\/v2\/categories?post=5649055"},{"taxonomy":"post_tag","
embeddable":true,"href":"https:\/\/lightning.ai\/pages\/wp-json\/wp\/v2\/tags?post=5649055"},{"taxonomy":"glossary","embeddable":true,"href":"https:\/\/lightning.ai\/pages\/wp-json\/wp\/v2\/glossary?post=5649055"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}