{"id":5648254,"date":"2023-06-14T22:10:05","date_gmt":"2023-06-15T02:10:05","guid":{"rendered":"https:\/\/lightning.ai\/pages\/?p=5648254"},"modified":"2023-06-22T14:37:03","modified_gmt":"2023-06-22T18:37:03","slug":"finetuning-falcon-efficiently","status":"publish","type":"post","link":"https:\/\/lightning.ai\/pages\/community\/finetuning-falcon-efficiently\/","title":{"rendered":"Finetuning Falcon LLMs More Efficiently With LoRA and Adapters"},"content":{"rendered":"<div class=\"takeaways card-glow p-4 my-4\"><h3 class=\"w-100 d-block\">Key takeaway<\/h3><br \/>\nUsing parameter-efficient finetuning methods outlined in this article, it&#8217;s possible to finetune an open-source LLM like Falcon in 1 hour on a single GPU instead of a day on 6 GPUs.<\/div>\n<p>Finetuning allows us to adapt pretrained LLMs in a cost-efficient manner. But which method should we use? This article compares different parameter-efficient finetuning methods for the latest top-performing open-source LLM, Falcon.<\/p>\n<p>&nbsp;<\/p>\n<h2 class=\"md-end-block md-heading\"><span class=\"md-plain\">Pretraining and Finetuning LLMs<\/span><\/h2>\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">Before we dive into the LLM finetuning details, let&#8217;s briefly recap how we train LLMs in general.<\/span><\/p>\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">LLMs are trained in two stages. The first stage is an expensive pretraining step to train the models on a large, unlabeled dataset containing trillions of words. The resulting models are often called <\/span><span class=\"md-pair-s \"><em><span class=\"md-plain\">foundation<\/span><\/em><\/span><span class=\"md-plain\"> models since they have general capabilities and can be adapted for various downstream tasks. A classic example of a pretrained model is GPT-3.<\/span><\/p>\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">The second stage is finetuning such a foundation model. 
This typically involves training the pretrained model to follow instructions or perform another specific target task (for example, sentiment classification). ChatGPT (which started as a finetuned version of the GPT-3 foundation model) is a typical example of a model that was finetuned to follow instructions. Using parameter-efficient finetuning methods outlined in this article, it&#8217;s possible to finetune an LLM in 1 hour on a single GPU instead of a day on 6 GPUs.<\/span><\/p>\n<p>&nbsp;<\/p>\n<div align=\"center\">\n<div id=\"attachment_5648255\" style=\"width: 521px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-5648255\" class=\" wp-image-5648255\" src=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/06\/finetuning-process.png\" alt=\"\" width=\"511\" height=\"371\" srcset=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/06\/finetuning-process.png 1652w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/06\/finetuning-process-300x218.png 300w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/06\/finetuning-process-1024x743.png 1024w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/06\/finetuning-process-1536x1114.png 1536w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/06\/finetuning-process-300x218@2x.png 600w\" sizes=\"(max-width: 511px) 100vw, 511px\" \/><p id=\"caption-attachment-5648255\" class=\"wp-caption-text\">Finetuning a pretrained LLM to follow instructions<\/p><\/div>\n<\/div>\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">Finetuning also allows the model to better adapt to specific domains or types of text that were not well represented in its original training data. For example, we might finetune a model on medical literature if we want it to understand and generate medical texts. 
<\/span><\/p>\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">Besides building custom chatbots, finetuning allows customization of these models to specific business needs to offer superior performance in targeted applications. Furthermore, it can also provide a data privacy advantage when data cannot be uploaded or shared with cloud APIs.<\/span><\/p>\n<p class=\"md-end-block md-p\"><span class=\"md-pair-s\"><strong><span class=\"md-plain\">This article aims to illustrate how to finetune a top-performing LLM efficiently and cost-effectively in a few hours on a single GPU.<\/span><\/strong><\/span><\/p>\n<p>&nbsp;<\/p>\n<h2 class=\"md-end-block md-heading\"><span class=\"md-plain\">Finetuning Versus ChatGPT<\/span><\/h2>\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">In the era of ChatGPT, why do we care about finetuning models in the first place?<\/span><\/p>\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">The problem with closed models such as OpenAI&#8217;s ChatGPT and Google&#8217;s Bard is that they cannot be readily customized, which makes them less attractive for many use cases. However, fortunately, we have seen a large number of open-source LLMs emerging in recent months. 
(While ChatGPT and Bard have strong in-context learning capabilities, finetuned models outperform generalist models on specific tasks; recent research examples highlighting this include <\/span><span class=\"md-meta-i-c md-link\"><a href=\"https:\/\/arxiv.org\/abs\/2305.14201\"><span class=\"md-plain\">Goat<\/span><\/a><\/span><span class=\"md-plain\"> and <\/span><span class=\"md-meta-i-c md-link\"><a href=\"https:\/\/arxiv.org\/abs\/2305.15334\"><span class=\"md-plain\">Gorilla<\/span><\/a><\/span><span class=\"md-plain\">.)<\/span><\/p>\n<p>&nbsp;<\/p>\n<h2 class=\"md-end-block md-heading\"><span class=\"md-plain\">Open Source LLMs and the Falcon Architecture<\/span><\/h2>\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">Finetuning open-source LLMs has several benefits, such as better customization capabilities and task performance. Furthermore, open-source LLMs are an excellent testbed for researchers to develop novel techniques. But if we adopt an open-source model today, which one should it be?<\/span><\/p>\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">As of this writing, the Falcon model, developed by <\/span><span class=\"md-meta-i-c md-link\"><a href=\"https:\/\/www.tii.ae\/\"><span class=\"md-plain\">Technology Innovation Institute<\/span><\/a><\/span><span class=\"md-plain\">, is currently the top-performing open-source LLM. 
And in this article, we will learn how to finetune it efficiently, for example, on your custom dataset.<\/span><\/p>\n<p>&nbsp;<\/p>\n<div align=\"center\">\n<div id=\"attachment_5648256\" style=\"width: 964px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-5648256\" class=\" wp-image-5648256\" src=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/06\/openllm.png\" alt=\"\" width=\"954\" height=\"273\" srcset=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/06\/openllm.png 2488w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/06\/openllm-300x86.png 300w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/06\/openllm-1024x292.png 1024w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/06\/openllm-1536x438.png 1536w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/06\/openllm-2048x584.png 2048w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/06\/openllm-300x86@2x.png 600w\" sizes=\"(max-width: 954px) 100vw, 954px\" \/><p id=\"caption-attachment-5648256\" class=\"wp-caption-text\">Excerpt from the <a href=\"https:\/\/huggingface.co\/spaces\/HuggingFaceH4\/open_llm_leaderboard\">OpenLLM leaderboard<\/a><\/p><\/div>\n<\/div>\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">Falcon LLMs come in different sizes: as of this writing, there&#8217;s a 7 billion parameter (Falcon 7B) variant and a 40 billion parameter variant (Falcon 40B). Furthermore, each size comes as a foundation (Falcon 7B and Falcon 40B) and instruction-tuned model (Falcon 7B-instruct and Falcon 40B-instruct). The instruction-tuned models are already finetuned for general-purpose tasks (similar to ChatGPT), but they can be further finetuned on domain-specific data if needed. 
(PS: A <\/span><span class=\"md-meta-i-c md-link\"><a href=\"https:\/\/twitter.com\/TIIuae\/status\/1664353061840601088?s=20\"><span class=\"md-plain\">180B version is also in the works<\/span><\/a><\/span><span class=\"md-plain\">.)<\/span><\/p>\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">Note that the Falcon model is fully open-source and was released under a permissive <\/span><span class=\"md-meta-i-c md-link\"><a href=\"https:\/\/www.apache.org\/licenses\/LICENSE-2.0\"><span class=\"md-plain\">Apache version 2.0<\/span><\/a><\/span><span class=\"md-plain\"> license, which permits unrestricted commercial use &#8212; it&#8217;s the same license PyTorch Lightning, TensorFlow, and OpenOffice use, for example.<\/span><\/p>\n<p class=\"md-end-block md-p\"><span class=\"md-pair-s \"><strong><span class=\"md-plain\">How is Falcon different from other LLMs such as GPT or LLaMA?<\/span><\/strong><\/span><\/p>\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">Besides the better performance on the OpenLLM leaderboard, as highlighted above, there are also some small architectural differences between Falcon, LLaMA, and GPT. LLaMA (<\/span><span class=\"md-meta-i-c md-link\"><a href=\"https:\/\/arxiv.org\/abs\/2302.13971\"><span class=\"md-plain\">Touvron et al. 2023<\/span><\/a><\/span><span class=\"md-plain\">) introduced the following architecture improvements, which likely contributed to LLaMA&#8217;s better performance over GPT-3 (<\/span><span class=\"md-meta-i-c md-link\"><a href=\"https:\/\/arxiv.org\/abs\/2005.14165\"><span class=\"md-plain\">Brown et al. 
2020<\/span><\/a><\/span><span class=\"md-plain\">):<\/span><\/p>\n<ul class=\"ul-list\" data-mark=\"-\">\n<li class=\"md-list-item\">\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">Similar to GPT-3, LLaMA places the layer normalization before the self-attention blocks; however, instead of using LayerNorm (<\/span><span class=\"md-meta-i-c md-link\"><a href=\"https:\/\/arxiv.org\/abs\/1607.06450\"><span class=\"md-plain\">Ba et al. 2016<\/span><\/a><\/span><span class=\"md-plain\">) as in GPT-3, the researchers opted for the more recent RMSNorm (<\/span><span class=\"md-meta-i-c md-link\"><a href=\"https:\/\/arxiv.org\/abs\/1910.07467\"><span class=\"md-plain\">Zhang and Sennrich 2019<\/span><\/a><\/span><span class=\"md-plain\">) variant.<\/span><\/p>\n<\/li>\n<li class=\"md-list-item\">\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">LLaMA borrows the idea of using SwiGLU (<\/span><span class=\"md-meta-i-c md-link\"><a href=\"https:\/\/arxiv.org\/abs\/2002.05202\"><span class=\"md-plain\">Shazeer 2020<\/span><\/a><\/span><span class=\"md-plain\">) activations from PaLM (<\/span><span class=\"md-meta-i-c md-link\"><a href=\"https:\/\/arxiv.org\/abs\/2204.02311\"><span class=\"md-plain\">Chowdhery et al. 2022<\/span><\/a><\/span><span class=\"md-plain\">), instead of using ReLU as in GPT-3.<\/span><\/p>\n<\/li>\n<li class=\"md-list-item\">\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">Finally, LLaMA replaced the absolute positional embeddings used in GPT-3 with rotary positional embeddings (RoPE) (<\/span><span class=\"md-meta-i-c md-link\"><a href=\"https:\/\/arxiv.org\/abs\/2104.09864\"><span class=\"md-plain\">Su et al. 2022<\/span><\/a><\/span><span class=\"md-plain\">) similar to GPTNeo (<\/span><span class=\"md-meta-i-c md-link\"><a href=\"https:\/\/arxiv.org\/abs\/2204.06745\"><span class=\"md-plain\">Black et al. 
2022<\/span><\/a><\/span><span class=\"md-plain\">).<\/span><\/p>\n<\/li>\n<\/ul>\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">So, <\/span><span class=\"md-meta-i-c md-link\"><a href=\"https:\/\/huggingface.co\/tiiuae\/falcon-40b\"><span class=\"md-plain\">based on what&#8217;s currently known<\/span><\/a><\/span><span class=\"md-plain\">, Falcon adopts the same RoPE embeddings as LLaMA (and GPTNeo) but otherwise shares the same architecture as GPT-3, except for using multiquery attention (<\/span><span class=\"md-meta-i-c md-link\"><a href=\"https:\/\/arxiv.org\/abs\/1911.02150\"><span class=\"md-plain\">Shazeer 2019<\/span><\/a><\/span><span class=\"md-plain\">).<\/span><\/p>\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">Multiquery attention is a concept where the same key and value tensors are shared for efficiency across different attention heads, as illustrated for a multihead attention block below.<\/span><\/p>\n<div align=\"center\">\n<div id=\"attachment_5648258\" style=\"width: 639px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-5648258\" class=\"wp-image-5648258\" src=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/06\/multiquery.png\" alt=\"\" width=\"629\" height=\"370\" srcset=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/06\/multiquery.png 1652w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/06\/multiquery-300x177.png 300w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/06\/multiquery-1024x602.png 1024w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/06\/multiquery-1536x904.png 1536w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/06\/multiquery-300x177@2x.png 600w\" sizes=\"(max-width: 629px) 100vw, 629px\" \/><p id=\"caption-attachment-5648258\" class=\"wp-caption-text\">Multiquery attention<\/p><\/div>\n<\/div>\n<p class=\"md-end-block 
md-p\"><span class=\"md-plain\">Furthermore, according to the <\/span><span class=\"md-meta-i-c md-link\"><a href=\"https:\/\/huggingface.co\/tiiuae\/falcon-40b#training-data\"><span class=\"md-plain\">training data information<\/span><\/a><\/span><span class=\"md-plain\">, Falcon 40B was trained on 1,000B tokens, where 82% of these tokens came from the <\/span><span class=\"md-meta-i-c md-link\"><a href=\"https:\/\/huggingface.co\/datasets\/tiiuae\/falcon-refinedweb\"><span class=\"md-plain\">RefinedWeb<\/span><\/a><\/span><span class=\"md-plain\"> corpus, and the remaining tokens stemmed from books, papers, conversations (Reddit, StackOverflow, and HackerNews), and code.<\/span><\/p>\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">While the official Falcon paper has yet to be released, a related paper, <\/span><span class=\"md-meta-i-c md-link\"><a href=\"https:\/\/arxiv.org\/abs\/2306.01116\"><span class=\"md-pair-s \"><em><span class=\"md-plain\">The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only<\/span><\/em><\/span><\/a><\/span><span class=\"md-plain\">, provides evidence that carefully curated web data could have been the key to its success.<\/span><\/p>\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">To summarize, the Falcon architecture is very similar to GPT-3 and LLaMA. The key differentiating factor that led to Falcon&#8217;s good performance is likely its training dataset. <\/span><\/p>\n<p>&nbsp;<\/p>\n<h2 class=\"md-end-block md-heading\"><span class=\"md-plain\">Parameter-Efficient Finetuning Methods<\/span><\/h2>\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">The remainder of this article will mostly focus on Falcon 7B, which we can finetune on a single GPU. Falcon 7B is currently considered the best open-source LLM in its size class. 
(But the same code outlined in the remainder of this article can be used for the larger 40B variants as well.)<\/span><\/p>\n<div align=\"center\">\n<div id=\"attachment_5648259\" style=\"width: 712px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-5648259\" class=\" wp-image-5648259\" src=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/06\/openllm-2.png\" alt=\"\" width=\"702\" height=\"357\" srcset=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/06\/openllm-2.png 1988w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/06\/openllm-2-300x152.png 300w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/06\/openllm-2-1024x520.png 1024w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/06\/openllm-2-1536x780.png 1536w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/06\/openllm-2-300x152@2x.png 600w\" sizes=\"(max-width: 702px) 100vw, 702px\" \/><p id=\"caption-attachment-5648259\" class=\"wp-caption-text\">Notable open-source LLMs. Excerpt from the OpenLLM leaderboard<\/p><\/div>\n<\/div>\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">There are many parameter-efficient finetuning paradigms, as outlined in the excellent <\/span><span class=\"md-meta-i-c md-link\"><a href=\"https:\/\/arxiv.org\/abs\/2303.15647\"><span class=\"md-pair-s \"><em><span class=\"md-plain\">Scaling Down to Scale Up: A Guide to Parameter-Efficient Fine-Tuning<\/span><\/em><\/span><\/a><\/span><span class=\"md-plain\"> survey. <\/span><\/p>\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">All of these methods achieve the same goal: They let us train a model in a more parameter-efficient fashion compared to conventional finetuning, where we update the original model parameters. The big question is, which ones are most worthwhile in practice? 
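<\/span><\/p>
<p class=\"md-end-block md-p\"><span class=\"md-plain\">As a taste of the core idea behind one of them: LoRA freezes the pretrained weight matrix W of a layer and learns a low-rank update BA on top of it. The following NumPy sketch is illustrative only (made-up layer sizes; real implementations apply this to the attention projection matrices and include a scaling factor):<\/span><\/p>

```python
import numpy as np

d_in, d_out, rank = 512, 512, 8
rng = np.random.default_rng(0)

# Frozen pretrained weight of a linear layer; never updated during finetuning.
W = rng.normal(size=(d_out, d_in))

# Trainable low-rank factors. B starts at zero so that training begins
# exactly at the pretrained model's behavior.
A = rng.normal(size=(rank, d_in)) * 0.01
B = np.zeros((d_out, rank))

def lora_forward(x):
    # y = W x + B (A x); only A and B receive gradient updates.
    return W @ x + B @ (A @ x)

x = rng.normal(size=(d_in,))
y = lora_forward(x)

# Trainable parameters: rank * (d_in + d_out) instead of d_in * d_out.
n_trainable = A.size + B.size
n_full = W.size
print(n_trainable, n_full)
```

<p class=\"md-end-block md-p\"><span class=\"md-plain\">Here, the low-rank update adds 8,192 trainable parameters instead of the 262,144 of the full weight matrix; we return to the exact parameter counts for Falcon 7B below.<\/span><\/p>
<p class=\"md-end-block md-p\"><span class=\"md-plain\">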
<\/span><\/p>\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">Let&#8217;s start with a performance benchmark, then dive into how these different methods work.<\/span><\/p>\n<p>&nbsp;<\/p>\n<h2 class=\"md-end-block md-heading\"><span class=\"md-plain\">Performance Comparison<\/span><\/h2>\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">To use a common dataset for this performance benchmark, we will consider the popular <\/span><span class=\"md-meta-i-c md-link\"><a href=\"https:\/\/github.com\/gururise\/AlpacaDataCleaned\"><span class=\"md-plain\">Alpaca dataset<\/span><\/a><\/span><span class=\"md-plain\"> for instruction finetuning, which consists of 52k training examples. It&#8217;s structured as follows:<\/span><\/p>\n<ul class=\"ul-list\" data-mark=\"-\">\n<li class=\"md-list-item\">\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">Instruction: &#8220;Give three tips for staying healthy.&#8221;<\/span><\/p>\n<\/li>\n<li class=\"md-list-item\">\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">Output: &#8220;1. Eat a balanced diet and make sure to include plenty of fruits and vegetables. 2. Exercise regularly to keep your body active and strong. 3. Get enough sleep and maintain a consistent sleep schedule.&#8221;<\/span><\/p>\n<\/li>\n<\/ul>\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">The three methods we consider are:<\/span><\/p>\n<ul class=\"ul-list\" data-mark=\"-\">\n<li class=\"md-list-item\">\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">Low-Rank Adaptation (LoRA) (<\/span><span class=\"md-meta-i-c md-link\"><a href=\"https:\/\/arxiv.org\/abs\/2106.09685\"><span class=\"md-plain\">Hu et al. 2021<\/span><\/a><\/span><span class=\"md-plain\">);<\/span><\/p>\n<\/li>\n<li class=\"md-list-item\">\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">LLaMA Adapter (<\/span><span class=\"md-meta-i-c md-link\"><a href=\"https:\/\/arxiv.org\/abs\/2303.16199\"><span class=\"md-plain\">Zhang et al. 
2023<\/span><\/a><\/span><span class=\"md-plain\">);<\/span><\/p>\n<\/li>\n<li class=\"md-list-item\">\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">LLaMA-Adapter v2 (<\/span><span class=\"md-meta-i-c md-link\"><a href=\"https:\/\/arxiv.org\/abs\/2304.15010\"><span class=\"md-plain\">Gao et al. 2023<\/span><\/a><\/span><span class=\"md-plain\">).<\/span><\/p>\n<p class=\"md-end-block md-p\">\n<\/li>\n<\/ul>\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">Yes, we can use LLaMA-Adapter methods for finetuning &#8212; despite the name, these adapter methods are not specific to the LLaMA architecture, as we will discuss later.<\/span><\/p>\n<p class=\"md-end-block md-p\"><span class=\"md-pair-s \"><strong><span class=\"md-plain\">Preparing the model and dataset<\/span><\/strong><\/span><\/p>\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">For this benchmark, we will be using the <\/span><span class=\"md-meta-i-c md-link\"><a href=\"https:\/\/github.com\/Lightning-AI\/lit-parrot\"><span class=\"md-plain\">Lit-Parrot<\/span><\/a><\/span><span class=\"md-plain\"> open-source library, which provides efficient implementations for training and using various LLMs.<\/span><\/p>\n<div align=\"center\">\n<div id=\"attachment_5648260\" style=\"width: 2164px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-5648260\" class=\"size-full wp-image-5648260\" src=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/06\/lit-parrot.png\" alt=\"\" width=\"2154\" height=\"1690\" srcset=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/06\/lit-parrot.png 2154w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/06\/lit-parrot-300x235.png 300w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/06\/lit-parrot-1024x803.png 1024w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/06\/lit-parrot-1536x1205.png 1536w, 
https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/06\/lit-parrot-2048x1607.png 2048w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/06\/lit-parrot-300x235@2x.png 600w\" sizes=\"(max-width: 2154px) 100vw, 2154px\" \/><p id=\"caption-attachment-5648260\" class=\"wp-caption-text\">The Lit-Parrot repository (https:\/\/github.com\/Lightning-AI\/lit-parrot)<\/p><\/div>\n<\/div>\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">The first step is to download the model:<\/span><\/p>\n<p><pre class=\"code-shortcode dark-theme window- collapse-false \" style=\"--height:falsepx\"><code class=\"language-python\"><span role=\"presentation\">python scripts\/download.py --repo_id tiiuae\/falcon-7b<\/span><\/code><div class=\"copy-button\"><button class=\"expand-button\">Expand<\/button><button class=\"copy\">Copy<\/button><\/div><\/pre><\/p>\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">(This requires approximately 20 GB of storage.)<\/span><\/p>\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">Second, we convert the weights into a standardized form:<\/span><\/p>\n<p><pre class=\"code-shortcode dark-theme window- collapse-false \" style=\"--height:falsepx\"><code class=\"language-python\"><span role=\"presentation\">python scripts\/convert_hf_checkpoint.py --checkpoint_dir checkpoints\/tiiuae\/falcon-7b<\/span><\/code><div class=\"copy-button\"><button class=\"expand-button\">Expand<\/button><button class=\"copy\">Copy<\/button><\/div><\/pre><\/p>\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">Third, we have to download the dataset. 
For this example, we will be using the <a href=\"https:\/\/github.com\/gururise\/AlpacaDataCleaned\">Alpaca dataset<\/a> consisting of 52k instruction pairs:<\/span><\/p>\n<p class=\"md-end-block md-p\"><span class=\"md-plain\"><span role=\"presentation\"><pre class=\"code-shortcode dark-theme window- collapse-false \" style=\"--height:falsepx\"><code class=\"language-python\">python scripts\/prepare_alpaca.py --checkpoint_dir checkpoints\/tiiuae\/falcon-7b\/<\/code><div class=\"copy-button\"><button class=\"expand-button\">Expand<\/button><button class=\"copy\">Copy<\/button><\/div><\/pre><\/span><\/span><\/p>\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">(More on using custom datasets later.)<\/span><\/p>\n<p class=\"md-end-block md-p\"><span class=\"md-pair-s \"><strong><span class=\"md-plain\">Running the code<\/span><\/strong><\/span><\/p>\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">Now we can run the finetuning scripts for the Falcon 7B model. We will compare four different methods below. For now, we will focus on the finetuning results. 
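<\/span><\/p>
<p class=\"md-end-block md-p\"><span class=\"md-plain\">Before looking at the results, it helps to see what an instruction-finetuning example looks like once it is formatted into a prompt. The sketch below uses a typical Alpaca-style template; it is illustrative only, and the exact template applied by the prepare_alpaca.py script may differ in its details (for instance, Alpaca also supports an optional input field):<\/span><\/p>

```python
# Illustrative Alpaca-style prompt template (not necessarily the exact
# template used by scripts/prepare_alpaca.py).
PROMPT_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)

def format_example(instruction: str, output: str) -> str:
    # During finetuning, the model learns to continue the prompt with `output`.
    return PROMPT_TEMPLATE.format(instruction=instruction) + output

sample = format_example(
    "Give three tips for staying healthy.",
    "1. Eat a balanced diet. 2. Exercise regularly. 3. Get enough sleep.",
)
print(sample)
```

<p class=\"md-end-block md-p\"><span class=\"md-plain\">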
And we will discuss how these methods work later in this article.<\/span><\/p>\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">Adapter:<\/span><\/p>\n<p class=\"md-end-block md-p\"><pre class=\"code-shortcode dark-theme window- collapse-false \" style=\"--height:falsepx\"><code class=\"language-python\">python finetune\/adapter.py --checkpoint_dir checkpoints\/tiiuae\/falcon-7b\/<\/code><div class=\"copy-button\"><button class=\"expand-button\">Expand<\/button><button class=\"copy\">Copy<\/button><\/div><\/pre><\/p>\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">Adapter v2:<\/span><\/p>\n<pre class=\"code-shortcode dark-theme window- collapse-false \" style=\"--height:falsepx\"><code class=\"language-python\">python finetune\/adapter_v2.py --checkpoint_dir checkpoints\/tiiuae\/falcon-7b\/<\/code><div class=\"copy-button\"><button class=\"expand-button\">Expand<\/button><button class=\"copy\">Copy<\/button><\/div><\/pre>\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">LoRA:<\/span><\/p>\n<pre class=\"code-shortcode dark-theme window- collapse-false \" style=\"--height:falsepx\"><code class=\"language-python\">python finetune\/lora.py --checkpoint_dir checkpoints\/tiiuae\/falcon-7b\/<\/code><div class=\"copy-button\"><button class=\"expand-button\">Expand<\/button><button class=\"copy\">Copy<\/button><\/div><\/pre>\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">Full finetuning (updating all layers):<\/span><\/p>\n<pre class=\"code-shortcode dark-theme window- collapse-false \" style=\"--height:falsepx\"><code class=\"language-python\">python finetune\/full.py --checkpoint_dir checkpoints\/tiiuae\/falcon-7b\/<\/code><div class=\"copy-button\"><button 
class=\"expand-button\">Expand<\/button><button class=\"copy\">Copy<\/button><\/div><\/pre>\n<pre class=\"md-fences md-end-block md-fences-with-lineno ty-contain-cm modeLoaded\" lang=\"\" spellcheck=\"false\"><\/pre>\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">Let&#8217;s take a look at the time it takes to finetune the LLM first:<\/span><\/p>\n<div align=\"center\">\n<p class=\"md-end-block md-p\"><span class=\"md-image md-img-loaded\" data-src=\"figures\/training-time.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-5648261\" src=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/06\/training-time.png\" alt=\"\" width=\"758\" height=\"394\" srcset=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/06\/training-time.png 1710w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/06\/training-time-300x156.png 300w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/06\/training-time-1024x532.png 1024w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/06\/training-time-1536x798.png 1536w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/06\/training-time-300x156@2x.png 600w\" sizes=\"(max-width: 758px) 100vw, 758px\" \/><\/span><\/p>\n<\/div>\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">As we can see in the chart above, using a parameter-efficient finetuning method is about 9 times faster than finetuning all layers (&#8220;full&#8221;). 
Moreover, finetuning all layers required 6 GPUs due to memory constraints, whereas <\/span><span class=\"md-pair-s \"><strong><span class=\"md-plain\">the Adapter methods and LoRA could be used on a single GPU<\/span><\/strong><\/span><span class=\"md-plain\">.<\/span><\/p>\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">Speaking of GPU memory, the peak memory requirements are plotted below:<\/span><\/p>\n<div align=\"center\">\n<p class=\"md-end-block md-p\"><span class=\"md-image md-img-loaded\" data-src=\"figures\/memory-requirements.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-5648262\" src=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/06\/memory-requirements.png\" alt=\"\" width=\"623\" height=\"404\" srcset=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/06\/memory-requirements.png 1522w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/06\/memory-requirements-300x195.png 300w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/06\/memory-requirements-1024x665.png 1024w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/06\/memory-requirements-300x195@2x.png 600w\" sizes=\"(max-width: 623px) 100vw, 623px\" \/><\/span><\/p>\n<\/div>\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">Finetuning all layers of Falcon 7B required ~40 GB on each of the 6 GPUs (here, via tensor sharding using DeepSpeed). So, that&#8217;s 240 GB in total. 
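<\/span><\/p>
<p class=\"md-end-block md-p\"><span class=\"md-plain\">A rough back-of-envelope calculation makes this number plausible. Assuming bfloat16 weights and gradients, two float32 Adam states per trainable parameter, and ignoring activations and framework overhead, we get the following estimates (the parameter counts are the ones listed below):<\/span><\/p>

```python
# Back-of-envelope training-memory estimate. Assumptions (not exact
# measurements): bf16 weights and gradients, two fp32 Adam states per
# trainable parameter; activations and framework overhead ignored.
GB = 1024**3

def training_memory_gb(n_trainable, n_frozen=0,
                       bytes_weight=2,   # bf16 weight copy
                       bytes_grad=2,     # bf16 gradient
                       bytes_opt=8):     # two fp32 Adam states
    weights = (n_trainable + n_frozen) * bytes_weight
    grads_and_opt = n_trainable * (bytes_grad + bytes_opt)
    return (weights + grads_and_opt) / GB

n_falcon = 7_217_189_760   # Falcon 7B parameters
n_lora = 3_506_176         # trainable LoRA parameters

full_gb = training_memory_gb(n_falcon)
lora_gb = training_memory_gb(n_lora, n_frozen=n_falcon - n_lora)
print(f"full: ~{full_gb:.0f} GB, LoRA: ~{lora_gb:.1f} GB")
```

<p class=\"md-end-block md-p\"><span class=\"md-plain\">Under these assumptions, full finetuning already needs roughly 80 GB for weights, gradients, and optimizer states alone, before any activations, while the LoRA setup stays around 13&#8211;14 GB, which is in the same ballpark as the measured numbers.<\/span><\/p>
<p class=\"md-end-block md-p\"><span class=\"md-plain\">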
In contrast, the parameter-efficient finetuning methods only required ~16 GB of GPU memory, which even allows users to finetune these models on a single consumer-grade GPU.<\/span><\/p>\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">By the way, note that the memory requirements are directly related to the number of parameters that each method has to update:<\/span><\/p>\n<ul class=\"ul-list\" data-mark=\"-\">\n<li class=\"md-list-item\">\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">Full finetuning: 7,217,189,760<\/span><\/p>\n<\/li>\n<li class=\"md-list-item\">\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">Adapter: 1,365,330<\/span><\/p>\n<\/li>\n<li class=\"md-list-item\">\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">Adapter v2: 3,839,186<\/span><\/p>\n<\/li>\n<li class=\"md-list-item\">\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">LoRA: 3,506,176<\/span><\/p>\n<\/li>\n<\/ul>\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">Yes, that&#8217;s right: full finetuning (updating all layers) requires updating roughly 2,000 times more parameters than the Adapter v2 or LoRA methods, while the resulting modeling performance of the latter is equal to (and sometimes even better than) full finetuning, as reported in <\/span><span class=\"md-meta-i-c md-link\"><a href=\"https:\/\/arxiv.org\/abs\/2106.09685\"><span class=\"md-plain\">Hu et al. 
2021<\/span><\/a><\/span><span class=\"md-plain\">.<\/span><\/p>\n<p>And regarding inference speed, we have the following performance:<\/p>\n<ul>\n<li>LoRA: 21.33 tokens\/sec; Memory used: 14.59 GB (it is possible to merge the LoRA weights with the original weights to improve the performance to &gt;28 tokens\/s)<\/li>\n<li>Adapter: 26.22 tokens\/sec; Memory used: 14.59 GB<\/li>\n<li>Adapter v2: 24.73 tokens\/sec; Memory used: 14.59 GB<\/li>\n<\/ul>\n<p class=\"md-end-block md-p\"><span class=\"md-pair-s \"><strong><span class=\"md-plain\">Hyperparameters<\/span><\/strong><\/span><\/p>\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">If you want to replicate the results above, here is an overview of the hyperparameter settings I used:<\/span><\/p>\n<ul class=\"ul-list\" data-mark=\"-\">\n<li class=\"md-list-item\">\n<p class=\"md-end-block md-p\"><span class=\"md-pair-s\" spellcheck=\"false\"><code>bfloat16<\/code><\/span><span class=\"md-plain\"> precision (I wrote more about bfloat 16 in the article <\/span><span class=\"md-meta-i-c md-link\"><a href=\"https:\/\/lightning.ai\/pages\/community\/tutorial\/accelerating-large-language-models-with-mixed-precision-techniques\/\"><span class=\"md-plain\">Accelerating Large Language Models with Mixed-Precision Techniques<\/span><\/a><\/span><span class=\"md-plain\">). 
<\/span><\/p>\n<\/li>\n<li class=\"md-list-item\">\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">Also, the scripts were configured to train the models for 52k iterations (the size of the Alpaca dataset) using an effective batch size of 128 with gradient accumulation (more details on gradient accumulation in my article <\/span><span class=\"md-meta-i-c md-link\"><a href=\"https:\/\/lightning.ai\/pages\/blog\/gradient-accumulation\/\"><span class=\"md-plain\">Finetuning LLMs on a Single GPU Using Gradient Accumulation<\/span><\/a><\/span><span class=\"md-plain\">).<\/span><\/p>\n<\/li>\n<li class=\"md-list-item\">\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">For LoRA, I used a rank of 8 to roughly match the number of parameters added by Adapter v2.<\/span><\/p>\n<\/li>\n<li class=\"md-list-item\">\n<p class=\"md-end-block md-p\"><span class=\"md-pair-s\" spellcheck=\"false\"><code>adapter.py<\/code><\/span><span class=\"md-plain\">, <\/span><span class=\"md-pair-s\" spellcheck=\"false\"><code>adapter_v2.py<\/code><\/span><span class=\"md-plain\">, and <\/span><span class=\"md-pair-s\" spellcheck=\"false\"><code>lora.py<\/code><\/span><span class=\"md-plain\"> were trained on a single A100 GPU each. 
The <\/span><span class=\"md-pair-s\" spellcheck=\"false\"><code>full.py<\/code><\/span><span class=\"md-plain\"> script required 6 A100 GPUs and tensor sharding via DeepSpeed.<\/span><\/p>\n<\/li>\n<\/ul>\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">In addition, I uploaded the scripts with the modified settings <\/span><span class=\"md-meta-i-c md-link\"><a href=\"https:\/\/github.com\/rasbt\/LLM-finetuning-scripts\/tree\/main\/lit-benchmarks\/falcon-7b\"><span class=\"md-plain\">here<\/span><\/a><\/span><span class=\"md-plain\"> on GitHub for reference purposes.<\/span><\/p>\n<p>&nbsp;<\/p>\n<h2 class=\"md-end-block md-heading\"><span class=\"md-plain\">Quality Comparison<\/span><\/h2>\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">While a detailed performance benchmark on real-world tasks is out of scope for this blog article, the qualitative model performance of these methods is approximately the same. It matches the full finetuning performance discussed in the <\/span><span class=\"md-meta-i-c md-link\"><a href=\"https:\/\/arxiv.org\/abs\/2106.09685\"><span class=\"md-plain\">LoRA<\/span><\/a><\/span><span class=\"md-plain\"> and <\/span><span class=\"md-meta-i-c md-link\"><a href=\"https:\/\/arxiv.org\/abs\/2303.16199\"><span class=\"md-plain\">LLaMA-Adapter<\/span><\/a><\/span><span class=\"md-plain\"> papers.<\/span><\/p>\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">If you want to use and evaluate these models, you can use the following <\/span><span class=\"md-pair-s\" spellcheck=\"false\"><code>generate<\/code><\/span><span class=\"md-plain\"> scripts provided in lit-parrot, for example:<\/span><\/p>\n<pre class=\"code-shortcode dark-theme window- collapse-false \" style=\"--height:falsepx\"><code class=\"language-python\">python generate\/lora.py --checkpoint_dir checkpoints\/tiiuae\/falcon-7b --lora_path out\/lora\/alpaca\/lit_model_lora_finetuned.pth <\/code><\/pre>\n<h2 class=\"md-end-block md-heading\"><span class=\"md-plain\">LLaMA-Adapter<\/span><\/h2>\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">In short, the LLaMA-Adapter method (which we referred to as <\/span><span class=\"md-pair-s \"><em><span class=\"md-plain\">Adapter<\/span><\/em><\/span><span class=\"md-plain\"> in this blog post) adds a small number of trainable tensors (parameters) to an existing LLM. Here, the idea is that only the new parameters are trained, whereas the original parameters are left frozen. This can save a lot of compute and memory during backpropagation.<\/span><\/p>\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">In a bit more detail, LLaMA-Adapter prepends tunable prompt tensors (prefixes) to the embedded inputs. These prefixes are learned and maintained within an embedding table rather than being provided externally. Each transformer block in the model has its own distinct learned prefix, allowing for more tailored adaptation across different model layers.<\/span><\/p>\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">In addition, LLaMA-Adapter introduces a zero-initialized attention mechanism coupled with gating. 
The motivation behind this so-called <\/span><span class=\"md-pair-s \"><em><span class=\"md-plain\">zero-init<\/span><\/em><\/span><span class=\"md-plain\"> attention and gating is that adapters and prefix tuning could potentially disrupt the linguistic knowledge of the pretrained LLM by incorporating randomly initialized tensors (prefix prompts or adapter layers), resulting in unstable finetuning and high loss values during initial training phases.<\/span><\/p>\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">The main concept behind the LLaMA-Adapter method is illustrated in the visualization below, where the modified parts of a regular transformer block are highlighted in purple.<\/span><\/p>\n<div align=\"center\">\n<div id=\"attachment_5648263\" style=\"width: 686px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-5648263\" class=\"wp-image-5648263\" src=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/06\/llama-adapter.png\" alt=\"\" width=\"676\" height=\"349\" srcset=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/06\/llama-adapter.png 1670w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/06\/llama-adapter-300x155.png 300w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/06\/llama-adapter-1024x529.png 1024w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/06\/llama-adapter-1536x793.png 1536w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/06\/llama-adapter-300x155@2x.png 600w\" sizes=\"(max-width: 676px) 100vw, 676px\" \/><p id=\"caption-attachment-5648263\" class=\"wp-caption-text\">Outline of LLaMA-Adapter<\/p><\/div>\n<\/div>\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">One key idea is to add a small number of trainable parameters. 
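<\/span><\/p>\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">To make the prefix-and-gating idea concrete, here is a minimal PyTorch sketch. It is a hypothetical simplification (a single frozen linear layer stands in for a pretrained attention block, and the class name is made up for this example), not the lit-parrot implementation:<\/span><\/p>

```python
import torch
import torch.nn as nn

class GatedPrefixBlock(nn.Module):
    """Toy LLaMA-Adapter-style block (hypothetical names and shapes).

    A single frozen linear layer stands in for a pretrained attention block;
    only the learned prefix and the gate are trainable.
    """

    def __init__(self, embed_dim: int, prefix_len: int):
        super().__init__()
        self.frozen = nn.Linear(embed_dim, embed_dim)
        self.frozen.requires_grad_(False)  # original pretrained parameters stay frozen
        # Learned prefix, kept inside the model rather than supplied externally
        self.prefix = nn.Parameter(torch.randn(prefix_len, embed_dim) * 0.02)
        # Zero-initialized gate: the adapter path contributes nothing at step 0,
        # which avoids disrupting the pretrained model early in finetuning
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        base = self.frozen(x)
        # Simplified single-head attention from the input to the learned prefix
        attn = torch.softmax(x @ self.prefix.T / x.shape[-1] ** 0.5, dim=-1)
        return base + torch.tanh(self.gate) * (attn @ self.prefix)

block = GatedPrefixBlock(embed_dim=32, prefix_len=10)
x = torch.randn(5, 32)
# With the gate at zero, the block behaves exactly like the frozen layer
assert torch.allclose(block(x), block.frozen(x))
trainable = sum(p.numel() for p in block.parameters() if p.requires_grad)
print(trainable)  # 10 * 32 (prefix) + 1 (gate) = 321
```

<p class=\"md-end-block md-p\"><span class=\"md-plain\">Because the gate is zero-initialized, training starts from the unmodified pretrained behavior, and the gate gradually opens the adapter path as finetuning progresses.<\/span><\/p>\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">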
The other important thing to notice here is that the method is not unique to LLaMA LLMs &#8212; this is why we could use it to finetune the Falcon model, for example.<\/span><\/p>\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">If you are interested in additional details about the LLaMA-Adapter method, check out my article <\/span><span class=\"md-meta-i-c md-link\"><a href=\"https:\/\/lightning.ai\/pages\/community\/article\/understanding-llama-adapters\/\"><span class=\"md-plain\">Understanding Parameter-Efficient Finetuning of Large Language Models: From Prefix Tuning to LLaMA-Adapters<\/span><\/a><\/span><span class=\"md-plain\">.<\/span><\/p>\n<p>&nbsp;<\/p>\n<h2 class=\"md-end-block md-heading\"><span class=\"md-plain\">LLaMA-Adapter v2<\/span><\/h2>\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">When finetuning LLMs on text and instructions, the more recent LLaMA-Adapter v2 (<\/span><span class=\"md-meta-i-c md-link\"><a href=\"https:\/\/arxiv.org\/abs\/2304.15010\"><span class=\"md-plain\">Gao et al. 2023<\/span><\/a><\/span><span class=\"md-plain\">) increases the number of tunable parameters compared to LLaMA-Adapter v1 (<\/span><span class=\"md-meta-i-c md-link\"><a href=\"https:\/\/arxiv.org\/abs\/2303.16199\"><span class=\"md-plain\">Zhang et al. 2023<\/span><\/a><\/span><span class=\"md-plain\">). The first difference is that it adds bias units to the fully connected (linear) layers. Since it merely modifies the existing linear layers from <\/span><span class=\"md-pair-s\" spellcheck=\"false\"><code>input * weight<\/code><\/span><span class=\"md-plain\"> to <\/span><span class=\"md-pair-s\" spellcheck=\"false\"><code>input * weight + bias<\/code><\/span><span class=\"md-plain\">, this only has a small impact on the finetuning and inference performance.<\/span><\/p>\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">The second difference is that it makes the model&#8217;s RMSNorm layers trainable. 
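<\/span><\/p>\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">These two changes can be illustrated with a generic freezing routine. The helper below is a hypothetical sketch, not the lit-parrot code, and uses LayerNorm as a stand-in for the RMSNorm layers used in LLaMA and Falcon:<\/span><\/p>

```python
import torch.nn as nn

def mark_v2_style_trainable(model: nn.Module) -> int:
    """Hypothetical helper mirroring the two v2 changes: freeze everything,
    then unfreeze only linear-layer biases and normalization parameters.
    (LayerNorm stands in for the RMSNorm layers used in LLaMA and Falcon.)"""
    for param in model.parameters():
        param.requires_grad = False
    for module in model.modules():
        if isinstance(module, nn.Linear) and module.bias is not None:
            module.bias.requires_grad = True   # input * weight + bias
        elif isinstance(module, nn.LayerNorm):
            for param in module.parameters():
                param.requires_grad = True     # norm layers become trainable
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

toy = nn.Sequential(nn.Linear(64, 64), nn.LayerNorm(64), nn.Linear(64, 8))
trainable = mark_v2_style_trainable(toy)
total = sum(p.numel() for p in toy.parameters())
print(trainable, total)  # 200 trainable (72 biases + 128 norm) out of 4808
```

<p class=\"md-end-block md-p\"><span class=\"md-plain\">The returned count confirms that only the biases and norm parameters, a tiny fraction of the model, remain trainable.<\/span><\/p>\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">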
While this has a small effect on the training performance due to updating additional parameters, it doesn&#8217;t impact the inference speed since no new parameters are added to the network. <\/span><\/p>\n<p>&nbsp;<\/p>\n<h2 class=\"md-end-block md-heading\"><span class=\"md-plain\">Low-Rank Adaptation (LoRA)<\/span><\/h2>\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">Low-Rank Adaptation (<\/span><span class=\"md-meta-i-c md-link\"><a href=\"https:\/\/arxiv.org\/abs\/2106.09685\"><span class=\"md-plain\">Hu et al. 2021<\/span><\/a><\/span><span class=\"md-plain\">) is similar to the Adapter methods above in that it adds a small number of trainable parameters to the model while the original model parameters remain frozen. However, the underlying concept is fundamentally different from that of the LLaMA-Adapter methods.<\/span><\/p>\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">In short, LoRA decomposes a weight matrix into two smaller weight matrices, as illustrated below:<\/span><\/p>\n<div align=\"center\">\n<div id=\"attachment_5648264\" style=\"width: 295px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-5648264\" class=\"wp-image-5648264\" src=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/06\/lora-weights.png\" alt=\"\" width=\"285\" height=\"300\" srcset=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/06\/lora-weights.png 809w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/06\/lora-weights-285x300.png 285w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/06\/lora-weights-285x300@2x.png 570w\" sizes=\"(max-width: 285px) 100vw, 285px\" \/><p id=\"caption-attachment-5648264\" class=\"wp-caption-text\">Outline of LoRA<\/p><\/div>\n<\/div>\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">For more details about LoRA, please see my longer, more technical article <\/span><span 
class=\"md-meta-i-c md-link\"><a href=\"https:\/\/lightning.ai\/pages\/community\/tutorial\/lora-llm\/\"><span class=\"md-plain\">Parameter-Efficient LLM Finetuning With Low-Rank Adaptation (LoRA)<\/span><\/a><\/span><span class=\"md-plain\">.<\/span><\/p>\n<p>&nbsp;<\/p>\n<h2 class=\"md-end-block md-heading\"><span class=\"md-plain\">Finetuning LLMs On Your Custom Dataset<\/span><\/h2>\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">In this article, we ran a few performance benchmarks on the Alpaca (52k instructions) dataset. In practice, you may be curious about how to apply these methods to your own dataset. After all, the advantage of open-source LLMs is that we can finetune and customize them to our target data and tasks. <\/span><\/p>\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">In essence, all it takes to use any of these LLMs and techniques on your own dataset is to ensure they are formatted in a standardized form, which is described in more detail in Aniket Maurya&#8217;s blog post <\/span><span class=\"md-meta-i-c md-link\"><a href=\"https:\/\/lightning.ai\/pages\/blog\/how-to-finetune-gpt-like-large-language-models-on-a-custom-dataset\/\"><span class=\"md-plain\">How To Finetune GPT Like Large Language Models on a Custom Dataset<\/span><\/a><\/span><span class=\"md-plain\">.<\/span><\/p>\n<p>&nbsp;<\/p>\n<h2 class=\"md-end-block md-heading\"><span class=\"md-plain\">Conclusion<\/span><\/h2>\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">In this article, we saw how we can fientune a state-of-the-art open-source LLM like Falcon on a single GPU using LLaMA-Adapter methods on LoRA. 
<\/span><\/p>\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">Whereas the conventional finetuning of all layers takes 9 hours and requires at least 6 A100 GPUs with 40 GB of RAM each, the parameter-efficient finetuning methods highlighted in this article can finetune the same model 9x faster on a single GPU, requiring 15x less GPU memory.<\/span><\/p>\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">If you are interested in adopting these methods for your own projects, check out the open-source <\/span><span class=\"md-meta-i-c md-link\"><a href=\"https:\/\/github.com\/Lightning-AI\/lit-parrot\/\"><span class=\"md-plain\">Lit-Parrot<\/span><\/a><\/span><span class=\"md-plain\"> repository to get started.<\/span><\/p>\n<p class=\"md-end-block md-p\"><span class=\"md-pair-s \"><strong><span class=\"md-plain\">Acknowledgements<\/span><\/strong><\/span><\/p>\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">I want to thank Carlos Mochol\u00ed, who has been a big help with fixing my LoRA scripts. Also, a big shoutout to Adrian W\u00e4lchli and Luca Antiga for integrating Falcon into the Lit-Parrot repository.<\/span><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Finetuning allows us to adapt pretrained LLMs in a cost-efficient manner. But which method should we use? This article compares different parameter-efficient finetuning methods for the latest top-performing open-source LLM, Falcon. &nbsp; Pretraining and Finetuning LLMs Before we dive into the LLM finetuning details, let&#8217;s briefly recap how we train LLMs in general. 
LLMs are<a class=\"excerpt-read-more\" href=\"https:\/\/lightning.ai\/pages\/community\/finetuning-falcon-efficiently\/\" title=\"ReadFinetuning Falcon LLMs More Efficiently With LoRA and Adapters\">&#8230; Read more &raquo;<\/a><\/p>\n","protected":false}}