{"id":5647720,"date":"2023-04-12T22:09:40","date_gmt":"2023-04-13T02:09:40","guid":{"rendered":"https:\/\/lightning.ai\/pages\/?p=5647720"},"modified":"2023-06-22T13:31:06","modified_gmt":"2023-06-22T17:31:06","slug":"understanding-llama-adapters","status":"publish","type":"post","link":"https:\/\/lightning.ai\/pages\/community\/article\/understanding-llama-adapters\/","title":{"rendered":"Understanding Parameter-Efficient Finetuning of Large Language Models: From Prefix Tuning to LLaMA-Adapters"},"content":{"rendered":"<div class=\"takeaways card-glow p-4 my-4\"><h3 class=\"w-100 d-block\">Key takeaway<\/h3> Learn how popular parameter-efficient finetuning methods for LLMs work: prefix tuning, adapters, and LLaMA-Adapter. <\/div>\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">In the rapidly evolving field of artificial intelligence, utilizing large language models in an efficient and effective manner has become increasingly important.<\/span><\/p>\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">Parameter-efficient finetuning stands at the forefront of this pursuit, allowing researchers and practitioners to reuse pretrained models while minimizing their computational and resource footprints. It also allows us to train AI models on a broader range of hardware, including devices with limited computational power, such as laptops, smartphones, and IoT devices. Lastly, with the increasing focus on environmental sustainability, parameter-efficient finetuning reduces the energy consumption and carbon footprint associated with training large-scale AI models.<\/span><\/p>\n<p class=\"md-end-block md-p\"><span class=\"md-plain\">This article explains the broad concept of finetuning and discusses popular parameter-efficient alternatives like prefix tuning and adapters. 
Finally, we will look at the recent LLaMA-Adapter method and see how we can use it in practice.<\/span><\/p>\n<p>&nbsp;<\/p>\n<h2 id=\"finetuning-large-language-models\">Finetuning Large Language Models<\/h2>\n<p class=\"md-end-block md-p md-focus\"><span class=\"md-plain\">Since GPT-2 (<\/span><span class=\"md-meta-i-c md-link\"><a href=\"https:\/\/d4mucfpksywv.cloudfront.net\/better-language-models\/language_models_are_unsupervised_multitask_learners.pdf\"><span class=\"md-plain\">Radford et al.<\/span><\/a><\/span><span class=\"md-plain\">) and GPT-3 (<\/span><span class=\"md-meta-i-c md-link\"><a href=\"https:\/\/arxiv.org\/abs\/2005.14165\"><span class=\"md-plain\">Brown et al.<\/span><\/a><\/span><span class=\"md-plain md-expand\">), we have seen that generative large language models (LLMs) pretrained on a general text corpus are capable of in-context learning, which doesn&#8217;t require us to further train or finetune pretrained LLMs if we want to perform specific or new tasks that the LLM wasn&#8217;t explicitly trained on. 
Instead, we can directly provide a few examples of a target task via the input prompt, as illustrated in the example below.<\/p>\n<p>&nbsp;<br \/>\n<\/span><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-5647721 aligncenter\" src=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/04\/in-context.png\" alt=\"In-context learning example.\" width=\"476\" height=\"242\" srcset=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/04\/in-context.png 1340w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/04\/in-context-300x152.png 300w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/04\/in-context-1024x520.png 1024w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/04\/in-context-300x152@2x.png 600w\" sizes=\"(max-width: 476px) 100vw, 476px\" \/><br \/>\n&nbsp;<\/p>\n<p>In-context learning is a valuable and user-friendly method for situations where direct access to the large language model (LLM) is limited, such as when interacting with the LLM through an API or user interface.<br \/>\nHowever, if we have access to the LLM, adapting and finetuning it on a target task using data from a target domain usually leads to superior results. So, how can we adapt a model to a target task? 
There are three conventional approaches outlined in the figure below.<\/p>\n<p>&nbsp;<br \/>\n<img loading=\"lazy\" decoding=\"async\" class=\"wp-image-5647722 alignnone\" src=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/04\/classic-flowchart.png\" alt=\"the three classic finetuning approaches\" width=\"904\" height=\"315\" srcset=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/04\/classic-flowchart.png 2394w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/04\/classic-flowchart-300x105.png 300w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/04\/classic-flowchart-1024x357.png 1024w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/04\/classic-flowchart-1536x535.png 1536w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/04\/classic-flowchart-2048x713.png 2048w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/04\/classic-flowchart-300x105@2x.png 600w\" sizes=\"(max-width: 904px) 100vw, 904px\" \/><br \/>\n&nbsp;<\/p>\n<p><span data-preserver-spaces=\"true\">These methods above are compatible with generative (decoder-style) models such as GPT and embedding-focused (encoder-style) models such as BERT. In contrast to these three approaches, in-context learning only applies to generative models. It&#8217;s also worth highlighting that when we finetune generative models, we work with and build on the embeddings they create instead of the generated output texts.<\/span><\/p>\n<p>&nbsp;<br \/>\n<strong><span data-preserver-spaces=\"true\">Feature-based approach<\/span><\/strong><\/p>\n<p><span data-preserver-spaces=\"true\">In the feature-based approach, we load a pretrained LLM and apply it to our target dataset. Here, we are particularly interested in generating the output embeddings for the training set, which we can use as input features to train a classification model. 
While this approach is particularly common for embedding-focused models like BERT, we can also extract embeddings from generative GPT-style models (you can find an example in our blog post\u00a0<\/span><a class=\"editor-rtfLink\" href=\"https:\/\/lightning.ai\/pages\/blog\/gradient-accumulation\/\" target=\"_blank\" rel=\"noopener\"><span data-preserver-spaces=\"true\">here<\/span><\/a><span data-preserver-spaces=\"true\">).<\/span><\/p>\n<p><span data-preserver-spaces=\"true\">The classification model can then be a logistic regression model, a random forest, or XGBoost &#8212; whatever our hearts desire. (However, based on my experience, linear classifiers like logistic regression perform best here.)<\/span><\/p>\n<p><span data-preserver-spaces=\"true\">Conceptually, we can illustrate the feature-based approach with the following code:<\/span><\/p>\n<p>&nbsp;<\/p>\n<pre class=\"code-shortcode dark-theme window- collapse-600 \" style=\"--height:600px\"><code class=\"language-python\"><br \/>\nimport numpy as np<br \/>\nimport torch<br \/>\nfrom transformers import AutoModel\n\nmodel = AutoModel.from_pretrained(\"distilbert-base-uncased\")\n\n# ...<br \/>\n# tokenize dataset<br \/>\n# ...\n\n# generate embeddings<br \/>\n@torch.inference_mode()<br \/>\ndef get_output_embeddings(batch):<br \/>\n    output = model(<br \/>\n        batch[\"input_ids\"],<br \/>\n        attention_mask=batch[\"attention_mask\"]<br \/>\n    ).last_hidden_state[:, 0]<br \/>\n    return {\"features\": output}\n\ndataset_features = dataset_tokenized.map(<br \/>\n  get_output_embeddings, batched=True, batch_size=10)\n\nX_train = np.array(dataset_features[\"train\"][\"features\"])<br \/>\ny_train = np.array(dataset_features[\"train\"][\"label\"])\n\nX_val = np.array(dataset_features[\"validation\"][\"features\"])<br \/>\ny_val = np.array(dataset_features[\"validation\"][\"label\"])\n\nX_test = np.array(dataset_features[\"test\"][\"features\"])<br \/>\ny_test = np.array(dataset_features[\"test\"][\"label\"])\n\n# train classifier<br \/>\nfrom sklearn.linear_model import 
LogisticRegression\n\nclf = LogisticRegression()<br \/>\nclf.fit(X_train, y_train)\n\nprint(\"Training accuracy\", clf.score(X_train, y_train))<br \/>\nprint(\"Validation accuracy\", clf.score(X_val, y_val))<br \/>\nprint(\"Test accuracy\", clf.score(X_test, y_test))<br \/>\n<\/code><div class=\"copy-button\"><button class=\"expand-button\">Expand<\/button><button class=\"copy\">Copy<\/button><\/div><\/pre>\n<p>&nbsp;<\/p>\n<p>(Interested readers can find the full code example <a class=\"notion-link-token notion-enable-hover\" href=\"https:\/\/github.com\/rasbt\/blog-finetuning-llama-adapters\/blob\/main\/three-conventional-methods\/1_distilbert-feature-extractor.ipynb\" rel=\"noopener noreferrer\" data-token-index=\"1\"><span class=\"link-annotation-unknown-block-id-801242578\">here<\/span><\/a>.)<\/p>\n<p>&nbsp;<br \/>\n<strong><span data-preserver-spaces=\"true\">Finetuning I &#8212; Updating The Output Layers<\/span><\/strong><\/p>\n<p><span data-preserver-spaces=\"true\">A popular approach related to the feature-based approach described above is finetuning the output layers (we will refer to this approach as\u00a0<\/span><em><span data-preserver-spaces=\"true\">finetuning I<\/span><\/em><span data-preserver-spaces=\"true\">). Similar to the feature-based approach, we keep the parameters of the pretrained LLM frozen. 
We only train the newly added output layers, analogous to training a logistic regression classifier or small multilayer perceptron on the embedded features.<\/span><\/p>\n<p><span data-preserver-spaces=\"true\">In code, this would look as follows:<\/span><\/p>\n<p>&nbsp;<br \/>\n<pre class=\"code-shortcode dark-theme window- collapse-600 \" style=\"--height:600px\"><code class=\"language-python\"><br \/>\nmodel = AutoModelForSequenceClassification.from_pretrained(<br \/>\n    \"distilbert-base-uncased\",<br \/>\n     num_labels=2  # suppose target task is a binary classification task<br \/>\n) \n\n# freeze all layers<br \/>\nfor param in model.parameters():<br \/>\n    param.requires_grad = False\n\n# then unfreeze the two last layers (output layers)<br \/>\nfor param in model.pre_classifier.parameters():<br \/>\n    param.requires_grad = True\n\nfor param in model.classifier.parameters():<br \/>\n    param.requires_grad = True\n\n# finetune model<br \/>\nlightning_model = CustomLightningModule(model)\n\ntrainer = L.Trainer(<br \/>\n    max_epochs=3,<br \/>\n    ...<br \/>\n)\n\ntrainer.fit(<br \/>\n  model=lightning_model,<br \/>\n  train_dataloaders=train_loader,<br \/>\n  val_dataloaders=val_loader)\n\n# evaluate model<br \/>\ntrainer.test(lightning_model, dataloaders=test_loader)<br \/>\n<\/code><div class=\"copy-button\"><button class=\"expand-button\">Expand<\/button><button class=\"copy\">Copy<\/button><\/div><\/pre><\/p>\n<p>&nbsp;<\/p>\n<p><span data-preserver-spaces=\"true\">(Interested readers can find the complete code example\u00a0<\/span><a class=\"editor-rtfLink\" href=\"https:\/\/github.com\/rasbt\/blog-finetuning-llama-adapters\/blob\/main\/three-conventional-methods\/2_finetune-last-layers.ipynb\" target=\"_blank\" rel=\"noopener\"><span data-preserver-spaces=\"true\">here<\/span><\/a><span data-preserver-spaces=\"true\">.)<\/span><\/p>\n<p><span data-preserver-spaces=\"true\">In theory, this approach should perform similarly well, in terms of 
modeling performance and speed, as the feature-based approach since we use the same frozen backbone model. However, since the feature-based approach makes it slightly easier to pre-compute and store the embedded features for the training dataset, the feature-based approach may be more convenient for specific practical scenarios.<\/span><\/p>\n<p>&nbsp;<br \/>\n<strong><span data-preserver-spaces=\"true\">Finetuning II &#8212; Updating All Layers<\/span><\/strong><\/p>\n<p><span data-preserver-spaces=\"true\">The original BERT paper (<\/span><a class=\"editor-rtfLink\" href=\"https:\/\/arxiv.org\/abs\/1810.04805\" target=\"_blank\" rel=\"noopener\"><span data-preserver-spaces=\"true\">Devlin et al.<\/span><\/a><span data-preserver-spaces=\"true\">) reported that finetuning only the output layer can result in modeling performance comparable to finetuning all layers, which is substantially more expensive since more parameters are involved. For instance, a BERT base model has approximately 110 million parameters. However, the final layer of a BERT base model for binary classification consists of merely 1,500 parameters. Furthermore, the last two layers of a BERT base model (the pre-classifier and classifier layers) account for roughly 600,000 parameters &#8212; that&#8217;s only around 0.5% of the total model size.<\/span><\/p>\n<p><span data-preserver-spaces=\"true\">Our mileage will vary based on how similar our target task and target domain are to the dataset the model was pretrained on. But in practice, finetuning all layers almost always results in superior modeling performance.<\/span><\/p>\n<p><span data-preserver-spaces=\"true\">So, when optimizing the modeling performance, the gold standard for using pretrained LLMs is to update all layers (here referred to as finetuning II). Conceptually, finetuning II is very similar to finetuning I. 
The only difference is that we do not freeze the parameters of the pretrained LLM but finetune them as well:<\/span><\/p>\n<p>&nbsp;<br \/>\n<pre class=\"code-shortcode dark-theme window- collapse-600 \" style=\"--height:600px\"><code class=\"language-python\"><br \/>\nmodel = AutoModelForSequenceClassification.from_pretrained(<br \/>\n    \"distilbert-base-uncased\",<br \/>\n     num_labels=2  # suppose target task is a binary classification task<br \/>\n) \n\n# don't freeze layers<br \/>\n# for param in model.parameters():<br \/>\n#    param.requires_grad = False\n\n# finetune model<br \/>\nlightning_model = CustomLightningModule(model)\n\ntrainer = L.Trainer(<br \/>\n    max_epochs=3,<br \/>\n    ...<br \/>\n)\n\ntrainer.fit(<br \/>\n  model=lightning_model,<br \/>\n  train_dataloaders=train_loader,<br \/>\n  val_dataloaders=val_loader)\n\n# evaluate model<br \/>\ntrainer.test(lightning_model, dataloaders=test_loader)<br \/>\n<\/code><div class=\"copy-button\"><button class=\"expand-button\">Expand<\/button><button class=\"copy\">Copy<\/button><\/div><\/pre><\/p>\n<p>&nbsp;<\/p>\n<p>(Interested readers can find the complete code example here.)<\/p>\n<p>If you are curious about some real-world results, the code snippets above were used to train a movie review classifier using a pretrained DistilBERT base model (you can access the code notebooks here):<br \/>\nFeature-based approach with logistic regression: 83% test accuracy<br \/>\nFinetuning I, updating the last 2 layers: 87% accuracy<br \/>\nFinetuning II, updating all layers: 92% accuracy.<br \/>\nThese results are consistent with the general rule of thumb that finetuning more layers often results in better performance, but it comes with increased cost.<\/p>\n<p>&nbsp;<br \/>\n<img loading=\"lazy\" decoding=\"async\" class=\"wp-image-5647724 aligncenter\" src=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/04\/classic-performance.png\" alt=\"finetuning performance trade-offs\" width=\"657\" 
height=\"242\" srcset=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/04\/classic-performance.png 1454w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/04\/classic-performance-300x111.png 300w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/04\/classic-performance-1024x377.png 1024w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/04\/classic-performance-300x111@2x.png 600w\" sizes=\"(max-width: 657px) 100vw, 657px\" \/><br \/>\n&nbsp;<\/p>\n<p>&nbsp;<\/p>\n<h2 id=\"parameter-efficient-finetuning\"><strong><span data-preserver-spaces=\"true\">Parameter-Efficient Finetuning<\/span><\/strong><\/h2>\n<p><span data-preserver-spaces=\"true\">In the previous sections, we learned that finetuning more layers usually leads to better results. Now, the experiments above are based on a DistilBERT model, which is relatively small. What if we want to finetune larger models that only barely fit into GPU memory, for example, the latest generative LLMs? We can use the feature-based or finetuning I approach above, of course. But what if we want a modeling quality similar to finetuning II?<\/span><\/p>\n<p><span data-preserver-spaces=\"true\">Over the years, researchers developed several techniques (<\/span><a class=\"editor-rtfLink\" href=\"https:\/\/arxiv.org\/abs\/2303.15647\" target=\"_blank\" rel=\"noopener\"><span data-preserver-spaces=\"true\">Lialin et al.<\/span><\/a><span data-preserver-spaces=\"true\">) to finetune LLMs with high modeling performance while requiring the training of only a small number of parameters. 
These methods are usually referred to as parameter-efficient finetuning techniques (PEFT).<\/span><\/p>\n<p><span data-preserver-spaces=\"true\">Some of the most widely used PEFT techniques are summarized in the figure below.<\/span><\/p>\n<p>&nbsp;<br \/>\n<img loading=\"lazy\" decoding=\"async\" class=\" wp-image-5647725 aligncenter\" src=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/04\/popular-methods.png\" alt=\"popular LLM finetuning methods\" width=\"629\" height=\"173\" srcset=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/04\/popular-methods.png 2262w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/04\/popular-methods-300x82.png 300w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/04\/popular-methods-1024x282.png 1024w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/04\/popular-methods-1536x422.png 1536w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/04\/popular-methods-2048x563.png 2048w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/04\/popular-methods-300x82@2x.png 600w\" sizes=\"(max-width: 629px) 100vw, 629px\" \/><br \/>\n&nbsp;<\/p>\n<p><span data-preserver-spaces=\"true\">One PEFT technique that recently made big waves is LLaMA-Adapter, which was proposed for Meta&#8217;s popular LLaMA model (<\/span><a class=\"editor-rtfLink\" href=\"https:\/\/arxiv.org\/abs\/2302.13971\" target=\"_blank\" rel=\"noopener\"><span data-preserver-spaces=\"true\">Touvron et al.<\/span><\/a><span data-preserver-spaces=\"true\">) &#8212; however, while LLaMA-Adapter was proposed in the context of LLaMA, the idea is model-agnostic.<\/span><\/p>\n<p><span data-preserver-spaces=\"true\">To understand how LLaMA-Adapter works, we have to take a (small) step back and review two related techniques called\u00a0<\/span><em><span data-preserver-spaces=\"true\">prefix tuning<\/span><\/em><span 
data-preserver-spaces=\"true\">\u00a0and\u00a0<\/span><em><span data-preserver-spaces=\"true\">adapters<\/span><\/em><span data-preserver-spaces=\"true\">\u00a0&#8212; LLaMA-Adapter (<\/span><a class=\"editor-rtfLink\" href=\"https:\/\/arxiv.org\/abs\/2303.16199\" target=\"_blank\" rel=\"noopener\"><span data-preserver-spaces=\"true\">Zhang et al.<\/span><\/a><span data-preserver-spaces=\"true\">) combines and extends both of these ideas.<\/span><\/p>\n<p><span data-preserver-spaces=\"true\">So, in the remainder of this article, we will discuss the various concepts of prompt modifications to understand prefix tuning and adapter methods before we take a closer look at LLaMA-Adapter. (And we will save low-rank adaptation for a future article.)<\/span><\/p>\n<p>&nbsp;<\/p>\n<h2 id=\"prompt-tuning-and-prefix-tuning\"><strong><span data-preserver-spaces=\"true\">Prompt Tuning And Prefix Tuning<\/span><\/strong><\/h2>\n<p><span data-preserver-spaces=\"true\">The original concept of prompt tuning refers to techniques that vary the input prompt to achieve better modeling results. For example, suppose we are interested in translating an English sentence into German. 
We can ask the model in various ways, as illustrated below.<\/span><\/p>\n<p>&nbsp;<br \/>\n<img loading=\"lazy\" decoding=\"async\" class=\" wp-image-5647726 aligncenter\" src=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/04\/hard-prompting.png\" alt=\"an example of hard-prompting\" width=\"776\" height=\"162\" srcset=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/04\/hard-prompting.png 1582w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/04\/hard-prompting-300x63.png 300w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/04\/hard-prompting-1024x214.png 1024w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/04\/hard-prompting-1536x320.png 1536w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/04\/hard-prompting-300x63@2x.png 600w\" sizes=\"(max-width: 776px) 100vw, 776px\" \/><br \/>\n&nbsp;<\/p>\n<p><span data-preserver-spaces=\"true\">Now, the concept illustrated above is referred to as\u00a0<\/span><em><span data-preserver-spaces=\"true\">hard<\/span><\/em><span data-preserver-spaces=\"true\">\u00a0prompt tuning since we directly change the discrete input tokens, which are not differentiable.\u00a0<\/span><\/p>\n<p><span data-preserver-spaces=\"true\">In contrast to\u00a0<\/span><em><span data-preserver-spaces=\"true\">hard<\/span><\/em><span data-preserver-spaces=\"true\">\u00a0prompt tuning,\u00a0<\/span><em><span data-preserver-spaces=\"true\">soft<\/span><\/em><span data-preserver-spaces=\"true\">\u00a0prompt tuning concatenates the embeddings of the input tokens with a trainable tensor that can be optimized via backpropagation to improve the modeling performance on a target task.\u00a0<\/span><\/p>\n<p><span data-preserver-spaces=\"true\">A specific flavor of prompt tuning is prefix tuning (<\/span><a class=\"editor-rtfLink\" href=\"https:\/\/arxiv.org\/abs\/2101.00190\" target=\"_blank\" rel=\"noopener\"><span 
data-preserver-spaces=\"true\">Li and Liang<\/span><\/a><span data-preserver-spaces=\"true\">). The idea in prefix tuning is to add a trainable tensor to each transformer block instead of only the input embeddings, as in\u00a0<\/span><em><span data-preserver-spaces=\"true\">soft<\/span><\/em><span data-preserver-spaces=\"true\">\u00a0prompt tuning. The following figure illustrates the difference between a regular transformer block and a transformer block modified with a prefix.<\/span><\/p>\n<p>&nbsp;<br \/>\n<img loading=\"lazy\" decoding=\"async\" class=\" wp-image-5647727 aligncenter\" src=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/04\/prefix-tuning.png\" alt=\"prefix-tuning for LLMs\" width=\"852\" height=\"503\" srcset=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/04\/prefix-tuning.png 2104w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/04\/prefix-tuning-300x177.png 300w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/04\/prefix-tuning-1024x604.png 1024w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/04\/prefix-tuning-1536x907.png 1536w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/04\/prefix-tuning-2048x1209.png 2048w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/04\/prefix-tuning-300x177@2x.png 600w\" sizes=\"(max-width: 852px) 100vw, 852px\" \/><br \/>\n&nbsp;<\/p>\n<p>Note that in the figure above, the &#8220;fully connected layers&#8221; refer to a small multilayer perceptron (two fully connected layers with a nonlinear activation function in-between). 
These fully connected layers embed the soft prompt in a feature space with the same dimensionality as the transformer-block input to ensure compatibility for concatenation.<br \/>\nUsing (Python) pseudo-code, we can illustrate the difference between a regular transformer block and a prefix-modified transformer block as follows:<\/p>\n<p>&nbsp;<br \/>\n<img loading=\"lazy\" decoding=\"async\" class=\" wp-image-5647728 aligncenter\" src=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/04\/prefix-code.png\" alt=\"transformer blog with prefix code\" width=\"609\" height=\"318\" srcset=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/04\/prefix-code.png 1360w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/04\/prefix-code-300x157.png 300w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/04\/prefix-code-1024x535.png 1024w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/04\/prefix-code-300x157@2x.png 600w\" sizes=\"(max-width: 609px) 100vw, 609px\" \/><br \/>\n&nbsp;<\/p>\n<p><span data-preserver-spaces=\"true\">According to the original\u00a0<\/span><a class=\"editor-rtfLink\" href=\"https:\/\/arxiv.org\/abs\/2101.00190\" target=\"_blank\" rel=\"noopener\"><span data-preserver-spaces=\"true\">prefix tuning<\/span><\/a><span data-preserver-spaces=\"true\">\u00a0paper, prefix tuning achieves comparable modeling performance to finetuning all layers while only requiring the training of 0.1% of the parameters &#8212; the experiments were based on GPT-2 models. 
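To complement the pseudo-code in the figure above, here is a self-contained, runnable sketch of a prefix-modified transformer block in PyTorch. All names and dimensions (embedding size 512, prefix length 10, hidden size 128) are hypothetical choices for illustration, not values from the prefix tuning paper:

```python
import torch
import torch.nn as nn

# Hypothetical dimensions, chosen only for illustration.
embed_dim, prefix_len, hidden_dim = 512, 10, 128

class PrefixBlock(nn.Module):
    """A simplified transformer block with a trainable prefix prepended to its input."""

    def __init__(self):
        super().__init__()
        # The trainable soft prompt, shared across all inputs.
        self.soft_prompt = nn.Parameter(torch.randn(prefix_len, hidden_dim))
        # Small MLP (the "fully connected layers" in the figure) that maps the
        # soft prompt into the block's input dimensionality.
        self.prompt_mlp = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, embed_dim),
        )
        self.attn = nn.MultiheadAttention(embed_dim, num_heads=8, batch_first=True)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x):  # x: (batch, seq_len, embed_dim)
        prefix = self.prompt_mlp(self.soft_prompt)               # (prefix_len, embed_dim)
        prefix = prefix.unsqueeze(0).expand(x.shape[0], -1, -1)  # broadcast over batch
        x = torch.cat([prefix, x], dim=1)                        # prepend along sequence axis
        attn_out, _ = self.attn(x, x, x)
        return self.norm(x + attn_out)

block = PrefixBlock()
out = block(torch.randn(4, 20, embed_dim))
print(out.shape)  # torch.Size([4, 30, 512]) -- the sequence grew by prefix_len
```

During finetuning, only `soft_prompt` and `prompt_mlp` would receive gradient updates, while the attention and normalization weights stay frozen.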
Moreover, in many cases, prefix tuning even outperformed the finetuning of all layers, which is likely because fewer parameters are involved, which helps reduce overfitting on smaller target datasets.<\/span><\/p>\n<p><span data-preserver-spaces=\"true\">Lastly, to clarify the use of soft prompts during inference: after learning a soft prompt, we have to supply it as a prefix when performing the specific task we finetuned the model on. This allows the model to tailor its responses to that particular task. Moreover, we can have multiple soft prompts, each corresponding to a different task, and provide the appropriate prefix during inference to achieve optimal results for a particular task.<\/span><\/p>\n<p>&nbsp;<\/p>\n<h2 id=\"adapters\"><strong><span data-preserver-spaces=\"true\">Adapters<\/span><\/strong><\/h2>\n<p><span data-preserver-spaces=\"true\">The original\u00a0<\/span><em><span data-preserver-spaces=\"true\">adapter<\/span><\/em><span data-preserver-spaces=\"true\">\u00a0method (<\/span><a class=\"editor-rtfLink\" href=\"https:\/\/arxiv.org\/abs\/1902.00751\" target=\"_blank\" rel=\"noopener\"><span data-preserver-spaces=\"true\">Houlsby et al.<\/span><\/a><span data-preserver-spaces=\"true\">) is somewhat related to the aforementioned\u00a0<\/span><em><span data-preserver-spaces=\"true\">prefix tuning<\/span><\/em><span data-preserver-spaces=\"true\">\u00a0as they also add additional parameters to each transformer block. 
However, instead of prepending prefixes to the input embeddings, the adapter method adds adapter layers in two places, as illustrated in the figure below.<\/span><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\" wp-image-5647729 aligncenter\" src=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/04\/adapter-outline.png\" alt=\"\" width=\"827\" height=\"442\" srcset=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/04\/adapter-outline.png 2296w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/04\/adapter-outline-300x160.png 300w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/04\/adapter-outline-1024x548.png 1024w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/04\/adapter-outline-1536x822.png 1536w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/04\/adapter-outline-2048x1095.png 2048w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/04\/adapter-outline-300x160@2x.png 600w\" sizes=\"(max-width: 827px) 100vw, 827px\" \/><\/p>\n<p>And for readers who prefer (Python) pseudo-code, the adapter layer can be written as follows:<\/p>\n<p>&nbsp;<br \/>\n<img loading=\"lazy\" decoding=\"async\" class=\" wp-image-5647730 aligncenter\" src=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/04\/adapter.png\" alt=\"LLM Adapter Code\" width=\"397\" height=\"280\" srcset=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/04\/adapter.png 1118w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/04\/adapter-300x211.png 300w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/04\/adapter-1024x722.png 1024w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/04\/adapter-300x211@2x.png 600w\" sizes=\"(max-width: 397px) 100vw, 397px\" \/><br \/>\n&nbsp;<\/p>\n<p class=\"md-end-block md-p md-focus\"><span class=\"md-plain md-expand\">Note that the fully connected layers of 
the adapters are usually relatively small and have a bottleneck structure similar to autoencoders. Each adapter block&#8217;s first fully connected layer projects the input down onto a low-dimensional representation. The second fully connected layer projects that representation back up to the input dimension. How is this parameter-efficient? For example, assume the first fully connected layer projects a 1,024-dimensional input down to 24 dimensions, and the second fully connected layer projects it back into 1,024 dimensions. This means we introduced 1,024 x 24 + 24 x 1,024 = 49,152 weight parameters. In contrast, a single fully connected layer that reprojects a 1,024-dimensional input into a 1,024-dimensional space would have 1,024 x 1,024 = 1,048,576 parameters.<\/span><\/p>\n<p class=\"md-end-block md-p md-focus\"><span class=\"md-plain\">According to the original <\/span><span class=\"md-meta-i-c md-link\"><a href=\"https:\/\/arxiv.org\/abs\/1902.00751\"><span class=\"md-plain\">adapter paper<\/span><\/a><\/span><span class=\"md-plain\">, a BERT model trained with the adapter method reaches a modeling performance comparable to a fully finetuned BERT model while only requiring the training of 3.6% of the parameters.<\/span><\/p>\n<p class=\"md-end-block md-p md-focus\"><span class=\"md-plain\">Now, the question is how the adapter method compares to prefix tuning. Based on the original <\/span><span class=\"md-meta-i-c md-link\"><a href=\"https:\/\/arxiv.org\/abs\/2101.00190\"><span class=\"md-plain\">prefix tuning paper<\/span><\/a><\/span><span class=\"md-plain md-expand\">, the adapter method performed slightly worse than the prefix tuning method when 0.1% of the total number of model parameters were tuned. However, when the adapter method is used to tune 3% of the model parameters, it merely matches the performance of prefix tuning with 0.1% of the model parameters. 
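The parameter arithmetic above is easy to verify with a minimal bottleneck-adapter sketch in PyTorch. The dimensions (1,024 and 24) mirror the hypothetical example, and bias terms are omitted so the count matches the weight-only numbers:

```python
import torch
import torch.nn as nn

# Hypothetical dimensions from the example above.
embed_dim, bottleneck_dim = 1024, 24

class Adapter(nn.Module):
    """Bottleneck adapter: down-projection, nonlinearity, up-projection, skip connection."""

    def __init__(self, dim, bottleneck):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck, bias=False)  # 1024 -> 24
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim, bias=False)    # 24 -> 1024

    def forward(self, x):
        # The skip connection keeps the adapter's output in the input's space.
        return x + self.up(self.act(self.down(x)))

adapter = Adapter(embed_dim, bottleneck_dim)
n_params = sum(p.numel() for p in adapter.parameters())
print(n_params)  # 49152, i.e., 1024*24 + 24*1024
```

With roughly 49k trainable parameters per adapter, inserting two adapters into each transformer block adds only a small fraction of a block's original weight count.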
So, we may conclude that the prefix tuning method is the more efficient of the two.<\/span><\/p>\n<p>&nbsp;<\/p>\n<h2 id=\"extending-prefix-tuning-and-adapters-llama-adapter\"><strong><span data-preserver-spaces=\"true\">Extending Prefix Tuning and Adapters: LLaMA-Adapter<\/span><\/strong><\/h2>\n<p><span data-preserver-spaces=\"true\">Extending the ideas of prefix tuning and the original adapter method, researchers recently proposed LLaMA-Adapter (<\/span><a class=\"editor-rtfLink\" href=\"https:\/\/arxiv.org\/abs\/2303.16199\" target=\"_blank\" rel=\"noopener\"><span data-preserver-spaces=\"true\">Zhang et al.<\/span><\/a><span data-preserver-spaces=\"true\">), a parameter-efficient finetuning method for\u00a0<\/span><a class=\"editor-rtfLink\" href=\"https:\/\/github.com\/facebookresearch\/llama\" target=\"_blank\" rel=\"noopener\"><span data-preserver-spaces=\"true\">LLaMA<\/span><\/a><span data-preserver-spaces=\"true\">\u00a0(LLaMA is a popular GPT-alternative by Meta).<\/span><\/p>\n<p><span data-preserver-spaces=\"true\">Like\u00a0<\/span><em><span data-preserver-spaces=\"true\">prefix tuning<\/span><\/em><span data-preserver-spaces=\"true\">, the LLaMA-Adapter method prepends tunable prompt tensors to the embedded inputs. It&#8217;s worth noting that in the LLaMA-Adapter method, the prefix is learned and maintained within an embedding table rather than being provided externally. Each transformer block in the model has its own distinct learned prefix, allowing for more tailored adaptation across different model layers.<\/span><\/p>\n<p><span data-preserver-spaces=\"true\">In addition, LLaMA-Adapter introduces a zero-initialized attention mechanism coupled with gating. 
The motivation behind this so-called\u00a0<\/span><em><span data-preserver-spaces=\"true\">zero-init<\/span><\/em><span data-preserver-spaces=\"true\">\u00a0attention and gating is that adapters and prefix tuning could potentially disrupt the linguistic knowledge of the pretrained LLM by incorporating randomly initialized tensors (prefix prompts or adapter layers), resulting in unstable finetuning and high loss values during initial training phases.<\/span><\/p>\n<p><span data-preserver-spaces=\"true\">Another difference compared to prefix tuning and the original adapter method is that LLaMA-Adapter adds the learnable adaption prompts only to the\u00a0<\/span><em><span data-preserver-spaces=\"true\">L<\/span><\/em><span data-preserver-spaces=\"true\">\u00a0topmost transformer layers instead of all transformer layers. The authors argue that this approach enables more effective tuning of the language representations that capture higher-level semantic information.<\/span><\/p>\n<p><span data-preserver-spaces=\"true\">While the basic idea of the LLaMA-Adapter method is related to prefix tuning (prepending tunable soft prompts), there are some additional, subtle differences in how this is implemented. For instance, only a self-attention input&#8217;s key and value sequences are modified via the tunable soft prompt. Then, depending on the gating factor (which is set to zero at the beginning of the training), the prefix-modified attention is either used or not. 
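<\/span><\/p>\n<p><span data-preserver-spaces=\"true\">To make this concrete, here is a minimal single-head NumPy sketch of such a gated prefix attention. The function names and shapes are my own simplification, not the paper&#8217;s or Lit-LLaMA&#8217;s actual implementation:<\/span><\/p>\n

```python
import numpy as np


def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def gated_prefix_attention(q, k, v, prefix_k, prefix_v, gate):
    # q, k, v:            (seq_len, d) projections from the frozen model
    # prefix_k, prefix_v: (prefix_len, d) learned adaption-prompt projections
    # gate:               scalar gating factor, initialized to 0.0
    d = q.shape[-1]
    # regular attention over the original token sequence
    out = softmax(q @ k.T / np.sqrt(d)) @ v
    # attention over the prepended (learned) prefix keys/values
    prefix_out = softmax(q @ prefix_k.T / np.sqrt(d)) @ prefix_v
    # with gate == 0.0 at the start of training, the randomly initialized
    # prefix contributes nothing, so the pretrained model is undisturbed
    return out + gate * prefix_out
```

\n<p><span data-preserver-spaces=\"true\">With the gate at 0.0, the output is exactly the frozen model&#8217;s attention output; as training adjusts the gate, the learned prefix gradually starts to contribute.<\/span><\/p>\n<p><span data-preserver-spaces=\"true\">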
This concept is illustrated in the visualization below.<\/span><\/p>\n<p>&nbsp;<br \/>\n<img loading=\"lazy\" decoding=\"async\" class=\" wp-image-5647731 aligncenter\" src=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/04\/llama-adapter.png\" alt=\"llama-adapter outline\" width=\"814\" height=\"863\" srcset=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/04\/llama-adapter.png 1696w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/04\/llama-adapter-283x300.png 283w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/04\/llama-adapter-965x1024.png 965w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/04\/llama-adapter-1447x1536.png 1447w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/04\/llama-adapter-283x300@2x.png 566w\" sizes=\"(max-width: 814px) 100vw, 814px\" \/><br \/>\n&nbsp;<\/p>\n<p>In pseudo-code, we may express this as follows:<\/p>\n<p>&nbsp;<br \/>\n<img loading=\"lazy\" decoding=\"async\" class=\" wp-image-5647732 aligncenter\" src=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/04\/llama-adapter-code-1.png\" alt=\"llama-adapter pseudo-code\" width=\"702\" height=\"310\" \/><br \/>\n&nbsp;<\/p>\n<p><span data-preserver-spaces=\"true\">In short, the differences between LLaMA-Adapter and regular prefix tuning are that LLaMA-Adapter only modifies the topmost transformer blocks and introduces a gating mechanism to stabilize the training. While the researchers specifically experiment with LLaMA, their proposed adapter method is a general method that can also be applied to other types of LLMs (like GPT).<\/span><\/p>\n<p><span data-preserver-spaces=\"true\">Using the LLaMA-Adapter approach, the researchers were able to finetune a 7 billion parameter LLaMA model in only 1 hour (using eight A100 GPUs) on a dataset consisting of 52k instruction pairs. 
Furthermore, the finetuned LLaMA-Adapter model outperformed all other models compared in this study on question-answering tasks, while requiring the finetuning of only 1.2 million parameters (the adapter layers).<\/span><\/p>\n<p><span data-preserver-spaces=\"true\">If you want to check out the LLaMA-Adapter method, you can find the original implementation on top of the GPL-licensed LLaMA code\u00a0<\/span><a class=\"editor-rtfLink\" href=\"https:\/\/github.com\/ZrrSkywalker\/LLaMA-Adapter\" target=\"_blank\" rel=\"noopener\"><span data-preserver-spaces=\"true\">here<\/span><\/a><span data-preserver-spaces=\"true\">.<\/span><\/p>\n<p><span data-preserver-spaces=\"true\">Alternatively, if your use cases are incompatible with the GPL license, which requires you to open source all derivative works under a similar license, check out the\u00a0<\/span><a class=\"editor-rtfLink\" href=\"https:\/\/github.com\/Lightning-AI\/lit-llama\" target=\"_blank\" rel=\"noopener\"><span data-preserver-spaces=\"true\">Lit-LLaMA GitHub repository<\/span><\/a><span data-preserver-spaces=\"true\">. 
Lit-LLaMA is a readable implementation of LLaMA on top of the Apache-licensed nanoGPT code, which has less restrictive licensing terms.<\/span><\/p>\n<p><span data-preserver-spaces=\"true\">Specifically, if you are interested in finetuning a LLaMA model using the LLaMA-Adapter method, you can run the<\/span><\/p>\n<pre class=\"code-shortcode dark-theme window- collapse-false \"><code class=\"language-python\">python finetune_adapter.py<\/code><div class=\"copy-button\"><button class=\"expand-button\">Expand<\/button><button class=\"copy\">Copy<\/button><\/div><\/pre>\n<p><span class=\"md-plain\">script from the <\/span><span class=\"md-meta-i-c md-link\"><a href=\"https:\/\/github.com\/Lightning-AI\/lit-llama\"><span class=\"md-plain\">Lit-LLaMA GitHub repository<\/span><\/a><\/span><span class=\"md-plain\">.<\/span><\/p>\n<p>&nbsp;<\/p>\n<h2 id=\"conclusion\"><strong><span data-preserver-spaces=\"true\">Conclusion<\/span><\/strong><\/h2>\n<p><span data-preserver-spaces=\"true\">Finetuning pre-trained large language models (LLMs) is an effective method to tailor these models to specific business requirements and align them with target domain data. This process adjusts the model parameters using a smaller dataset relevant to the desired domain, which enables the model to learn domain-specific knowledge and vocabulary.<\/span><\/p>\n<p><span data-preserver-spaces=\"true\">However, as LLMs are &#8220;large,&#8221; updating multiple layers in a transformer model can be very expensive, so researchers started developing parameter-efficient alternatives.<\/span><\/p>\n<p><span data-preserver-spaces=\"true\">In this article, we discussed several parameter-efficient alternatives to the conventional LLM finetuning mechanism. 
In particular, we covered prepending tunable soft prompts via prefix tuning and inserting additional adapter layers.<\/span><\/p>\n<p><span data-preserver-spaces=\"true\">Finally, we discussed the recent and popular LLaMA-Adapter method that prepends tunable soft prompts and introduces an additional gating mechanism to stabilize the training.<\/span><\/p>\n<p><span data-preserver-spaces=\"true\">If you want to try this out in practice, check out\u00a0<\/span><a class=\"editor-rtfLink\" href=\"https:\/\/github.com\/Lightning-AI\/lit-llama\" target=\"_blank\" rel=\"noopener\"><span data-preserver-spaces=\"true\">the Lit-LLaMA repository<\/span><\/a><span data-preserver-spaces=\"true\">\u00a0&#8212; questions and suggestions for additional parameter-efficient finetuning methods are very welcome! (Preferably via the \ud83e\udd99<a href=\"https:\/\/discord.com\/invite\/XncpTy7DSt\" rel=\"noopener\" target=\"_blank\">lit-llama channel on Discord<\/a>) <\/span><\/p>\n<p><strong><span data-preserver-spaces=\"true\">Acknowledgments<\/span><\/strong><\/p>\n<p><span data-preserver-spaces=\"true\">I want to thank Carlos Mocholi, Luca Antiga, and Adrian Waelchli for the constructive feedback to improve the clarity of this article.<\/span><\/p>\n","protected":false}}