{"id":5648765,"date":"2023-09-14T14:03:20","date_gmt":"2023-09-14T18:03:20","guid":{"rendered":"https:\/\/lightning.ai\/pages\/?p=5648765"},"modified":"2023-09-15T11:53:12","modified_gmt":"2023-09-15T15:53:12","slug":"optimizing-llms-from-a-dataset-perspective","status":"publish","type":"post","link":"https:\/\/lightning.ai\/pages\/community\/tutorial\/optimizing-llms-from-a-dataset-perspective\/","title":{"rendered":"Optimizing LLMs from a Dataset Perspective"},"content":{"rendered":"<div class=\"takeaways card-glow p-4 my-4\"><h3 class=\"w-100 d-block\">Takeaways<\/h3><span style=\"font-weight: 400;\">Discover new research directions to improve Large Language Models (LLMs) and learn how to enhance the performance of instruction-finetuned LLMs by concentrating on higher-quality data and exploring diverse dataset sources.<\/span><\/div>\n<p><span style=\"font-weight: 400;\">This article focuses on improving the modeling performance of LLMs by finetuning them using carefully curated datasets. Specifically, this article highlights strategies that involve modifying, utilizing, or manipulating the datasets for instruction-based finetuning rather than altering the model architecture or training algorithms (the latter will be topics of a future article). This article will also explain how you can prepare your own datasets to finetune open-source LLMs.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Note that the <\/span><a href=\"https:\/\/llm-efficiency-challenge.github.io\"><span style=\"font-weight: 400;\">NeurIPS LLM Efficiency Challenge<\/span><\/a><span style=\"font-weight: 400;\"> is currently underway, aiming to train a Large Language Model on a single GPU within a 24-hour period, which is super interesting for practitioners and researchers interested in LLM efficiency. 
The techniques discussed in this article are directly relevant to this competition, and we will look at how these dataset-centric strategies could be applied within the challenge setting. Additionally, the article will offer suggestions for new experiments you might consider trying.<\/span><\/p>\n<p>&nbsp;<\/p>\n<h2 id=\"toc1\"><span style=\"font-weight: 400;\">Supervised Instruction Finetuning<\/span><\/h2>\n<p><span style=\"font-weight: 400;\">What is instruction finetuning, and why should we care?<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Instruction finetuning is a method used to improve the performance of language models like ChatGPT and <\/span><a href=\"https:\/\/arxiv.org\/abs\/2307.09288\"><span style=\"font-weight: 400;\">Llama-2-chat<\/span><\/a><span style=\"font-weight: 400;\"> by training the model to generate the desired outputs for a range of example inputs. It allows for more controlled, predictable behavior of the model in specific applications or tasks. 
Also, it can enhance the reliability, specificity, and safety of AI systems in real-world use cases.<\/span><\/p>\n<div id=\"attachment_5648769\" style=\"width: 631px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-5648769\" class=\"wp-image-5648769\" src=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/09\/image1.png\" alt=\"\" width=\"621\" height=\"317\" srcset=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/09\/image1-300x154.png 300w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/09\/image1-1024x525.png 1024w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/09\/image1-300x154@2x.png 600w\" sizes=\"(max-width: 621px) 100vw, 621px\" \/><p id=\"caption-attachment-5648769\" class=\"wp-caption-text\">Annotated figure from <a href=\"https:\/\/arxiv.org\/abs\/2203.02155\">InstructGPT paper<\/a><\/p><\/div>\n<p><span style=\"font-weight: 400;\">Instruction finetuning uses a dataset consisting of instruction-response pairs to improve an LLM&#8217;s instruction-following capabilities. 
Such a dataset for instruction finetuning typically consists of three components:<\/span><\/p>\n<ol>\n<li><span style=\"font-weight: 400;\"> Instruction text<\/span><\/li>\n<li><span style=\"font-weight: 400;\"> Input text (optional)<\/span><\/li>\n<li><span style=\"font-weight: 400;\"> Output text<\/span><\/li>\n<\/ol>\n<p><span style=\"font-weight: 400;\">The example below lists two training examples, one without and one with an optional input text:<\/span><\/p>\n<div id=\"attachment_5648782\" style=\"width: 476px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-5648782\" class=\"wp-image-5648782\" src=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/09\/LLM-dataset-research-image3.png\" alt=\"\" width=\"466\" height=\"351\" srcset=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/09\/LLM-dataset-research-image3.png 890w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/09\/LLM-dataset-research-image3-300x226.png 300w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/09\/LLM-dataset-research-image3-300x226@2x.png 600w\" sizes=\"(max-width: 466px) 100vw, 466px\" \/><p id=\"caption-attachment-5648782\" class=\"wp-caption-text\">Instruction finetuning format<\/p><\/div>\n<p><span style=\"font-weight: 400;\">LLMs are then finetuned on these instruction datasets via next-token prediction (similar to pretraining). 
The difference from pretraining is that the model sees the whole instruction and input text as a context before it&#8217;s tasked to carry out the next-token prediction to generate the output text in an autoregressive fashion, illustrated below.<\/span><\/p>\n<div id=\"attachment_5648781\" style=\"width: 833px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-5648781\" class=\"wp-image-5648781\" src=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/09\/LLM-dataset-research-image2.png\" alt=\"\" width=\"823\" height=\"523\" srcset=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/09\/LLM-dataset-research-image2.png 1798w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/09\/LLM-dataset-research-image2-300x191.png 300w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/09\/LLM-dataset-research-image2-1024x652.png 1024w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/09\/LLM-dataset-research-image2-1536x977.png 1536w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/09\/LLM-dataset-research-image2-300x191@2x.png 600w\" sizes=\"(max-width: 823px) 100vw, 823px\" \/><p id=\"caption-attachment-5648781\" class=\"wp-caption-text\">Finetuning LLMs on instruction datasets<\/p><\/div>\n<p><span style=\"font-weight: 400;\">This above-mentioned process for finetuning an LLM to generate the desired output in an iterative, token-wise fashion is also referred to as supervised finetuning.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">In practice, there is an additional optional finetuning stage following supervised finetuning, which uses additional preference data and ranking labels from human annotators who compare responses generated by LLMs. 
This process is also known as reinforcement learning with human feedback (RLHF), but it is out-of-scope for this article, which focuses on the instruction datasets themselves. (However, I have an optional article on RLHF <\/span><a href=\"https:\/\/magazine.sebastianraschka.com\/p\/llm-training-rlhf-and-its-alternatives\"><span style=\"font-weight: 400;\">here<\/span><\/a><span style=\"font-weight: 400;\"> if you want to learn more.)<\/span><\/p>\n<p>&nbsp;<\/p>\n<h2 id=\"toc2\"><span style=\"font-weight: 400;\">The Finetuning Pipeline and Dataset Origins<\/span><\/h2>\n<p><span style=\"font-weight: 400;\">When finetuning LLMs, datasets for instruction finetuning can be sourced in multiple ways:<\/span><\/p>\n<ol>\n<li><b> Human-created:<\/b><span style=\"font-weight: 400;\"> Expert annotators can provide explicit instructions and feedback, creating datasets for instruction finetuning. This is particularly useful for domain-specific tasks or for reducing particular biases or unwanted behaviors.<\/span><\/li>\n<li><b> LLM-generated: <\/b><span style=\"font-weight: 400;\">We can generate a vast amount of potential input-output pairs using an existing LLM (if the terms of service permit). These can then be refined or rated by humans for quality and then used to finetune a new LLM. 
This method is usually more efficient than the abovementioned human-created approach because an available LLM, such as GPT-4 (via the API interface), can generate a large number of potential examples in a short time.\u00a0<\/span><\/li>\n<\/ol>\n<p><span style=\"font-weight: 400;\">The LLM finetuning pipeline using human-created or LLM-generated data is summarized in the recent and excellent <\/span><a href=\"https:\/\/arxiv.org\/abs\/2308.10792\"><i><span style=\"font-weight: 400;\">Instruction Tuning for Large Language Models<\/span><\/i><\/a><span style=\"font-weight: 400;\"> survey:<\/span><\/p>\n<div id=\"attachment_5648784\" style=\"width: 740px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-5648784\" class=\"wp-image-5648784\" src=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/09\/LLM-dataset-research-image5.png\" alt=\"\" width=\"730\" height=\"303\" srcset=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/09\/LLM-dataset-research-image5.png 1999w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/09\/LLM-dataset-research-image5-300x125.png 300w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/09\/LLM-dataset-research-image5-1024x426.png 1024w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/09\/LLM-dataset-research-image5-1536x639.png 1536w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/09\/LLM-dataset-research-image5-300x125@2x.png 600w\" sizes=\"(max-width: 730px) 100vw, 730px\" \/><p id=\"caption-attachment-5648784\" class=\"wp-caption-text\">Figure from <a href=\"https:\/\/arxiv.org\/abs\/2308.10792\">Instruction Tuning for Large Language Models paper<\/a><\/p><\/div>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400;\">Additionally, we can also potentially combine both human-created and LLM-generated instruction data to get the best of both 
worlds.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">The upcoming sections will discuss LLM-generated and human-created datasets for instruction finetuning in more detail, including recent research highlights.<\/span><\/p>\n<p>&nbsp;<\/p>\n<h2 id=\"toc3\"><span style=\"font-weight: 400;\">LLM-generated Datasets<\/span><\/h2>\n<p><span style=\"font-weight: 400;\">Dataset labeling has long been a bottleneck in machine learning. For human annotators, even simple labeling tasks like categorizing an image as &#8220;cat&#8221; or &#8220;dog&#8221; become laborious when they have to be done at scale.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Tasks requiring long-form text annotations can be even more time-consuming and challenging. So, a lot of effort has been devoted to automatically generating datasets for instruction finetuning using existing LLMs.<\/span><\/p>\n<p>&nbsp;<\/p>\n<p><b>Self-Instruct<\/b><\/p>\n<p><span style=\"font-weight: 400;\">One of the most prominent and widely used methods for LLM-generated datasets is <\/span><a href=\"https:\/\/arxiv.org\/abs\/2212.10560\"><span style=\"font-weight: 400;\">Self-Instruct<\/span><\/a><span style=\"font-weight: 400;\">.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">So, how does it work? 
Briefly, it involves four stages:<\/span><\/p>\n<ol>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Seed task pool with a set of human-written instructions (175 in this case) and sample instructions;<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Use a pretrained LLM (like GPT-3) to determine the task category;<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Given the new instruction, let a pretrained LLM generate the response;<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Collect, prune, and filter the responses before adding them to the task pool.<\/span><\/li>\n<\/ol>\n<div id=\"attachment_5648783\" style=\"width: 908px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-5648783\" class=\"wp-image-5648783 \" src=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/09\/LLM-dataset-research-image4.jpg\" alt=\"\" width=\"898\" height=\"406\" srcset=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/09\/LLM-dataset-research-image4.jpg 1456w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/09\/LLM-dataset-research-image4-300x136.jpg 300w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/09\/LLM-dataset-research-image4-1024x463.jpg 1024w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/09\/LLM-dataset-research-image4-300x136@2x.jpg 600w\" sizes=\"(max-width: 898px) 100vw, 898px\" \/><p id=\"caption-attachment-5648783\" class=\"wp-caption-text\">Annotated figure from <a href=\"https:\/\/arxiv.org\/abs\/2212.10560\">Self-Instruct paper<\/a><\/p><\/div>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400;\">An early popular application of Self-Instruct was the <\/span><a href=\"https:\/\/github.com\/gururise\/AlpacaDataCleaned\"><span 
style=\"font-weight: 400;\">Alpaca dataset<\/span><\/a><span style=\"font-weight: 400;\">, which consists of 52k LLM-generated instruction-response pairs. Alpaca was used to create the first finetuned <\/span><a href=\"https:\/\/arxiv.org\/abs\/2302.13971\"><span style=\"font-weight: 400;\">Llama v1<\/span><\/a><span style=\"font-weight: 400;\"> model earlier this year.<\/span><\/p>\n<p>&nbsp;<\/p>\n<p><b>Backtranslation<\/b><\/p>\n<p><span style=\"font-weight: 400;\">Another interesting approach involves working backward from the responses and generating the corresponding instructions via LLMs.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">In other words, rather than gathering datasets for instruction finetuning from human writers, it&#8217;s possible to employ an LLM to produce instruction-response pairs (also known as distillation).\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">In a paper titled <\/span><a href=\"https:\/\/arxiv.org\/abs\/2308.06259\"><i><span style=\"font-weight: 400;\">Self-Alignment with Instruction Backtranslation<\/span><\/i><\/a><span style=\"font-weight: 400;\">, researchers refined LLMs via &#8220;instruction backtranslation&#8221; and found that the resulting models surpass those trained on distillation datasets like Alpaca.<\/span><\/p>\n<div id=\"attachment_5648786\" style=\"width: 695px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-5648786\" class=\"wp-image-5648786\" src=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/09\/LLM-dataset-research-image7.png\" alt=\"\" width=\"685\" height=\"658\" srcset=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/09\/LLM-dataset-research-image7.png 1624w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/09\/LLM-dataset-research-image7-300x288.png 300w, 
https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/09\/LLM-dataset-research-image7-1024x984.png 1024w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/09\/LLM-dataset-research-image7-1536x1475.png 1536w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/09\/LLM-dataset-research-image7-300x288@2x.png 600w\" sizes=\"(max-width: 685px) 100vw, 685px\" \/><p id=\"caption-attachment-5648786\" class=\"wp-caption-text\">Annotated figures from <a href=\"https:\/\/arxiv.org\/abs\/2308.06259\">Self-Alignment with Instruction Backtranslation paper<\/a><\/p><\/div>\n<p>&nbsp;<\/p>\n<p><b>NeurIPS Efficiency Challenge Rules<\/b><\/p>\n<p><span style=\"font-weight: 400;\">Note that the <\/span><a href=\"https:\/\/llm-efficiency-challenge.github.io\"><span style=\"font-weight: 400;\">NeurIPS LLM Efficiency Challenge<\/span><\/a><span style=\"font-weight: 400;\">, which is centered around training 1 LLM for 1 Day on 1 GPU, does not permit LLM-generated datasets.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">So, in the next section, <\/span><i><span style=\"font-weight: 400;\">High-Quality Datasets<\/span><\/i><span style=\"font-weight: 400;\">, we will focus on human-generated instruction datasets that we can use as an alternative.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">If you are interested in participating in the <\/span><a href=\"https:\/\/llm-efficiency-challenge.github.io\"><span style=\"font-weight: 400;\">NeurIPS LLM Efficiency Challenge<\/span><\/a><span style=\"font-weight: 400;\">, I&#8217;ve written a <\/span><a href=\"https:\/\/github.com\/Lightning-AI\/lit-gpt\/blob\/main\/tutorials\/neurips_challenge_quickstart.md\"><span style=\"font-weight: 400;\">quick starter tutorial here<\/span><\/a><span style=\"font-weight: 400;\">.<\/span><\/p>\n<p>&nbsp;<\/p>\n<p><b>A Note About LLM-generated Datasets and Imitation Models<\/b><\/p>\n<p><span style=\"font-weight: 400;\">Before we jump into the discussion of 
human-generated datasets for instruction finetuning, I wanted to share a brief word of caution regarding LLM-generated datasets. Yes, generating datasets via LLMs may sound too good to be true, so it is important to evaluate LLMs finetuned on LLM-generated datasets extra carefully.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">For instance, in a recent <\/span><a href=\"https:\/\/arxiv.org\/abs\/2305.15717\"><i><span style=\"font-weight: 400;\">The False Promise of Imitating Proprietary LLMs<\/span><\/i><\/a><span style=\"font-weight: 400;\"> paper, researchers observed that crowd workers gave high ratings to LLMs trained on LLM-generated data. However, these so-called &#8220;imitation models&#8221; primarily replicated the style of the upstream LLMs they were trained on rather than their factual accuracy.<\/span><\/p>\n<p>&nbsp;<\/p>\n<h2 id=\"toc4\"><span style=\"font-weight: 400;\">High-quality Datasets: Less May Be More<\/span><\/h2>\n<p><span style=\"font-weight: 400;\">In the previous section, we discussed datasets generated by LLMs. 
Now, let&#8217;s switch gears and examine a high-quality, human-generated dataset, which is also allowed in the <\/span><a href=\"https:\/\/llm-efficiency-challenge.github.io\"><span style=\"font-weight: 400;\">NeurIPS LLM Efficiency Challenge<\/span><\/a><span style=\"font-weight: 400;\">.<\/span><\/p>\n<p>&nbsp;<\/p>\n<p><b>LIMA<\/b><\/p>\n<p><a href=\"https:\/\/arxiv.org\/abs\/2305.11206\"><i><span style=\"font-weight: 400;\">The LIMA: Less Is More for Alignment<\/span><\/i><\/a><span style=\"font-weight: 400;\"> paper shows that quality trumps quantity when it comes to instruction finetuning datasets.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">In this study, researchers carefully selected 1,000 instruction pairs and used supervised finetuning to train the 65-billion-parameter Llama-v1 model on them. The resulting model is known as LIMA.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Notably, other finetuned Llama models, such as Alpaca, were trained on a considerably larger dataset of 52,000 LLM-generated instruction pairs. 
In selected benchmarks, LIMA outperformed models that employed Reinforcement Learning with Human Feedback (RLHF) methods, including ChatGPT and GPT-3.5.<\/span><\/p>\n<div id=\"attachment_5648785\" style=\"width: 692px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-5648785\" class=\"wp-image-5648785\" src=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/09\/LLM-dataset-research-image6.jpg\" alt=\"\" width=\"682\" height=\"272\" srcset=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/09\/LLM-dataset-research-image6.jpg 1456w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/09\/LLM-dataset-research-image6-300x119.jpg 300w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/09\/LLM-dataset-research-image6-1024x407.jpg 1024w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/09\/LLM-dataset-research-image6-300x119@2x.jpg 600w\" sizes=\"(max-width: 682px) 100vw, 682px\" \/><p id=\"caption-attachment-5648785\" class=\"wp-caption-text\">Annotated figure from the <a href=\"https:\/\/arxiv.org\/abs\/2305.11206\">LIMA paper\u00a0<\/a><\/p><\/div>\n<p><span style=\"font-weight: 400;\">The next section will show you how to get started with open-source LLMs and finetune these models on LIMA.<\/span><\/p>\n<p>&nbsp;<\/p>\n<h2 id=\"toc5\"><span style=\"font-weight: 400;\">Finetuning LLMs on LIMA<\/span><\/h2>\n<p><span style=\"font-weight: 400;\">This section explains how to finetune open-source LLMs on instruction datasets like LIMA using the <\/span><a href=\"https:\/\/github.com\/Lightning-AI\/lit-gpt\"><span style=\"font-weight: 400;\">Lit-GPT repository<\/span><\/a><span style=\"font-weight: 400;\">.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">(Note that the <\/span><a href=\"https:\/\/llm-efficiency-challenge.github.io\"><span style=\"font-weight: 400;\">NeurIPS LLM Efficiency Challenge<\/span><\/a><span 
style=\"font-weight: 400;\"> organizers cleared LIMA for the competition. The NeurIPS LLM Efficiency Challenge organizers also selected Lit-GPT as the starter kit since the code is relatively easy to use and customize, which is an essential prerequisite for exploring new research directions.)<\/span><\/p>\n<p><span style=\"font-weight: 400;\">As of this writing, the currently supported models in Lit-GPT are the following:<\/span><\/p>\n<table>\n<tbody>\n<tr>\n<td><b>Model and usage<\/b><\/td>\n<td><b>Reference<\/b><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-weight: 400;\">Meta AI <\/span><a href=\"https:\/\/github.com\/Lightning-AI\/lit-gpt\/blob\/main\/tutorials\/download_llama_2.md\"><span style=\"font-weight: 400;\">Llama 2<\/span><\/a><\/td>\n<td><a href=\"https:\/\/arxiv.org\/abs\/2307.09288\"><span style=\"font-weight: 400;\">Touvron et al. 2023<\/span><\/a><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-weight: 400;\">Stability AI <\/span><a href=\"https:\/\/github.com\/Lightning-AI\/lit-gpt\/blob\/main\/tutorials\/download_freewilly_2.md\"><span style=\"font-weight: 400;\">FreeWilly2<\/span><\/a><\/td>\n<td><a href=\"https:\/\/stability.ai\/blog\/stable-beluga-large-instruction-fine-tuned-models\"><span style=\"font-weight: 400;\">Stability AI 2023<\/span><\/a><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-weight: 400;\">Stability AI StableCode<\/span><\/td>\n<td><a href=\"https:\/\/stability.ai\/blog\/stablecode-llm-generative-ai-coding\"><span style=\"font-weight: 400;\">Stability AI 2023<\/span><\/a><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-weight: 400;\">TII UAE <\/span><a href=\"https:\/\/github.com\/Lightning-AI\/lit-gpt\/blob\/main\/tutorials\/download_falcon.md\"><span style=\"font-weight: 400;\">Falcon<\/span><\/a><\/td>\n<td><a href=\"https:\/\/falconllm.tii.ae\/\"><span style=\"font-weight: 400;\">TII 2023<\/span><\/a><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-weight: 400;\">OpenLM Research <\/span><a 
href=\"https:\/\/github.com\/Lightning-AI\/lit-gpt\/blob\/main\/tutorials\/download_openllama.md\"><span style=\"font-weight: 400;\">OpenLLaMA<\/span><\/a><\/td>\n<td><a href=\"https:\/\/github.com\/openlm-research\/open_llama\"><span style=\"font-weight: 400;\">Geng &amp; Liu 2023<\/span><\/a><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-weight: 400;\">LMSYS <\/span><a href=\"https:\/\/github.com\/Lightning-AI\/lit-gpt\/blob\/main\/tutorials\/download_vicuna.md\"><span style=\"font-weight: 400;\">Vicuna<\/span><\/a><\/td>\n<td><a href=\"https:\/\/lmsys.org\/blog\/2023-03-30-vicuna\/\"><span style=\"font-weight: 400;\">Li et al. 2023<\/span><\/a><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-weight: 400;\">LMSYS <\/span><a href=\"https:\/\/github.com\/Lightning-AI\/lit-gpt\/blob\/main\/tutorials\/download_longchat.md\"><span style=\"font-weight: 400;\">LongChat<\/span><\/a><\/td>\n<td><a href=\"https:\/\/lmsys.org\/blog\/2023-06-29-longchat\/\"><span style=\"font-weight: 400;\">LongChat Team 2023<\/span><\/a><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-weight: 400;\">Together <\/span><a href=\"https:\/\/github.com\/Lightning-AI\/lit-gpt\/blob\/main\/tutorials\/download_redpajama_incite.md\"><span style=\"font-weight: 400;\">RedPajama-INCITE<\/span><\/a><\/td>\n<td><a href=\"https:\/\/together.ai\/blog\/redpajama-models-v1\"><span style=\"font-weight: 400;\">Together 2023<\/span><\/a><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-weight: 400;\">EleutherAI <\/span><a href=\"https:\/\/github.com\/Lightning-AI\/lit-gpt\/blob\/main\/tutorials\/download_pythia.md\"><span style=\"font-weight: 400;\">Pythia<\/span><\/a><\/td>\n<td><a href=\"https:\/\/arxiv.org\/abs\/2304.01373\"><span style=\"font-weight: 400;\">Biderman et al. 
2023<\/span><\/a><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-weight: 400;\">StabilityAI <\/span><a href=\"https:\/\/github.com\/Lightning-AI\/lit-gpt\/blob\/main\/tutorials\/download_stablelm.md\"><span style=\"font-weight: 400;\">StableLM<\/span><\/a><\/td>\n<td><a href=\"https:\/\/github.com\/Stability-AI\/StableLM\"><span style=\"font-weight: 400;\">Stability AI 2023<\/span><\/a><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-weight: 400;\">Platypus<\/span><\/td>\n<td><a href=\"https:\/\/arxiv.org\/abs\/2308.07317\"><span style=\"font-weight: 400;\">Lee, Hunter, and Ruiz 2023<\/span><\/a><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-weight: 400;\">NousResearch Nous-Hermes<\/span><\/td>\n<td><a href=\"https:\/\/huggingface.co\/NousResearch\"><span style=\"font-weight: 400;\">Org page<\/span><\/a><\/td>\n<\/tr>\n<tr>\n<td><span style=\"font-weight: 400;\">Meta AI <\/span><a href=\"https:\/\/github.com\/Lightning-AI\/lit-gpt\/blob\/main\/tutorials\/download_code_llama.md\"><span style=\"font-weight: 400;\">Code Llama<\/span><\/a><\/td>\n<td><a href=\"https:\/\/arxiv.org\/abs\/2308.12950\"><span style=\"font-weight: 400;\">Rozi\u00e8re et al. 
2023<\/span><\/a><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400;\">For this brief walkthrough, we will use the 7B parameter <\/span><a href=\"https:\/\/github.com\/Lightning-AI\/lit-gpt#-lit-gpt-1\"><span style=\"font-weight: 400;\">Llama 2 base model<\/span><\/a><span style=\"font-weight: 400;\"> and finetune it on LIMA.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Assuming you have cloned the Lit-GPT repository, you can get started via the following three steps:<\/span><\/p>\n<p>&nbsp;<\/p>\n<p><strong>1) Download and prepare the model:<\/strong><\/p>\n<pre class=\"hljs collapse-false language-python\">export HF_TOKEN=your_token\r\npython scripts\/download.py \\\r\n  --repo_id meta-llama\/Llama-2-7b-hf<\/pre>\n<pre class=\"hljs collapse-false language-python\">python scripts\/convert_hf_checkpoint.py \\\r\n  --checkpoint_dir checkpoints\/meta-llama\/Llama-2-7b-hf<\/pre>\n<p>&nbsp;<\/p>\n<p><strong>2) Prepare the dataset:<\/strong><\/p>\n<pre class=\"hljs collapse-false language-python\"><span style=\"font-weight: 400;\">python scripts\/prepare_lima.py \\\r\n  --checkpoint_dir checkpoints\/meta-llama\/Llama-2-7b-hf<\/span><\/pre>\n<p>&nbsp;<\/p>\n<p><strong>3) Finetune the model using low-rank adaptation (LoRA):<\/strong><\/p>\n<pre class=\"hljs collapse-false language-python\"><span style=\"font-weight: 400;\">python finetune\/lora.py \\\r\n  --checkpoint_dir checkpoints\/meta-llama\/Llama-2-7b-hf \\\r\n  --data_dir data\/lima<\/span><\/pre>\n<p><span style=\"font-weight: 400;\">Note that the <\/span><span style=\"font-weight: 400;\">--checkpoint_dir<\/span><span style=\"font-weight: 400;\"> argument is required for preparing the dataset in step 2 because the dataset preparation is model-dependent. 
Different LLMs may use different tokenizers and special tokens, so it&#8217;s important to prepare the dataset accordingly.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">I am skipping a detailed explanation of the LoRA finetuning procedure to keep this article focused on the dataset perspective. However, if you are interested in learning more, see my article <\/span><a href=\"https:\/\/lightning.ai\/pages\/community\/finetuning-falcon-efficiently\/\"><span style=\"font-weight: 400;\">Finetuning Falcon LLMs More Efficiently With LoRA and Adapters<\/span><\/a><span style=\"font-weight: 400;\">.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">You may also find my <\/span><a href=\"https:\/\/github.com\/Lightning-AI\/lit-gpt\/blob\/main\/tutorials\/neurips_challenge_quickstart.md\"><span style=\"font-weight: 400;\">NeurIPS 2023 LLM Efficiency Challenge Quickstart Guide<\/span><\/a><span style=\"font-weight: 400;\"> helpful, where I walk through the setup, finetuning, and model evaluation step by step.<\/span><\/p>\n<p>&nbsp;<\/p>\n<hr \/>\n<p>&nbsp;<\/p>\n<p><b>Tip<\/b><\/p>\n<p><span style=\"font-weight: 400;\">According <\/span><a href=\"https:\/\/llm-efficiency-challenge.github.io\/question\"><span style=\"font-weight: 400;\">to the official competition rules<\/span><\/a><span style=\"font-weight: 400;\">, the maximum context length used for the evaluation is 2,048 tokens. 
Hence, I recommend preparing the dataset with a maximum length of 2,048 tokens:<\/span><\/p>\n<pre class=\"hljs collapse-false language-python\"><span style=\"font-weight: 400;\">python scripts\/prepare_lima.py \\\r\n    --checkpoint_dir checkpoints\/meta-llama\/Llama-2-7b-hf \\\r\n    --max_seq_length 2048<\/span><\/pre>\n<p><span style=\"font-weight: 400;\">Alternatively, you can edit the <\/span><a href=\"https:\/\/github.com\/Lightning-AI\/lit-gpt\/blob\/main\/finetune\/lora.py#L37\"><span style=\"font-weight: 400;\">finetune\/lora.py<\/span><span style=\"font-weight: 400;\"> file<\/span><\/a><span style=\"font-weight: 400;\"> and change <\/span><span style=\"font-weight: 400; color: #3366ff;\">override_max_seq_length = None<\/span><span style=\"font-weight: 400;\"> to <\/span><span style=\"font-weight: 400; color: #3366ff;\">override_max_seq_length = 2048<\/span><span style=\"font-weight: 400;\"> to reduce the GPU memory requirements.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">In addition, I suggest changing the <\/span><span style=\"font-weight: 400; color: #3366ff;\">max_iter<\/span><span style=\"font-weight: 400;\"> setting to <\/span><span style=\"font-weight: 400; color: #3366ff;\">max_iter = 1000<\/span><span style=\"font-weight: 400;\"> to finetune for ~1 pass over the LIMA dataset, which consists of 1k training examples.<\/span><\/p>\n<div id=\"attachment_5648788\" style=\"width: 814px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-5648788\" class=\"wp-image-5648788\" src=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/09\/LLM-dataset-research-image9.png\" alt=\"\" width=\"804\" height=\"393\" srcset=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/09\/LLM-dataset-research-image9.png 1626w, 
https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/09\/LLM-dataset-research-image9-300x147.png 300w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/09\/LLM-dataset-research-image9-1024x501.png 1024w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/09\/LLM-dataset-research-image9-1536x752.png 1536w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/09\/LLM-dataset-research-image9-300x147@2x.png 600w\" sizes=\"(max-width: 804px) 100vw, 804px\" \/><p id=\"caption-attachment-5648788\" class=\"wp-caption-text\">Selecting the number of finetuning iterations<\/p><\/div>\n<p>&nbsp;<\/p>\n<hr \/>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400;\">For reference, finetuning a 7B parameter model on 52k instruction pairs, such as in Alpaca, takes about 1 hour on an A100 GPU when using LoRA with default settings. Note that LIMA is 50x smaller than Alpaca, so finetuning will only take a few minutes.<\/span><\/p>\n<div id=\"attachment_5648787\" style=\"width: 617px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-5648787\" class=\"wp-image-5648787\" src=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/09\/LLM-dataset-research-image8.png\" alt=\"\" width=\"607\" height=\"314\" srcset=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/09\/LLM-dataset-research-image8.png 1264w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/09\/LLM-dataset-research-image8-300x155.png 300w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/09\/LLM-dataset-research-image8-1024x530.png 1024w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/09\/LLM-dataset-research-image8-300x155@2x.png 600w\" sizes=\"(max-width: 607px) 100vw, 607px\" \/><p id=\"caption-attachment-5648787\" class=\"wp-caption-text\">Finetuning a 7B model on 52k data points via <a 
href=\"https:\/\/lightning.ai\/pages\/community\/finetuning-falcon-efficiently\/\">Finetuning Falcon LLMs More Efficiently With LoRA and Adapters<\/a><\/p><\/div>\n<p>&nbsp;<\/p>\n<h2 id=\"toc6\"><span style=\"font-weight: 400;\">Available Models and Datasets in Lit-GPT<\/span><\/h2>\n<p><span style=\"font-weight: 400;\">As of this writing, multiple finetuning datasets are supported in <\/span><a href=\"https:\/\/github.com\/Lightning-AI\/lit-gpt\"><span style=\"font-weight: 400;\">Lit-GPT<\/span><\/a><span style=\"font-weight: 400;\">:<\/span><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-5648791\" src=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/09\/LLM-dataset-research-image12.png\" alt=\"\" width=\"736\" height=\"337\" srcset=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/09\/LLM-dataset-research-image12.png 1984w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/09\/LLM-dataset-research-image12-300x138.png 300w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/09\/LLM-dataset-research-image12-1024x470.png 1024w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/09\/LLM-dataset-research-image12-1536x705.png 1536w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/09\/LLM-dataset-research-image12-300x138@2x.png 600w\" sizes=\"(max-width: 736px) 100vw, 736px\" \/><\/p>\n<p><span style=\"font-weight: 400;\">The <\/span><a href=\"https:\/\/github.com\/Lightning-AI\/lit-gpt\/blob\/main\/tutorials\/prepare_dataset.md#dolly\"><span style=\"font-weight: 400;\">Dolly<\/span><\/a><span style=\"font-weight: 400;\"> and <\/span><a href=\"https:\/\/github.com\/Lightning-AI\/lit-gpt\/blob\/main\/tutorials\/prepare_dataset.md#lima\"><span style=\"font-weight: 400;\">LIMA<\/span><\/a><span style=\"font-weight: 400;\"> datasets are human-generated and should thus be fine for use in the <\/span><a 
href=\"https:\/\/llm-efficiency-challenge.github.io\"><span style=\"font-weight: 400;\">NeurIPS LLM Efficiency Challenge<\/span><\/a><span style=\"font-weight: 400;\">.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Additionally, if you are interested in using different datasets to customize LLMs for your projects, the next section will briefly explain how this works.<\/span><\/p>\n<p>&nbsp;<\/p>\n<h2 id=\"toc7\"><span style=\"font-weight: 400;\">Preparing New and Custom Datasets<\/span><\/h2>\n<p><span style=\"font-weight: 400;\">In addition to the existing datasets mentioned above, you might be interested in adding new datasets or using your own datasets to finetune custom open-source LLMs.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">There are two main ways to prepare a dataset for the LLMs in Lit-GPT:<\/span><\/p>\n<ol>\n<li><span style=\"font-weight: 400;\"> Using the <code><span style=\"color: #3366ff;\">scripts\/prepare_csv.py<\/span><\/code> script to read an instruction dataset from a CSV file.<\/span><\/li>\n<li><span style=\"font-weight: 400;\"> Creating a custom <code><span style=\"color: #3366ff;\">scripts\/prepare_dataset.py<\/span><\/code> script similar to LIMA, which we used earlier.<\/span><\/li>\n<\/ol>\n<p>(Thanks to the <a href=\"https:\/\/github.com\/Lightning-AI\/lit-gpt\/pull\/462\">community contribution<\/a> via <a href=\"https:\/\/github.com\/Anindyadeep\">@Anindyadeep<\/a> that helped enable CSV file support!)<\/p>\n<p><span style=\"font-weight: 400;\">The easiest way to prepare a new dataset is to read it from a CSV file using the <code><span style=\"color: #3366ff;\">scripts\/prepare_csv.py<\/span><\/code> script in Lit-GPT. 
All you need is a CSV file that has the three column headers shown below:<\/span><\/p>\n<div id=\"attachment_5648789\" style=\"width: 670px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-5648789\" class=\"wp-image-5648789\" src=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/09\/LLM-dataset-research-image10.png\" alt=\"\" width=\"660\" height=\"310\" srcset=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/09\/LLM-dataset-research-image10.png 1470w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/09\/LLM-dataset-research-image10-300x141.png 300w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/09\/LLM-dataset-research-image10-1024x482.png 1024w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/09\/LLM-dataset-research-image10-300x141@2x.png 600w\" sizes=\"(max-width: 660px) 100vw, 660px\" \/><p id=\"caption-attachment-5648789\" class=\"wp-caption-text\">Required column headers for the prepare_csv.py script<\/p><\/div>\n<p><span style=\"font-weight: 400;\">Assuming you exported this dataset as <code><span style=\"color: #3366ff;\">MyDataset.csv<\/span><\/code>, you can then prepare and finetune the model as follows:<\/span><\/p>\n<p>&nbsp;<\/p>\n<p><strong>1) Prepare the dataset:<\/strong><\/p>\n<pre class=\"hljs collapse-false language-python\">python scripts\/prepare_csv.py \\\r\n\u00a0\u00a0\u00a0--csv_dir MyDataset.csv \\\r\n\u00a0\u00a0\u00a0--checkpoint_dir checkpoints\/meta-llama\/Llama-2-7b-hf<\/pre>\n<p>&nbsp;<\/p>\n<p><strong>2) Finetune the model using low-rank adaptation (LoRA):<\/strong><\/p>\n<pre class=\"hljs collapse-false language-python\">python finetune\/lora.py \\\r\n\u00a0\u00a0\u00a0--data_dir data\/csv \\\r\n\u00a0\u00a0\u00a0--checkpoint_dir checkpoints\/meta-llama\/Llama-2-7b-hf<\/pre>\n<p><span style=\"font-weight: 400;\">There are additional options for determining the random 
seed or train\/test split, which you can access via\u00a0<\/span><\/p>\n<pre class=\"hljs collapse-false language-python\"><span style=\"font-weight: 400;\">python scripts\/prepare_csv.py --help<\/span><\/pre>\n<p><span style=\"font-weight: 400;\">If you are interested in the second option, creating a <\/span><span style=\"font-weight: 400;\">prepare_dataset.py<\/span><span style=\"font-weight: 400;\"> script similar to LIMA, I added an explanation to the <\/span><a href=\"https:\/\/github.com\/Lightning-AI\/lit-gpt\/blob\/main\/tutorials\/prepare_dataset.md#preparing-custom-datasets-for-instruction-finetuning\"><span style=\"font-weight: 400;\">Lit-GPT documentation here<\/span><\/a><span style=\"font-weight: 400;\">.<\/span><\/p>\n<p>&nbsp;<\/p>\n<h2 id=\"toc8\"><span style=\"font-weight: 400;\">Additional Datasets to Consider<\/span><\/h2>\n<p><span style=\"font-weight: 400;\">The previous section covered how to prepare custom datasets for open-source LLMs in Lit-GPT. If you don&#8217;t have a dataset of your own but want to experiment with existing datasets (for example, the NeurIPS LLM Efficiency Challenge is restricted to publicly available datasets), here are a few pointers for datasets to explore.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">With the <\/span><a href=\"https:\/\/llm-efficiency-challenge.github.io\"><span style=\"font-weight: 400;\">NeurIPS LLM Efficiency Challenge<\/span><\/a><span style=\"font-weight: 400;\"> in mind, the list focuses on human-generated English datasets, not LLM-generated datasets.<\/span><\/p>\n<p><a href=\"https:\/\/huggingface.co\/datasets\/OpenAssistant\/oasst1\"><b>Open Assistant<\/b><\/a><span style=\"font-weight: 400;\"> (multilingual) is a collection of assistant-like conversations created and annotated by humans. It contains 161,443 messages in 35 languages, enriched with 461,292 quality evaluations, resulting in more than 10,000 comprehensively annotated conversation trees. 
This dataset results from a global crowdsourcing initiative that engaged over 13,500 volunteers.<\/span><\/p>\n<p><a href=\"https:\/\/arxiv.org\/abs\/2104.08773\"><b>Natural Instructions<\/b><\/a><span style=\"font-weight: 400;\"> is a handcrafted English instruction dataset with 193K entries, spanning 61 unique NLP tasks.<\/span><\/p>\n<p><a href=\"https:\/\/arxiv.org\/abs\/2110.08207\"><b>P3 (Public Pool of Prompts)<\/b><\/a><span style=\"font-weight: 400;\"> is an instruction finetuning dataset constructed using 170 English NLP datasets and 2,052 English prompts. Prompts, sometimes named task templates, map a data instance in a conventional NLP task (e.g., question answering, text classification) to a natural language input-output pair.<\/span><\/p>\n<p><a href=\"https:\/\/arxiv.org\/abs\/2301.13688\"><b>Flan 2021<\/b><\/a><span style=\"font-weight: 400;\"> is an English instruction dataset compilation created by converting 62 popular NLP benchmarks (including SNLI, AG News, and others) into pairs of language inputs and outputs.<\/span><\/p>\n<p>&nbsp;<\/p>\n<h2 id=\"toc9\"><span style=\"font-weight: 400;\">Research Directions to Explore<\/span><\/h2>\n<p><span style=\"font-weight: 400;\">Now that we have covered the why and how of instruction finetuning, what interesting research directions can we explore to boost the performance of open-source LLMs?<\/span><\/p>\n<p>&nbsp;<\/p>\n<p><b>Merging Datasets<\/b><\/p>\n<p><span style=\"font-weight: 400;\">Besides the P3 and Flan 2021 datasets mentioned above, I have not seen attempts to create larger datasets by combining datasets from multiple sources. 
For instance, it could make sense to experiment with combinations of LIMA and Dolly, and so forth.<\/span><\/p>\n<p>&nbsp;<\/p>\n<p><b>Dataset Ordering<\/b><\/p>\n<p><span style=\"font-weight: 400;\">Following up on the dataset merging idea mentioned above, it could be interesting to explore the role of visiting different data points in different orders (for example, sorted or shuffled by the type of instruction). Besides the pretraining experiments done in the <\/span><a href=\"https:\/\/arxiv.org\/abs\/2304.01373\"><span style=\"font-weight: 400;\">Pythia paper<\/span><\/a><span style=\"font-weight: 400;\">, I have not seen any studies on dataset ordering in the context of instruction finetuning.<\/span><\/p>\n<p>&nbsp;<\/p>\n<p><b>Multiple-Epoch Training<\/b><\/p>\n<p><span style=\"font-weight: 400;\">Due to the large dataset size requirements, LLMs are usually pretrained for less than one epoch, which means that they don&#8217;t revisit data points multiple times. While computational costs are one reason for this, another is that LLMs can be prone to overfitting. Nonetheless, with many overfitting-reduction techniques at our disposal, studying multi-epoch training in the context of LLMs would be interesting.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">For instance, it&#8217;s possible to train LLMs on a small dataset like LIMA in a few minutes. 
Would it make sense to iterate over the dataset multiple times?<\/span><\/p>\n<p>&nbsp;<\/p>\n<p><b>Automatic Quality-filtering<\/b><\/p>\n<p><span style=\"font-weight: 400;\">Does it make sense to adopt dataset filtering as a default?<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Related to the LIMA study discussed earlier, the <\/span><a href=\"https:\/\/arxiv.org\/abs\/2307.08701\"><i><span style=\"font-weight: 400;\">AlpaGasus: Training A Better Alpaca with Fewer Data<\/span><\/i><\/a><span style=\"font-weight: 400;\"> paper also emphasizes that a larger dataset isn&#8217;t necessarily advantageous for finetuning LLMs. In the <\/span><i><span style=\"font-weight: 400;\">AlpaGasus <\/span><\/i><span style=\"font-weight: 400;\">study, the researchers employed ChatGPT to pinpoint low-quality instruction-response pairs in the original 52,000-instance Alpaca dataset. They discovered that reducing this to just 9,000 high-quality pairs actually enhanced performance when training Llama-v1 LLMs with 7 billion and 13 billion parameters.<\/span><\/p>\n<div id=\"attachment_5648790\" style=\"width: 613px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" aria-describedby=\"caption-attachment-5648790\" class=\"wp-image-5648790\" src=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/09\/LLM-dataset-research-image11.jpg\" alt=\"\" width=\"603\" height=\"552\" srcset=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/09\/LLM-dataset-research-image11.jpg 1456w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/09\/LLM-dataset-research-image11-300x274.jpg 300w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/09\/LLM-dataset-research-image11-1024x935.jpg 1024w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/09\/LLM-dataset-research-image11-300x274@2x.jpg 600w\" sizes=\"(max-width: 603px) 100vw, 603px\" \/><p id=\"caption-attachment-5648790\" 
class=\"wp-caption-text\">Annotated figure from the <a href=\"https:\/\/arxiv.org\/abs\/2307.08701\">AlpaGasus paper<\/a><\/p><\/div>\n<p>&nbsp;<\/p>\n<p><span style=\"font-weight: 400;\">However, as mentioned earlier, the <\/span><a href=\"https:\/\/llm-efficiency-challenge.github.io\"><span style=\"font-weight: 400;\">NeurIPS LLM Efficiency Challenge<\/span><\/a><span style=\"font-weight: 400;\"> does not permit LLM-generated datasets. So, this Alpaca-based AlpaGasus dataset could not be used in this competition.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">A viable alternative to AlpaGasus might be to use an LLM to filter human-generated (instead of LLM-generated) datasets. However, I&#8217;m uncertain if using LLM-based dataset filtering is allowed, so it would be important to confirm with the organizers on their Discord channel before using such datasets in the competition.<\/span><\/p>\n<p>&nbsp;<\/p>\n<h2 id=\"toc10\"><span style=\"font-weight: 400;\">Conclusion<\/span><\/h2>\n<p><span style=\"font-weight: 400;\">This article covered instruction finetuning and explained the advantages of LLM-generated and human-generated datasets. We also went over a quick tutorial explaining how to finetune open-source LLMs with different datasets and how to use our own datasets to create custom LLMs. 
Compared to proprietary APIs and services, such custom LLMs can help leverage specific datasets at your company, improve LLMs on certain use cases, and give you full control over data privacy.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">If you have any questions, please don&#8217;t hesitate to reach out:<\/span><\/p>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">If you have any suggestions or feedback, or if you run into a problem with Lit-GPT that looks like a bug, please consider filing an <\/span><a href=\"https:\/\/github.com\/Lightning-AI\/lit-gpt\/issues\"><span style=\"font-weight: 400;\">issue on GitHub<\/span><\/a><span style=\"font-weight: 400;\">.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">Furthermore, <\/span><a href=\"https:\/\/github.com\/Lightning-AI\/lit-gpt\/pulls\"><span style=\"font-weight: 400;\">Lit-GPT pull requests<\/span><\/a><span style=\"font-weight: 400;\"> with improvements and implementations of new techniques would be very welcome!<\/span><\/li>\n<\/ul>\n<p><span style=\"font-weight: 400;\">If you are participating in the <\/span><a href=\"https:\/\/llm-efficiency-challenge.github.io\"><span style=\"font-weight: 400;\">NeurIPS LLM Efficiency Challenge<\/span><\/a><span style=\"font-weight: 400;\">, I hope you find this competition as useful and exciting as I do.\u00a0<\/span><\/p>\n<ul>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">I suggest starting with the <\/span><a href=\"https:\/\/github.com\/Lightning-AI\/lit-gpt\/blob\/main\/tutorials\/neurips_challenge_quickstart.md\"><span style=\"font-weight: 400;\">Quick Starter Guide I compiled here<\/span><\/a><span style=\"font-weight: 400;\">.<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">For questions about whether a particular dataset is allowed in the competition, I recommend double-checking with the organizers via 
their <\/span><a href=\"https:\/\/discord.gg\/XJwQ5ddMK7\"><span style=\"font-weight: 400;\">Discord channel<\/span><\/a><span style=\"font-weight: 400;\">.\u00a0<\/span><\/li>\n<li style=\"font-weight: 400;\" aria-level=\"1\"><span style=\"font-weight: 400;\">For Lit-GPT-related questions about the challenge, my colleagues at Lightning AI also maintain a <\/span><a href=\"https:\/\/discord.gg\/MWAEvnC5fU\"><span style=\"font-weight: 400;\">Discord channel here<\/span><\/a><span style=\"font-weight: 400;\">.<\/span><\/li>\n<\/ul>\n<p><span style=\"font-weight: 400;\">Happy learning, coding, and experimenting!<\/span><\/p>\n","protected":false},"excerpt":{"rendered":"<p>This article focuses on improving the modeling performance of LLMs by finetuning them using carefully curated datasets. Specifically, this article highlights strategies that involve modifying, utilizing, or manipulating the datasets for instruction-based finetuning rather than altering the model architecture or training algorithms (the latter will be topics of a future article). 
This article will also<a class=\"excerpt-read-more\" href=\"https:\/\/lightning.ai\/pages\/community\/tutorial\/optimizing-llms-from-a-dataset-perspective\/\" title=\"ReadOptimizing LLMs from a Dataset Perspective\">&#8230; Read more &raquo;<\/a><\/p>\n","protected":false},"author":16,"featured_media":5648795,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"inline_featured_image":false,"footnotes":"","_links_to":"","_links_to_target":""},"categories":[27,106,41],"tags":[96,31,186,189,193,188,109,111],"glossary":[],"acf":{"additional_authors":false,"hide_from_archive":false,"content_type":"Blog Post","sticky":false,"code_embed":false,"custom_scripts":"","custom_styles":"main h2 {scroll-margin-top: 100px;scroll-padding-top: 100px;}.toc ul li:last-of-type{border-bottom:0px !important;box-shadow:none;}","tabs":false,"mathjax":false,"default_editor":true,"show_table_of_contents":true,"table_of_contents":"<h4>Table of Contents<\/h4>\n<ul>\n<li><a style=\"font-weight: 400;\" href=\"#toc1\">Supervised Instruction Finetuning<\/a><\/li>\n<li><a style=\"font-weight: 400;\" href=\"#toc2\">The Finetuning Pipeline and Dataset Origins<\/a><\/li>\n<li><a style=\"font-weight: 400;\" href=\"#toc3\">LLM-generated datasets<\/a><\/li>\n<li><a style=\"font-weight: 400;\" href=\"#toc4\">High-quality Datasets: Less May Be More<\/a><\/li>\n<li><a style=\"font-weight: 400;\" href=\"#toc5\">Finetuning LLMs on LIMA<\/a><\/li>\n<li><a style=\"font-weight: 400;\" href=\"#toc6\">Available Models and Datasets in Lit-GPT<\/a><\/li>\n<li><a style=\"font-weight: 400;\" href=\"#toc7\">Preparing New and Custom Datasets<\/a><\/li>\n<li><a style=\"font-weight: 400;\" href=\"#toc8\">Additional Datasets to Consider<\/a><\/li>\n<li><a style=\"font-weight: 400;\" href=\"#toc9\">Research Directions to Explore<\/a><\/li>\n<li><a style=\"font-weight: 400;\" href=\"#toc10\">Conclusion<\/a><\/li>\n<\/ul>\n<style>main h2 {scroll-margin-top: 
100px;scroll-padding-top: 100px;}.toc ul li:last-of-type{border-bottom:0px !important;box-shadow:none;}<\/style>\n"},"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v24.5 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Optimizing LLMs from a Dataset Perspective - Lightning AI<\/title>\n<meta name=\"description\" content=\"Discover new research directions to improve Large Language Models (LLMs) and learn how to enhance the performance of instruction-finetuned LLMs by concentrating on higher-quality data and exploring diverse dataset sources.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/lightning.ai\/pages\/community\/tutorial\/optimizing-llms-from-a-dataset-perspective\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Optimizing LLMs from a Dataset Perspective - Lightning AI\" \/>\n<meta property=\"og:description\" content=\"Discover new research directions to improve Large Language Models (LLMs) and learn how to enhance the performance of instruction-finetuned LLMs by concentrating on higher-quality data and exploring diverse dataset sources.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/lightning.ai\/pages\/community\/tutorial\/optimizing-llms-from-a-dataset-perspective\/\" \/>\n<meta property=\"og:site_name\" content=\"Lightning AI\" \/>\n<meta property=\"article:published_time\" content=\"2023-09-14T18:03:20+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2023-09-15T15:53:12+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/09\/Screenshot-2023-09-14-at-11.13.07-AM.png\" \/>\n\t<meta property=\"og:image:width\" content=\"1508\" \/>\n\t<meta property=\"og:image:height\" content=\"1334\" \/>\n\t<meta property=\"og:image:type\" 
content=\"image\/png\" \/>\n<meta name=\"author\" content=\"JP Hennessy\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@LightningAI\" \/>\n<meta name=\"twitter:site\" content=\"@LightningAI\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"JP Hennessy\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"15 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/lightning.ai\/pages\/community\/tutorial\/optimizing-llms-from-a-dataset-perspective\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/lightning.ai\/pages\/community\/tutorial\/optimizing-llms-from-a-dataset-perspective\/\"},\"author\":{\"name\":\"JP Hennessy\",\"@id\":\"https:\/\/lightning.ai\/pages\/#\/schema\/person\/2518f4d5541f8e98016f6289169141a6\"},\"headline\":\"Optimizing LLMs from a Dataset Perspective\",\"datePublished\":\"2023-09-14T18:03:20+00:00\",\"dateModified\":\"2023-09-15T15:53:12+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/lightning.ai\/pages\/community\/tutorial\/optimizing-llms-from-a-dataset-perspective\/\"},\"wordCount\":2899,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\/\/lightning.ai\/pages\/#organization\"},\"image\":{\"@id\":\"https:\/\/lightning.ai\/pages\/community\/tutorial\/optimizing-llms-from-a-dataset-perspective\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/09\/Screenshot-2023-09-14-at-11.13.07-AM.png\",\"keywords\":[\"ai\",\"deep learning\",\"finetuning\",\"GPT\",\"LLaMA\",\"LLMs\",\"NLP\",\"Open 
Source\"],\"articleSection\":[\"Articles\",\"Community\",\"Tutorials\"],\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\/\/lightning.ai\/pages\/community\/tutorial\/optimizing-llms-from-a-dataset-perspective\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/lightning.ai\/pages\/community\/tutorial\/optimizing-llms-from-a-dataset-perspective\/\",\"url\":\"https:\/\/lightning.ai\/pages\/community\/tutorial\/optimizing-llms-from-a-dataset-perspective\/\",\"name\":\"Optimizing LLMs from a Dataset Perspective - Lightning AI\",\"isPartOf\":{\"@id\":\"https:\/\/lightning.ai\/pages\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/lightning.ai\/pages\/community\/tutorial\/optimizing-llms-from-a-dataset-perspective\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/lightning.ai\/pages\/community\/tutorial\/optimizing-llms-from-a-dataset-perspective\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/09\/Screenshot-2023-09-14-at-11.13.07-AM.png\",\"datePublished\":\"2023-09-14T18:03:20+00:00\",\"dateModified\":\"2023-09-15T15:53:12+00:00\",\"description\":\"Discover new research directions to improve Large Language Models (LLMs) and learn how to enhance the performance of instruction-finetuned LLMs by concentrating on higher-quality data and exploring diverse dataset 
sources.\",\"breadcrumb\":{\"@id\":\"https:\/\/lightning.ai\/pages\/community\/tutorial\/optimizing-llms-from-a-dataset-perspective\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/lightning.ai\/pages\/community\/tutorial\/optimizing-llms-from-a-dataset-perspective\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/lightning.ai\/pages\/community\/tutorial\/optimizing-llms-from-a-dataset-perspective\/#primaryimage\",\"url\":\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/09\/Screenshot-2023-09-14-at-11.13.07-AM.png\",\"contentUrl\":\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/09\/Screenshot-2023-09-14-at-11.13.07-AM.png\",\"width\":1508,\"height\":1334},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/lightning.ai\/pages\/community\/tutorial\/optimizing-llms-from-a-dataset-perspective\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/lightning.ai\/pages\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Optimizing LLMs from a Dataset Perspective\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/lightning.ai\/pages\/#website\",\"url\":\"https:\/\/lightning.ai\/pages\/\",\"name\":\"Lightning AI\",\"description\":\"The platform for teams to build AI.\",\"publisher\":{\"@id\":\"https:\/\/lightning.ai\/pages\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/lightning.ai\/pages\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/lightning.ai\/pages\/#organization\",\"name\":\"Lightning 
AI\",\"url\":\"https:\/\/lightning.ai\/pages\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/lightning.ai\/pages\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/02\/image-17.png\",\"contentUrl\":\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/02\/image-17.png\",\"width\":1744,\"height\":856,\"caption\":\"Lightning AI\"},\"image\":{\"@id\":\"https:\/\/lightning.ai\/pages\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/x.com\/LightningAI\"]},{\"@type\":\"Person\",\"@id\":\"https:\/\/lightning.ai\/pages\/#\/schema\/person\/2518f4d5541f8e98016f6289169141a6\",\"name\":\"JP Hennessy\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/lightning.ai\/pages\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/28ade268218ae45f723b0b62499f527a?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/28ade268218ae45f723b0b62499f527a?s=96&d=mm&r=g\",\"caption\":\"JP Hennessy\"},\"url\":\"https:\/\/lightning.ai\/pages\/author\/jplightning-ai\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"Optimizing LLMs from a Dataset Perspective - Lightning AI","description":"Discover new research directions to improve Large Language Models (LLMs) and learn how to enhance the performance of instruction-finetuned LLMs by concentrating on higher-quality data and exploring diverse dataset sources.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/lightning.ai\/pages\/community\/tutorial\/optimizing-llms-from-a-dataset-perspective\/","og_locale":"en_US","og_type":"article","og_title":"Optimizing LLMs from a Dataset Perspective - Lightning AI","og_description":"Discover new research directions to improve Large Language Models (LLMs) and learn how to enhance the performance of instruction-finetuned LLMs by concentrating on higher-quality data and exploring diverse dataset sources.","og_url":"https:\/\/lightning.ai\/pages\/community\/tutorial\/optimizing-llms-from-a-dataset-perspective\/","og_site_name":"Lightning AI","article_published_time":"2023-09-14T18:03:20+00:00","article_modified_time":"2023-09-15T15:53:12+00:00","og_image":[{"width":1508,"height":1334,"url":"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/09\/Screenshot-2023-09-14-at-11.13.07-AM.png","type":"image\/png"}],"author":"JP Hennessy","twitter_card":"summary_large_image","twitter_creator":"@LightningAI","twitter_site":"@LightningAI","twitter_misc":{"Written by":"JP Hennessy","Est. 
reading time":"15 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/lightning.ai\/pages\/community\/tutorial\/optimizing-llms-from-a-dataset-perspective\/#article","isPartOf":{"@id":"https:\/\/lightning.ai\/pages\/community\/tutorial\/optimizing-llms-from-a-dataset-perspective\/"},"author":{"name":"JP Hennessy","@id":"https:\/\/lightning.ai\/pages\/#\/schema\/person\/2518f4d5541f8e98016f6289169141a6"},"headline":"Optimizing LLMs from a Dataset Perspective","datePublished":"2023-09-14T18:03:20+00:00","dateModified":"2023-09-15T15:53:12+00:00","mainEntityOfPage":{"@id":"https:\/\/lightning.ai\/pages\/community\/tutorial\/optimizing-llms-from-a-dataset-perspective\/"},"wordCount":2899,"commentCount":0,"publisher":{"@id":"https:\/\/lightning.ai\/pages\/#organization"},"image":{"@id":"https:\/\/lightning.ai\/pages\/community\/tutorial\/optimizing-llms-from-a-dataset-perspective\/#primaryimage"},"thumbnailUrl":"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/09\/Screenshot-2023-09-14-at-11.13.07-AM.png","keywords":["ai","deep learning","finetuning","GPT","LLaMA","LLMs","NLP","Open Source"],"articleSection":["Articles","Community","Tutorials"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/lightning.ai\/pages\/community\/tutorial\/optimizing-llms-from-a-dataset-perspective\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/lightning.ai\/pages\/community\/tutorial\/optimizing-llms-from-a-dataset-perspective\/","url":"https:\/\/lightning.ai\/pages\/community\/tutorial\/optimizing-llms-from-a-dataset-perspective\/","name":"Optimizing LLMs from a Dataset Perspective - Lightning 
AI","isPartOf":{"@id":"https:\/\/lightning.ai\/pages\/#website"},"primaryImageOfPage":{"@id":"https:\/\/lightning.ai\/pages\/community\/tutorial\/optimizing-llms-from-a-dataset-perspective\/#primaryimage"},"image":{"@id":"https:\/\/lightning.ai\/pages\/community\/tutorial\/optimizing-llms-from-a-dataset-perspective\/#primaryimage"},"thumbnailUrl":"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/09\/Screenshot-2023-09-14-at-11.13.07-AM.png","datePublished":"2023-09-14T18:03:20+00:00","dateModified":"2023-09-15T15:53:12+00:00","description":"Discover new research directions to improve Large Language Models (LLMs) and learn how to enhance the performance of instruction-finetuned LLMs by concentrating on higher-quality data and exploring diverse dataset sources.","breadcrumb":{"@id":"https:\/\/lightning.ai\/pages\/community\/tutorial\/optimizing-llms-from-a-dataset-perspective\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/lightning.ai\/pages\/community\/tutorial\/optimizing-llms-from-a-dataset-perspective\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/lightning.ai\/pages\/community\/tutorial\/optimizing-llms-from-a-dataset-perspective\/#primaryimage","url":"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/09\/Screenshot-2023-09-14-at-11.13.07-AM.png","contentUrl":"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/09\/Screenshot-2023-09-14-at-11.13.07-AM.png","width":1508,"height":1334},{"@type":"BreadcrumbList","@id":"https:\/\/lightning.ai\/pages\/community\/tutorial\/optimizing-llms-from-a-dataset-perspective\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/lightning.ai\/pages\/"},{"@type":"ListItem","position":2,"name":"Optimizing LLMs from a Dataset Perspective"}]},{"@type":"WebSite","@id":"https:\/\/lightning.ai\/pages\/#website","url":"https:\/\/lightning.ai\/pages\/","name":"Lightning 
AI","description":"The platform for teams to build AI.","publisher":{"@id":"https:\/\/lightning.ai\/pages\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/lightning.ai\/pages\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/lightning.ai\/pages\/#organization","name":"Lightning AI","url":"https:\/\/lightning.ai\/pages\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/lightning.ai\/pages\/#\/schema\/logo\/image\/","url":"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/02\/image-17.png","contentUrl":"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/02\/image-17.png","width":1744,"height":856,"caption":"Lightning AI"},"image":{"@id":"https:\/\/lightning.ai\/pages\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/x.com\/LightningAI"]},{"@type":"Person","@id":"https:\/\/lightning.ai\/pages\/#\/schema\/person\/2518f4d5541f8e98016f6289169141a6","name":"JP Hennessy","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/lightning.ai\/pages\/#\/schema\/person\/image\/","url":"https:\/\/secure.gravatar.com\/avatar\/28ade268218ae45f723b0b62499f527a?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/28ade268218ae45f723b0b62499f527a?s=96&d=mm&r=g","caption":"JP 
Hennessy"},"url":"https:\/\/lightning.ai\/pages\/author\/jplightning-ai\/"}]}},"_links":{"self":[{"href":"https:\/\/lightning.ai\/pages\/wp-json\/wp\/v2\/posts\/5648765"}],"collection":[{"href":"https:\/\/lightning.ai\/pages\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/lightning.ai\/pages\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/lightning.ai\/pages\/wp-json\/wp\/v2\/users\/16"}],"replies":[{"embeddable":true,"href":"https:\/\/lightning.ai\/pages\/wp-json\/wp\/v2\/comments?post=5648765"}],"version-history":[{"count":0,"href":"https:\/\/lightning.ai\/pages\/wp-json\/wp\/v2\/posts\/5648765\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/lightning.ai\/pages\/wp-json\/wp\/v2\/media\/5648795"}],"wp:attachment":[{"href":"https:\/\/lightning.ai\/pages\/wp-json\/wp\/v2\/media?parent=5648765"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/lightning.ai\/pages\/wp-json\/wp\/v2\/categories?post=5648765"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/lightning.ai\/pages\/wp-json\/wp\/v2\/tags?post=5648765"},{"taxonomy":"glossary","embeddable":true,"href":"https:\/\/lightning.ai\/pages\/wp-json\/wp\/v2\/glossary?post=5648765"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}