{"id":5648236,"date":"2023-06-19T07:30:58","date_gmt":"2023-06-19T11:30:58","guid":{"rendered":"https:\/\/lightning.ai\/pages\/?p=5648236"},"modified":"2023-06-22T13:30:32","modified_gmt":"2023-06-22T17:30:32","slug":"efficient-initialization-of-large-models","status":"publish","type":"post","link":"https:\/\/lightning.ai\/pages\/community\/efficient-initialization-of-large-models\/","title":{"rendered":"Efficient Initialization of Large Models"},"content":{"rendered":"<div class=\"takeaways card-glow p-4 my-4\"><h3 class=\"w-100 d-block\">Takeaways<\/h3><br \/>\nOne of the challenges with LLMs is their cost and large memory footprint. In the upcoming Lightning 2.1 release, we introduce new features that optimize all three stages of LLM usage: pretraining, finetuning, and inference!<br \/>\n<\/div>\n<p>One of the biggest challenges with LLMs is dealing with their large GPU memory requirements. In our <a class=\"notion-link-token notion-focusable-token notion-enable-hover\" tabindex=\"0\" href=\"https:\/\/github.com\/Lightning-AI\/lit-llama\" rel=\"noopener noreferrer\" data-token-index=\"1\"><span class=\"link-annotation-unknown-block-id-1361474277\">Lit-LLaMA<\/span><\/a> and <a class=\"notion-link-token notion-focusable-token notion-enable-hover\" tabindex=\"0\" href=\"https:\/\/github.com\/Lightning-AI\/lit-parrot\" rel=\"noopener noreferrer\" data-token-index=\"3\"><span class=\"link-annotation-unknown-block-id--639100698\">Lit-Parrot<\/span><\/a> open-source LLM repositories, we\u2019ve implemented a few tricks that make it possible to run these models efficiently on consumer GPUs with limited memory. 
In the upcoming Lightning 2.1 release, we\u2019re making some of these improvements more widely available through <a class=\"notion-link-token notion-focusable-token notion-enable-hover\" tabindex=\"0\" href=\"https:\/\/lightning.ai\/fabric\" rel=\"noopener noreferrer\" data-token-index=\"5\"><span class=\"link-annotation-unknown-block-id-1305375513\">Lightning Fabric<\/span><\/a> so you can apply them to your own models by changing just one line of code!<\/p>\n<p>&nbsp;<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-5648285 size-large\" src=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/06\/Screenshot-2023-06-15-at-01.03.28-1024x276.png\" alt=\"\" width=\"1024\" height=\"276\" srcset=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/06\/Screenshot-2023-06-15-at-01.03.28-1024x276.png 1024w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/06\/Screenshot-2023-06-15-at-01.03.28-300x81.png 300w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/06\/Screenshot-2023-06-15-at-01.03.28-1536x415.png 1536w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/06\/Screenshot-2023-06-15-at-01.03.28-2048x553.png 2048w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/06\/Screenshot-2023-06-15-at-01.03.28-300x81@2x.png 600w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/>Figure 1: We\u2019re introducing Fabric.init_module(), a trick to get your LLM onto the GPU faster while also saving on peak memory. And by enabling quantization and lazy loading, you can squeeze out even more memory savings. Lower numbers are better.<\/p>\n<h2>Efficient initialization with Fabric<\/h2>\n<p><a href=\"https:\/\/lightning.ai\/fabric\">Lightning Fabric<\/a> is what we use in our Lit-* repositories to minimize the boilerplate code needed to run models on different hardware without changing the code. 
Recently, we\u2019ve added a convenient context manager called <a href=\"https:\/\/lightning.ai\/docs\/fabric\/latest\/api\/fabric_methods.html#init-module\">Fabric.init_module()<\/a>\u00a0that handles several things for you:<\/p>\n<ul>\n<li>Creating the model directly on the target device (e.g., GPU) without first allocating memory on the CPU<\/li>\n<li>Creating the weight tensors in the desired precision (e.g., float16) without first allocating memory for the full-precision weights<\/li>\n<li>Optionally delaying the allocation of memory if the model is so large that it needs to be spread across multiple GPUs (FSDP, DeepSpeed). More on this in a future blog post!<\/li>\n<\/ul>\n<p>These three features combined reduce the peak memory usage during initialization and ultimately reduce the risk of running out of memory.<\/p>\n<p>Here is the naive way of getting the model onto the GPU for inference. We\u2019re initializing the weights of the Lit-LLaMA model, moving it to the GPU, and then converting it to a lower precision, which in total requires around 28 GB of memory if done this way:<\/p>\n<pre class=\"code-shortcode dark-theme window- collapse-false \" style=\"--height:falsepx\"><code class=\"language-python\"><br \/>\nfrom lit_llama import LLaMA\n\nmodel = LLaMA.from_name(\"7B\")<br \/>\nmodel.cuda().bfloat16()<br \/>\n<\/code><div class=\"copy-button\"><button class=\"expand-button\">Expand<\/button><button class=\"copy\">Copy<\/button><\/div><\/pre>\n<p>It is pretty slow, and we would run out of memory if our GPU has less than 28 GB. 
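As a sanity check on that 28 GB figure: it is roughly the parameter count times four bytes, because the weights are first materialized in float32. A back-of-the-envelope sketch (the round 7-billion parameter count is an illustrative approximation):

```python
# Approximate weight memory of a "7B" model in different precisions.
# The round parameter count is an approximation used only for illustration.
n_params = 7_000_000_000

fp32_gb = n_params * 4 / 1e9  # float32: 4 bytes per weight
bf16_gb = n_params * 2 / 1e9  # bfloat16: 2 bytes per weight
int8_gb = n_params * 1 / 1e9  # int8: 1 byte per weight

print(f"float32 ~{fp32_gb:.0f} GB, bfloat16 ~{bf16_gb:.0f} GB, int8 ~{int8_gb:.0f} GB")
```

Materializing the weights in float32 before casting is exactly where the ~28 GB peak comes from; creating them directly in bfloat16 halves it.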
Here is the efficient alternative with Fabric and <a href=\"https:\/\/lightning.ai\/docs\/fabric\/latest\/api\/fabric_methods.html#init-module\">Fabric.init_module()<\/a>:<\/p>\n<pre class=\"code-shortcode dark-theme window- collapse-false \" style=\"--height:falsepx\"><code class=\"language-python\"><br \/>\nfrom lit_llama import LLaMA<br \/>\nimport lightning as L\n\nfabric = L.Fabric(accelerator=\"cuda\", precision=\"bf16-true\")\n\nwith fabric.init_module():<br \/>\n    model = LLaMA.from_name(\"7B\")<br \/>\n<\/code><div class=\"copy-button\"><button class=\"expand-button\">Expand<\/button><button class=\"copy\">Copy<\/button><\/div><\/pre>\n<p>This is much faster and only takes half the memory.<\/p>\n<p>Let\u2019s take a look at some concrete numbers in an end-to-end example where we will compare the memory consumption and loading speed of LLaMA 7B on a consumer GPU.<\/p>\n<h2>Full example<\/h2>\n<p>Here we want to look at a realistic example of performing inference with a 7B LLaMA model. 
But before we can do that, we need to download and install a few things:<\/p>\n<ol>\n<li>Install <a href=\"https:\/\/github.com\/Lightning-AI\/lit-llama\">Lit-LLaMA<\/a> following the steps in the README:<br \/>\n<pre class=\"code-shortcode dark-theme window- collapse-false \" style=\"--height:falsepx\"><code class=\"language-python\"><br \/>\ngit clone https:\/\/github.com\/Lightning-AI\/lit-llama<br \/>\ncd lit-llama<br \/>\npip install -r requirements.txt<br \/>\n<\/code><div class=\"copy-button\"><button class=\"expand-button\">Expand<\/button><button class=\"copy\">Copy<\/button><\/div><\/pre><\/li>\n<li>Download and convert the weights using <a href=\"https:\/\/github.com\/Lightning-AI\/lit-llama\/blob\/main\/howto\/download_weights.md\">the how-to guide<\/a>:<pre class=\"code-shortcode dark-theme window- collapse-false \" style=\"--height:falsepx\"><code class=\"language-python\"><br \/>\npython scripts\/download.py \\<br \/>\n    --repo_id openlm-research\/open_llama_7b \\<br \/>\n    --local_dir checkpoints\/open-llama\/7B\n\npython scripts\/convert_hf_checkpoint.py \\<br \/>\n    --checkpoint_dir checkpoints\/open-llama\/7B \\<br \/>\n    --model_size 7B\n\nls checkpoints\/lit-llama<br \/>\n<\/code><div class=\"copy-button\"><button class=\"expand-button\">Expand<\/button><button class=\"copy\">Copy<\/button><\/div><\/pre><\/li>\n<\/ol>\n<p>At this point, you should already be able to use the model for inference by running<\/p>\n<pre class=\"code-shortcode dark-theme window- collapse-false \" style=\"--height:falsepx\"><code class=\"language-python\">python generate.py<\/code><div class=\"copy-button\"><button class=\"expand-button\">Expand<\/button><button class=\"copy\">Copy<\/button><\/div><\/pre>\n<p>but we will now write our own minimal inference code to measure a few things. Let\u2019s start with the baseline implementation, the standard way to load and run a model in PyTorch, <em>without any optimizations applied<\/em>. 
Hence, we simply create the model, load the checkpoint, and measure how long that takes:<\/p>\n<pre class=\"code-shortcode dark-theme window- collapse-false \" style=\"--height:falsepx\"><code class=\"language-python\"><br \/>\n# BASELINE - no optimizations<br \/>\nimport time<br \/>\nimport lightning as L<br \/>\nimport torch<br \/>\nfrom generate import generate<br \/>\nfrom lit_llama import LLaMA, Tokenizer\n\n# Init Fabric: Run on 1 GPU, with 16-bit precision<br \/>\nfabric = L.Fabric(accelerator=\"cuda\", devices=1, precision=\"bf16-true\")<br \/>\n# Load pretrained weights file<br \/>\ncheckpoint = torch.load(\"checkpoints\/lit-llama\/7B\/lit-llama.pth\")<br \/>\n# Measure the time it takes to init the model and load weights<br \/>\nt0 = time.time()<br \/>\nmodel = LLaMA.from_name(\"7B\")<br \/>\nmodel.load_state_dict(checkpoint)<br \/>\nprint(f\"Time to load model: {time.time() - t0:.02f} seconds.\")<br \/>\n<\/code><div class=\"copy-button\"><button class=\"expand-button\">Expand<\/button><button class=\"copy\">Copy<\/button><\/div><\/pre>\n<p>To get a realistic use case, we should also include an actual inference pass:<\/p>\n<pre class=\"code-shortcode dark-theme window- collapse-false \" style=\"--height:falsepx\"><code class=\"language-python\"><br \/>\nmodel.eval()<br \/>\nmodel = fabric.setup(model)\n\n# Let LLaMA complete the following sentence:<br \/>\nprompt = \"Hello, my name is\"<br \/>\ntokenizer = Tokenizer(\"checkpoints\/lit-llama\/tokenizer.model\")<br \/>\nencoded = tokenizer.encode(prompt, bos=True, eos=False, device=fabric.device)<br \/>\nprompt_length = encoded.size(0)<br \/>\ny = generate(model, encoded, max_new_tokens=50, temperature=0.8, top_k=200)\n\n# Print the response and the max. 
# memory used by our GPU<br \/>\nprint(tokenizer.decode(y))<br \/>\nprint(f\"Memory used: {torch.cuda.max_memory_reserved() \/ 1e9:.02f} GB\")<br \/>\n<\/code><div class=\"copy-button\"><button class=\"expand-button\">Expand<\/button><button class=\"copy\">Copy<\/button><\/div><\/pre>\n<p>At the end, this script prints the time it took to load the model and the total amount of memory used on the GPU:<\/p>\n<pre class=\"code-shortcode dark-theme window- collapse-false \" style=\"--height:falsepx\"><code class=\"language-python\">Time to load model: 38.99 seconds.<br \/>\nMemory used: 13.54 GB<\/code><div class=\"copy-button\"><button class=\"expand-button\">Expand<\/button><button class=\"copy\">Copy<\/button><\/div><\/pre>\n<h2>Optimizing loading time and memory usage<\/h2>\n<p>The time to load the model is high because it first gets created on the CPU and only then moved to the GPU. The larger the model, the greater this impact. To avoid the redundant creation on the CPU, we could have PyTorch create the model directly on the GPU by making this modification:<\/p>\n<pre class=\"code-shortcode dark-theme window- collapse-false \" style=\"--height:falsepx\"><code class=\"language-python\">\r\nwith fabric.device: # &lt;-- add this\r\n    model = LLaMA.from_name(\"7B\")<\/code><div class=\"copy-button\"><button class=\"expand-button\">Expand<\/button><button class=\"copy\">Copy<\/button><\/div><\/pre>\n<p>While this is faster now (only ~3 seconds), the memory consumption went up to ~28 GB because the weights get allocated in full precision (32-bit). However, we would like to <a href=\"https:\/\/lightning.ai\/pages\/community\/tutorial\/accelerating-large-language-models-with-mixed-precision-techniques\/\">run the model in 16-bit precision, or even 8-bit quantized<\/a> (more about that later). A memory peak like the one we see here is undesirable if we are going to convert the model to lower-bit precision later anyway. 
Consumer GPUs would have run out of memory here (and you might have just done so if you\u2019re following this tutorial on a small GPU).<\/p>\n<p>Finally, let\u2019s try the new <code>init_module()<\/code> feature in Fabric by replacing the above code with this:<\/p>\n<pre class=\"code-shortcode dark-theme window- collapse-false \" style=\"--height:falsepx\"><code class=\"language-python\">\r\nwith fabric.init_module(): # &lt;-- add this!\r\n    model = LLaMA.from_name(\"7B\")\r\n\r\nmodel.load_state_dict(checkpoint)<\/code><div class=\"copy-button\"><button class=\"expand-button\">Expand<\/button><button class=\"copy\">Copy<\/button><\/div><\/pre>\n<p>We\u2019re getting a fast load time (~4 seconds) and lower memory usage (~14 GB). We can summarize our findings in a table:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-5648241 size-full\" src=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/06\/first.png\" alt=\"\" width=\"2488\" height=\"672\" srcset=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/06\/first.png 2488w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/06\/first-300x81.png 300w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/06\/first-1024x277.png 1024w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/06\/first-1536x415.png 1536w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/06\/first-2048x553.png 2048w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/06\/first-300x81@2x.png 600w\" sizes=\"(max-width: 2488px) 100vw, 2488px\" \/><\/p>\n<p>In addition, we\u2019ve listed the CPU memory consumption. While it is great to see the 2x relative improvement with <code>init_module()<\/code> over the baseline, the absolute numbers are still too high to make the 7B model run on a typical consumer machine with 12 GB of GPU memory and &lt;32 GB of CPU memory. 
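For intuition, the effect of init_module() can be approximated in plain PyTorch with the factory keyword arguments device and dtype, which allocate a module's parameters on the target device and in reduced precision from the start. This is only a sketch of the underlying idea, with a small Linear layer standing in for a real model; it is not Fabric's implementation, which applies the same idea to arbitrary models via a context manager:

```python
import torch
import torch.nn as nn

# Pick the target device; falls back to CPU when no GPU is present.
device = "cuda" if torch.cuda.is_available() else "cpu"

# Allocate the parameters directly on the target device and in 16-bit,
# so no full-precision CPU copy is ever created along the way.
model = nn.Linear(1024, 1024, device=device, dtype=torch.bfloat16)

print(model.weight.dtype)        # torch.bfloat16
print(model.weight.device.type)  # "cuda" when a GPU is available
```

The advantage of the context-manager form is that it also covers models whose constructors you do not control.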
Luckily, we have two more tricks up our sleeves.<\/p>\n<h2>Lazy-loading and quantization<\/h2>\n<p>The high CPU memory usage comes from loading the checkpoint into CPU memory before copying the weights into the model on the GPU. We eliminate this redundancy in Lit-LLaMA by <a href=\"https:\/\/github.com\/Lightning-AI\/lit-llama\/blob\/8d865ec94c48e41f7e5c896abf4bae68d5d5cff5\/generate.py#L126\">lazy-loading the weight tensors in the checkpoint directly into the model on the GPU<\/a>. In layman\u2019s terms, the trick is to load the weight tensors from the checkpoint one by one. This means we only ever need memory for a single weight tensor at a time and never have to load the entire checkpoint (30 GB+) at once, as is normally done in PyTorch.<\/p>\n<p>Furthermore, we convert the weight matrices of the linear layers from 16-bit to 8-bit, which results in a ~2x smaller memory footprint. To do this without a loss in predictive accuracy, we use a quantization method called <a href=\"https:\/\/arxiv.org\/abs\/2208.07339\">LLM.int8()<\/a> implemented in the <a href=\"https:\/\/github.com\/TimDettmers\/bitsandbytes\">bitsandbytes<\/a> library. 
This transformation is inexpensive. It works by identifying outliers, i.e., values that would incur a large error when truncated to 8 bits, and performing the matrix multiplications involving them in 16-bit, while the majority of the values (the inliers) are processed in 8-bit.<\/p>\n<p>The impact of lazy loading and quantization on top of <code>init_module()<\/code> is shown in the table below.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone wp-image-5648242 size-full\" src=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/06\/second.png\" alt=\"\" width=\"2488\" height=\"672\" srcset=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/06\/second.png 2488w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/06\/second-300x81.png 300w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/06\/second-1024x277.png 1024w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/06\/second-1536x415.png 1536w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/06\/second-2048x553.png 2048w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/06\/second-300x81@2x.png 600w\" sizes=\"(max-width: 2488px) 100vw, 2488px\" \/><\/p>\n<h2>Conclusion<\/h2>\n<p>In this tutorial, we\u2019ve learned about the <code>init_module()<\/code> feature in the upcoming Lightning Fabric release, which helps us keep the peak GPU memory usage under control and enables fast loading times. It is especially helpful when we intend to run our model in lower-bit precision, since it avoids redundant memory allocation on both the GPU and the CPU. For highly optimized inference and finetuning scripts, check out our lit-* repositories <a href=\"https:\/\/github.com\/Lightning-AI\/lit-llama\">Lit-LLaMA<\/a> and <a href=\"https:\/\/github.com\/Lightning-AI\/lit-parrot\">Lit-Parrot<\/a>. 
They contain state-of-the-art LLMs, highly optimized with the techniques discussed here, and thanks to boilerplate-free, minimalistic code, they are easy to use even if you are new to working with LLMs.<\/p>\n<p>Join our <a href=\"https:\/\/discord.gg\/nnAuZvqTu3\">Discord community<\/a> to chat and ask your questions!<\/p>\n","protected":false},"excerpt":{"rendered":"<p>One of the biggest challenges with LLMs is dealing with their large GPU memory requirements. In our Lit-LLaMA and Lit-Parrot open-source LLM repositories, we\u2019ve implemented a few tricks that make it possible to run these models efficiently on consumer GPUs with limited memory. In the upcoming Lightning 2.1 release, we\u2019re making some of these improvements<a class=\"excerpt-read-more\" href=\"https:\/\/lightning.ai\/pages\/community\/efficient-initialization-of-large-models\/\" title=\"Read Efficient Initialization of Large Models\">&#8230; Read more &raquo;<\/a><\/p>\n","protected":false},"author":39,"featured_media":5648287,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"inline_featured_image":false,"footnotes":"","_links_to":"","_links_to_target":""},"categories":[27,106,41],"tags":[],"glossary":[217],"acf":{"additional_authors":false,"mathjax":false,"default_editor":true,"show_table_of_contents":false,"hide_from_archive":false,"content_type":"Blog Post","sticky":false,"custom_styles":""},"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v24.5 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Efficient Initialization of Large Models - Lightning AI<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/lightning.ai\/pages\/community\/article\/efficient-initialization-of-large-models\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta 
property=\"og:title\" content=\"Efficient Initialization of Large Models - Lightning AI\" \/>\n<meta property=\"og:description\" content=\"One of the biggest challenges with LLMs is dealing with their large GPU memory requirements. In our Lit-LLaMA and Lit-Parrot open-source LLM repositories, we\u2019ve implemented a few tricks that make it possible to run these models efficiently on consumer GPUs with limited memory. In the upcoming Lightning 2.1 release, we\u2019re making some of these improvements... Read more &raquo;\" \/>\n<meta property=\"og:url\" content=\"https:\/\/lightning.ai\/pages\/community\/article\/efficient-initialization-of-large-models\/\" \/>\n<meta property=\"og:site_name\" content=\"Lightning AI\" \/>\n<meta property=\"article:published_time\" content=\"2023-06-19T11:30:58+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2023-06-22T17:30:32+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/06\/Untitled-1.png\" \/>\n\t<meta property=\"og:image:width\" content=\"719\" \/>\n\t<meta property=\"og:image:height\" content=\"477\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"Lightning.ai\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@LightningAI\" \/>\n<meta name=\"twitter:site\" content=\"@LightningAI\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Lightning.ai\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"7 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/lightning.ai\/pages\/community\/article\/efficient-initialization-of-large-models\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/lightning.ai\/pages\/community\/article\/efficient-initialization-of-large-models\/\"},\"author\":{\"name\":\"Lightning.ai\",\"@id\":\"https:\/\/lightning.ai\/pages\/#\/schema\/person\/d53c9386be275d278c59022570c0d859\"},\"headline\":\"Efficient Initialization of Large Models\",\"datePublished\":\"2023-06-19T11:30:58+00:00\",\"dateModified\":\"2023-06-22T17:30:32+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/lightning.ai\/pages\/community\/article\/efficient-initialization-of-large-models\/\"},\"wordCount\":1384,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\/\/lightning.ai\/pages\/#organization\"},\"image\":{\"@id\":\"https:\/\/lightning.ai\/pages\/community\/article\/efficient-initialization-of-large-models\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/06\/Untitled-1.png\",\"articleSection\":[\"Articles\",\"Community\",\"Tutorials\"],\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\/\/lightning.ai\/pages\/community\/article\/efficient-initialization-of-large-models\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/lightning.ai\/pages\/community\/article\/efficient-initialization-of-large-models\/\",\"url\":\"https:\/\/lightning.ai\/pages\/community\/article\/efficient-initialization-of-large-models\/\",\"name\":\"Efficient Initialization of Large Models - Lightning 
AI\",\"isPartOf\":{\"@id\":\"https:\/\/lightning.ai\/pages\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/lightning.ai\/pages\/community\/article\/efficient-initialization-of-large-models\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/lightning.ai\/pages\/community\/article\/efficient-initialization-of-large-models\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/06\/Untitled-1.png\",\"datePublished\":\"2023-06-19T11:30:58+00:00\",\"dateModified\":\"2023-06-22T17:30:32+00:00\",\"breadcrumb\":{\"@id\":\"https:\/\/lightning.ai\/pages\/community\/article\/efficient-initialization-of-large-models\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/lightning.ai\/pages\/community\/article\/efficient-initialization-of-large-models\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/lightning.ai\/pages\/community\/article\/efficient-initialization-of-large-models\/#primaryimage\",\"url\":\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/06\/Untitled-1.png\",\"contentUrl\":\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/06\/Untitled-1.png\",\"width\":719,\"height\":477},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/lightning.ai\/pages\/community\/article\/efficient-initialization-of-large-models\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/lightning.ai\/pages\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Efficient Initialization of Large Models\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/lightning.ai\/pages\/#website\",\"url\":\"https:\/\/lightning.ai\/pages\/\",\"name\":\"Lightning AI\",\"description\":\"The platform for teams to build 
AI.\",\"publisher\":{\"@id\":\"https:\/\/lightning.ai\/pages\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/lightning.ai\/pages\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/lightning.ai\/pages\/#organization\",\"name\":\"Lightning AI\",\"url\":\"https:\/\/lightning.ai\/pages\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/lightning.ai\/pages\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/02\/image-17.png\",\"contentUrl\":\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/02\/image-17.png\",\"width\":1744,\"height\":856,\"caption\":\"Lightning AI\"},\"image\":{\"@id\":\"https:\/\/lightning.ai\/pages\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/x.com\/LightningAI\"]},{\"@type\":\"Person\",\"@id\":\"https:\/\/lightning.ai\/pages\/#\/schema\/person\/d53c9386be275d278c59022570c0d859\",\"name\":\"Lightning.ai\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/lightning.ai\/pages\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/b75fef9be69cb600f385dfba5525cf77?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/b75fef9be69cb600f385dfba5525cf77?s=96&d=mm&r=g\",\"caption\":\"Lightning.ai\"},\"url\":\"https:\/\/lightning.ai\/pages\/author\/lightning-ai\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"Efficient Initialization of Large Models - Lightning AI","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/lightning.ai\/pages\/community\/article\/efficient-initialization-of-large-models\/","og_locale":"en_US","og_type":"article","og_title":"Efficient Initialization of Large Models - Lightning AI","og_description":"One of the biggest challenges with LLMs is dealing with their large GPU memory requirements. In our Lit-LLaMA and Lit-Parrot open-source LLM repositories, we\u2019ve implemented a few tricks that make it possible to run these models efficiently on consumer GPUs with limited memory. In the upcoming Lightning 2.1 release, we\u2019re making some of these improvements... Read more &raquo;","og_url":"https:\/\/lightning.ai\/pages\/community\/article\/efficient-initialization-of-large-models\/","og_site_name":"Lightning AI","article_published_time":"2023-06-19T11:30:58+00:00","article_modified_time":"2023-06-22T17:30:32+00:00","og_image":[{"width":719,"height":477,"url":"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/06\/Untitled-1.png","type":"image\/png"}],"author":"Lightning.ai","twitter_card":"summary_large_image","twitter_creator":"@LightningAI","twitter_site":"@LightningAI","twitter_misc":{"Written by":"Lightning.ai","Est. 
reading time":"7 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/lightning.ai\/pages\/community\/article\/efficient-initialization-of-large-models\/#article","isPartOf":{"@id":"https:\/\/lightning.ai\/pages\/community\/article\/efficient-initialization-of-large-models\/"},"author":{"name":"Lightning.ai","@id":"https:\/\/lightning.ai\/pages\/#\/schema\/person\/d53c9386be275d278c59022570c0d859"},"headline":"Efficient Initialization of Large Models","datePublished":"2023-06-19T11:30:58+00:00","dateModified":"2023-06-22T17:30:32+00:00","mainEntityOfPage":{"@id":"https:\/\/lightning.ai\/pages\/community\/article\/efficient-initialization-of-large-models\/"},"wordCount":1384,"commentCount":0,"publisher":{"@id":"https:\/\/lightning.ai\/pages\/#organization"},"image":{"@id":"https:\/\/lightning.ai\/pages\/community\/article\/efficient-initialization-of-large-models\/#primaryimage"},"thumbnailUrl":"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/06\/Untitled-1.png","articleSection":["Articles","Community","Tutorials"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/lightning.ai\/pages\/community\/article\/efficient-initialization-of-large-models\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/lightning.ai\/pages\/community\/article\/efficient-initialization-of-large-models\/","url":"https:\/\/lightning.ai\/pages\/community\/article\/efficient-initialization-of-large-models\/","name":"Efficient Initialization of Large Models - Lightning 
AI","isPartOf":{"@id":"https:\/\/lightning.ai\/pages\/#website"},"primaryImageOfPage":{"@id":"https:\/\/lightning.ai\/pages\/community\/article\/efficient-initialization-of-large-models\/#primaryimage"},"image":{"@id":"https:\/\/lightning.ai\/pages\/community\/article\/efficient-initialization-of-large-models\/#primaryimage"},"thumbnailUrl":"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/06\/Untitled-1.png","datePublished":"2023-06-19T11:30:58+00:00","dateModified":"2023-06-22T17:30:32+00:00","breadcrumb":{"@id":"https:\/\/lightning.ai\/pages\/community\/article\/efficient-initialization-of-large-models\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/lightning.ai\/pages\/community\/article\/efficient-initialization-of-large-models\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/lightning.ai\/pages\/community\/article\/efficient-initialization-of-large-models\/#primaryimage","url":"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/06\/Untitled-1.png","contentUrl":"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/06\/Untitled-1.png","width":719,"height":477},{"@type":"BreadcrumbList","@id":"https:\/\/lightning.ai\/pages\/community\/article\/efficient-initialization-of-large-models\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/lightning.ai\/pages\/"},{"@type":"ListItem","position":2,"name":"Efficient Initialization of Large Models"}]},{"@type":"WebSite","@id":"https:\/\/lightning.ai\/pages\/#website","url":"https:\/\/lightning.ai\/pages\/","name":"Lightning AI","description":"The platform for teams to build 
AI.","publisher":{"@id":"https:\/\/lightning.ai\/pages\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/lightning.ai\/pages\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/lightning.ai\/pages\/#organization","name":"Lightning AI","url":"https:\/\/lightning.ai\/pages\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/lightning.ai\/pages\/#\/schema\/logo\/image\/","url":"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/02\/image-17.png","contentUrl":"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/02\/image-17.png","width":1744,"height":856,"caption":"Lightning AI"},"image":{"@id":"https:\/\/lightning.ai\/pages\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/x.com\/LightningAI"]},{"@type":"Person","@id":"https:\/\/lightning.ai\/pages\/#\/schema\/person\/d53c9386be275d278c59022570c0d859","name":"Lightning.ai","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/lightning.ai\/pages\/#\/schema\/person\/image\/","url":"https:\/\/secure.gravatar.com\/avatar\/b75fef9be69cb600f385dfba5525cf77?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/b75fef9be69cb600f385dfba5525cf77?s=96&d=mm&r=g","caption":"Lightning.ai"},"url":"https:\/\/lightning.ai\/pages\/author\/lightning-ai\/"}]}},"_links":{"self":[{"href":"https:\/\/lightning.ai\/pages\/wp-json\/wp\/v2\/posts\/5648236"}],"collection":[{"href":"https:\/\/lightning.ai\/pages\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/lightning.ai\/pages\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/lightning.ai\/pages\/wp-json\/wp\/v2\/users\/39"}],"replies":[{"embeddable":true,"href":"https:\/\/lightning.ai\/pages\/wp-json\/wp\/v2\/comments?post=5648236"}],"version-history":[{"count":0,"href":"ht
tps:\/\/lightning.ai\/pages\/wp-json\/wp\/v2\/posts\/5648236\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/lightning.ai\/pages\/wp-json\/wp\/v2\/media\/5648287"}],"wp:attachment":[{"href":"https:\/\/lightning.ai\/pages\/wp-json\/wp\/v2\/media?parent=5648236"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/lightning.ai\/pages\/wp-json\/wp\/v2\/categories?post=5648236"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/lightning.ai\/pages\/wp-json\/wp\/v2\/tags?post=5648236"},{"taxonomy":"glossary","embeddable":true,"href":"https:\/\/lightning.ai\/pages\/wp-json\/wp\/v2\/glossary?post=5648236"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}