{"id":5647232,"date":"2023-02-08T17:56:47","date_gmt":"2023-02-08T22:56:47","guid":{"rendered":"https:\/\/lightning.ai\/pages\/?p=5647232"},"modified":"2023-07-28T11:26:50","modified_gmt":"2023-07-28T15:26:50","slug":"optimize-inference-scheduler","status":"publish","type":"post","link":"https:\/\/lightning.ai\/pages\/community\/optimize-inference-scheduler\/","title":{"rendered":"Accelerate Serving Stable Diffusion by Optimizing the Inference Scheduler"},"content":{"rendered":"<div class=\"takeaways card-glow p-4 my-4\"><h3 class=\"w-100 d-block\">Key Takeaway<\/h3>Learn how to accelerate serving models with sequential inference steps, like Stable Diffusion.<\/div>\n<p>In this blog post, we demonstrate how we accelerated our serving of diffusion models by up to 18% for higher batch sizes and cover how to leverage expressive Lightning systems to design a new serving strategy for diffusion models.<\/p>\n<p>&nbsp;<\/p>\n<h2>What are diffusion models?<\/h2>\n<p>In <a href=\"https:\/\/arxiv.org\/pdf\/2006.11239.pdf\"><em>Denoising Diffusion Probabilistic Models <\/em><\/a>(2020), Jonathan Ho et al. introduce a new diffusion probabilistic model with a sampling strategy called DDPM. 
During training, random Gaussian noise is gradually added to the input image (destroying the signal), and the model is trained to predict that noise so the original image can be recovered.<img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-5647234\" src=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/02\/Screenshot-2023-01-26-at-08.35.00.png\" alt=\"\" width=\"1476\" height=\"352\" srcset=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/02\/Screenshot-2023-01-26-at-08.35.00.png 1476w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/02\/Screenshot-2023-01-26-at-08.35.00-300x72.png 300w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/02\/Screenshot-2023-01-26-at-08.35.00-1024x244.png 1024w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/02\/Screenshot-2023-01-26-at-08.35.00-300x72@2x.png 600w\" sizes=\"(max-width: 1476px) 100vw, 1476px\" \/><\/p>\n<p>During inference, starting from random Gaussian noise, a sampler uses the trained model to sequentially predict and apply modifications to the noise to uncover the hidden image. In other words, the model is given multiple opportunities to progressively improve upon its prediction. The total number of attempts is a hyperparameter referred to as the number of inference steps. In practice, we&#8217;ve found that 30 inference steps is a good default.<\/p>\n<p>Here are the results where we used the same random seed but varied the number of inference steps from 1 to 30 with the following prompt: <em>astronaut riding a horse, digital art, epic lighting, highly-detailed masterpiece trending HQ<\/em>.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-5647242\" src=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/02\/attempts-vs-output.png\" alt=\"\" width=\"1500\" height=\"1000\" srcset=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/02\/attempts-vs-output.png 1500w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/02\/attempts-vs-output-300x200.png 300w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/02\/attempts-vs-output-1024x683.png 1024w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/02\/attempts-vs-output-300x200@2x.png 600w\" sizes=\"(max-width: 1500px) 100vw, 1500px\" \/><\/p>\n<p>As you can see, the more steps the model is given, the better the final prediction.<\/p>\n<p>If you are interested in learning more about the theory behind diffusion models, we recommend <a href=\"https:\/\/jalammar.github.io\/illustrated-stable-diffusion\/\">The Illustrated Stable Diffusion<\/a> by Jay Alammar and <a href=\"https:\/\/theaisummer.com\/diffusion-models\/\">How diffusion models work: the math from scratch<\/a> by <a href=\"https:\/\/theaisummer.com\/author\/Sergios-Karagiannakos\/\">Sergios Karagiannakos<\/a> and <a href=\"https:\/\/theaisummer.com\/author\/Nikolas-Adaloglou\/\">Nikolas Adaloglou<\/a>.<\/p>\n<p>&nbsp;<\/p>\n<h2>Traditional inference method<\/h2>\n<p>When receiving multiple user requests, the current approach to serving is to group the prompts into a single input called \u201ca batch\u201d and 
run inference through the entire model using that same batch. Usually, the bigger the batch, the faster the inference per input is.<\/p>\n<p>Here is the pseudo-code associated with diffusion model inference. We pass the prompts to a text encoder, then run the encoded text and noisy images through the sampler for the given number of inference steps.<\/p>\n<pre class=\"code-shortcode dark-theme window- collapse-false \" style=\"--height:falsepx\"><code class=\"language-python\">\n\nimgs = ...  # batch of random noise images<br \/>\ntext_conditions = model.text_encoder([prompt_1, prompt_2, ...])\n\nfor _ in range(steps):<br \/>\n    imgs = sampler.step(imgs, text_conditions, ...)\n\nfinal_imgs = imgs\n\n<\/code><div class=\"copy-button\"><button class=\"expand-button\">Expand<\/button><button class=\"copy\">Copy<\/button><\/div><\/pre>\n<p>Here is an example of a 4-step diffusion process with a fixed batch of size 4. Each element within the batch is at the same diffusion step (1, 2, 3, 4).<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-5647243\" src=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/02\/sampler-step-1.png\" alt=\"\" width=\"1800\" height=\"1200\" srcset=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/02\/sampler-step-1.png 1800w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/02\/sampler-step-1-300x200.png 300w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/02\/sampler-step-1-1024x683.png 1024w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/02\/sampler-step-1-1536x1024.png 1536w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/02\/sampler-step-1-300x200@2x.png 600w\" sizes=\"(max-width: 1800px) 100vw, 1800px\" \/><\/p>\n<p>In reality, however, requests to a server aren\u2019t made at the exact same time, and the first requests end up waiting for the next ones to compose a batch. 
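To make the waiting behavior concrete, here is a minimal, framework-free sketch of a server loop that refuses to start until the batch is full. All names and the batch size are illustrative, not the actual serving code:

```python
import asyncio

BATCH_SIZE = 4  # illustrative

async def serve_static_batches(queue: asyncio.Queue, run_diffusion):
    """Collect BATCH_SIZE requests, then run the full diffusion for all of them."""
    while True:
        batch = [await queue.get()]            # the first request waits here...
        while len(batch) < BATCH_SIZE:
            batch.append(await queue.get())    # ...until the batch is full
        results = run_diffusion(batch)         # all inference steps, whole batch
        for request, result in zip(batch, results):
            request["response"].set_result(result)
```

Until `BATCH_SIZE` requests have been enqueued, every earlier caller simply blocks on its unresolved future.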
Put another way, think of this like ordering at a coffee shop: you place your order, but your barista won&#8217;t start making your coffee until three additional people line up behind you and place their orders.<\/p>\n<p>Here is an illustration of the overall process:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-5647245\" src=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/02\/lost-waiting-time.png\" alt=\"\" width=\"1800\" height=\"1200\" srcset=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/02\/lost-waiting-time.png 1800w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/02\/lost-waiting-time-300x200.png 300w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/02\/lost-waiting-time-1024x683.png 1024w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/02\/lost-waiting-time-1536x1024.png 1536w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/02\/lost-waiting-time-300x200@2x.png 600w\" sizes=\"(max-width: 1800px) 100vw, 1800px\" \/><\/p>\n<p>Additionally, once a batch is being processed, any new requests need to wait for the entire diffusion process to complete.<\/p>\n<p>The trade-off between delaying inference to fill the batch and running a partially filled batch must be weighed carefully: the former degrades the user experience, while the latter underutilizes the server.<\/p>\n<p>Here are the logs for the current approach with a batch of 4:<\/p>\n<pre class=\"code-shortcode dark-theme window- collapse-false \" style=\"--height:falsepx\"><code class=\"language-python\">\n\ninputs=[Text(text='astronaut riding a horse, digital art, epic lighting, highly-detailed masterpiece trending HQ')]<br \/>\ninputs=[Text(text='portrait photo of a asia old warrior chief, tribal panther make up, blue on red, side profile')]<br \/>\ninputs=[Text(text='Keanu Reeves portrait photo of a asia old warrior chief, tribal panther 
make up, blue on red, side profile, looking away')]<br \/>\ninputs=[Text(text='astronaut riding a horse, digital art, epic lighting, highly-detailed masterpiece trending HQ')]<br \/>\n[0, 0, 0, 0]<br \/>\n[1, 1, 1, 1]<br \/>\n[2, 2, 2, 2]<br \/>\n[3, 3, 3, 3]<br \/>\n[4, 4, 4, 4]<br \/>\n[5, 5, 5, 5]<br \/>\n[6, 6, 6, 6]<br \/>\n[7, 7, 7, 7]<br \/>\n[8, 8, 8, 8]<br \/>\n[9, 9, 9, 9]<br \/>\n[10, 10, 10, 10]<br \/>\n[11, 11, 11, 11]<br \/>\n[12, 12, 12, 12]<br \/>\n[13, 13, 13, 13]<br \/>\n[14, 14, 14, 14]<br \/>\n[15, 15, 15, 15]<br \/>\n[16, 16, 16, 16]<br \/>\n[17, 17, 17, 17]<br \/>\n[18, 18, 18, 18]<br \/>\n[19, 19, 19, 19]<br \/>\n[20, 20, 20, 20]<br \/>\n[21, 21, 21, 21]<br \/>\n[22, 22, 22, 22]<br \/>\n[23, 23, 23, 23]<br \/>\n[24, 24, 24, 24]<br \/>\n[25, 25, 25, 25]<br \/>\n[26, 26, 26, 26]<br \/>\n[27, 27, 27, 27]<br \/>\n[28, 28, 28, 28]<br \/>\n[29, 29, 29, 29]<br \/>\n[Response: ...]<br \/>\n[Response: ...]<br \/>\n[Response: ...]<br \/>\n[Response: ...]<br \/>\n...\n\n<\/code><div class=\"copy-button\"><button class=\"expand-button\">Expand<\/button><button class=\"copy\">Copy<\/button><\/div><\/pre>\n<p>To improve both user experience and server utilization, we devised a new serving method that leverages the sequential nature of the diffusion process.<\/p>\n<p>&nbsp;<\/p>\n<h3>Leveraging the diffusion process to accelerate inference<\/h3>\n<p>One approach to accelerating serving is to rely on the sequential behavior of diffusion models. Rather than using a fixed batch size for the n steps of the diffusion process, we can dynamically adapt the size of the batch at every sampler step depending on the number of pending requests. To do this, not only does the batch size need to change dynamically, but we also need to keep track of the progress step associated with each element.<\/p>\n<p>Below is an illustration of this novel approach. When a new request is received, it is added to the current batch and processed in real-time. 
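The scheduling idea can be sketched in a few lines of plain Python. Everything here is illustrative pseudocode, not the actual serving code: at every sampler step, newly arrived requests join the batch at step 0, every in-flight element advances one step, and finished elements leave the batch immediately.

```python
def dynamic_batch_schedule(arrivals, total_steps=30):
    """Illustrative sketch of dynamic batching for a sequential diffusion process.

    `arrivals` maps an iteration index to the request ids arriving just before
    that sampler step; names and structure are hypothetical.
    """
    in_flight = {}   # request id -> current diffusion step
    finished = []    # request ids whose responses can be unblocked
    trace = []       # per-iteration snapshot of the batch, like the step logs
    t = 0
    while in_flight or any(k >= t for k in arrivals):
        for rid in arrivals.get(t, []):
            in_flight[rid] = 0          # new request joins the batch mid-run
        for rid in list(in_flight):
            in_flight[rid] += 1         # one sampler step for the whole batch
            if in_flight[rid] == total_steps:
                finished.append(rid)    # diffusion done: respond immediately
                del in_flight[rid]
        trace.append(dict(in_flight))
        t += 1
    return finished, trace
```

With `arrivals = {0: ["r1"], 2: ["r2"]}` and `total_steps=30`, `r1`'s response is unblocked at iteration 29 while `r2` keeps running for two more iterations, so neither request waits on the other.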
If an image has made its way through the entire diffusion process, it is removed from the batch.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-5647246\" src=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/02\/time-to-response-2.png\" alt=\"\" width=\"1200\" height=\"800\" srcset=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/02\/time-to-response-2.png 1200w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/02\/time-to-response-2-300x200.png 300w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/02\/time-to-response-2-1024x683.png 1024w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/02\/time-to-response-2-300x200@2x.png 600w\" sizes=\"(max-width: 1200px) 100vw, 1200px\" \/><\/p>\n<p>Here is another illustration of the process described above with images:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-5647247\" src=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/02\/sampler-step-3.png\" alt=\"\" width=\"1650\" height=\"1100\" srcset=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/02\/sampler-step-3.png 1650w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/02\/sampler-step-3-300x200.png 300w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/02\/sampler-step-3-1024x683.png 1024w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/02\/sampler-step-3-1536x1024.png 1536w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/02\/sampler-step-3-300x200@2x.png 600w\" sizes=\"(max-width: 1650px) 100vw, 1650px\" \/><\/p>\n<p>Here are the logs where we print the inputs, the steps for each sample in the batch, and the responses.<\/p>\n<pre class=\"code-shortcode dark-theme window- collapse-false \" style=\"--height:falsepx\"><code class=\"language-python\">\n\ninputs=[Text(text='astronaut riding a horse, 
digital art, epic lighting, highly-detailed masterpiece trending HQ')]<br \/>\n[0]<br \/>\n[1]<br \/>\n[2]<br \/>\n[3]<br \/>\n[4]<br \/>\n[5]<br \/>\n[6]<br \/>\n[7]<br \/>\n[8]<br \/>\n[9]<br \/>\n[10]<br \/>\n[11]<br \/>\n[12]<br \/>\n[13]<br \/>\n[14]<br \/>\n[15]<br \/>\n[16]<br \/>\n[17]<br \/>\n[18]<br \/>\n[19]<br \/>\n[20]<br \/>\n[21]<br \/>\n[22]<br \/>\n[23]<br \/>\ninputs=[Text(text='portrait photo of a asia old warrior chief, tribal panther make up, blue on red, side profile')]<br \/>\n[24, 0]<br \/>\n[25, 1]<br \/>\n[26, 2]<br \/>\n[27, 3]<br \/>\n[28, 4]<br \/>\ninputs=[Text(text='Keanu Reeves portrait photo of a asia old warrior chief, tribal panther make up, blue on red, side profile, looking away')]<br \/>\n[29, 5, 0]<br \/>\n[Response: ...]<br \/>\n[6, 1]<br \/>\n#\u00a0Note: This is where the previous approach starts.<br \/>\n# Request 1 has finished and 2, 3 have already started.<br \/>\ninputs=[Text(text='portrait photo of a african old warrior chief, tribal panther make up, gold on white, side profile, looking away, serious eyes, 50mm portrait photography')]<br \/>\n[7, 2, 0]<br \/>\n[8, 3, 1]<br \/>\n[9, 4, 2]<br \/>\n[10, 5, 3]<br \/>\n[11, 6, 4]<br \/>\n[12, 7, 5]<br \/>\n[13, 8, 6]<br \/>\n[14, 9, 7]<br \/>\n[15, 10, 8]<br \/>\n[16, 11, 9]<br \/>\n[17, 12, 10]<br \/>\n[18, 13, 11]<br \/>\n[19, 14, 12]<br \/>\n[20, 15, 13]<br \/>\n[21, 16, 14]<br \/>\n[22, 17, 15]<br \/>\n[23, 18, 16]<br \/>\n[24, 19, 17]<br \/>\n[25, 20, 18]<br \/>\n[26, 21, 19]<br \/>\n[27, 22, 20]<br \/>\n[28, 23, 21]<br \/>\n[29, 24, 22]<br \/>\n[Response: ...]<br \/>\n[25, 23]<br \/>\n[26, 24]<br \/>\n[27, 25]<br \/>\n[28, 26]<br \/>\n[29, 27]<br \/>\n[Response: ...]<br \/>\n[28]<br \/>\n[29]<br \/>\n[Response: ...]\n\n<\/code><div class=\"copy-button\"><button class=\"expand-button\">Expand<\/button><button class=\"copy\">Copy<\/button><\/div><\/pre>\n<p>&nbsp;<\/p>\n<p>Below is the code for the process described above where new requests are added dynamically to the 
model predict step. This is how it works:<\/p>\n<div class=\"takeaways card-glow p-4 my-4\"><h3 class=\"w-100 d-block\">Step 1<\/h3>\n<p>For the very first request, a prediction task is created to run inference through the model. Every request is then stored, together with a future for its result, in a dictionary.<\/p>\n<\/div>\n<pre class=\"code-shortcode dark-theme window- collapse-false \" style=\"--height:falsepx\"><code class=\"language-python\">\n\nasync def predict(self, request: BatchText):<br \/>\n    # 1. On the very first request, create the predictor task<br \/>\n    if self._lock is None:<br \/>\n        self._lock = asyncio.Lock()<br \/>\n    if self._predictor_task is None:<br \/>\n        self._predictor_task = asyncio.create_task(self.predict_fn())<br \/>\n    assert len(request.inputs) == 1\n\n    # 2. Create a future for the result<br \/>\n    future = asyncio.Future()\n\n    # 3. Add the request to the requests dictionary<br \/>\n    async with self._lock:<br \/>\n        self._requests[uuid.uuid4().hex] = {<br \/>\n            \"data\": request.inputs[0],<br \/>\n            \"response\": future,<br \/>\n        }\n\n    # 4. Wait for the result to be ready<br \/>\n    result = await future<br \/>\n    return result\n\n<\/code><div class=\"copy-button\"><button class=\"expand-button\">Expand<\/button><button class=\"copy\">Copy<\/button><\/div><\/pre>\n<p>&nbsp;<\/p>\n<div class=\"takeaways card-glow p-4 my-4\"><h3 class=\"w-100 d-block\">Step 2<\/h3>\n<p>The prediction task loops over the available requests and forwards them through the model in the following format, so the model can track each request&#8217;s progress independently via its ID.<\/p>\n<\/div>\n<pre class=\"code-shortcode dark-theme window- collapse-false \" style=\"--height:falsepx\"><code class=\"language-python\">\n\ninputs = {<br \/>\n    \"ID_0\": \"prompt_0\",<br \/>\n    \"ID_1\": \"prompt_1\",<br \/>\n    ...\n\n}\n\n<\/code><div class=\"copy-button\"><button class=\"expand-button\">Expand<\/button><button class=\"copy\">Copy<\/button><\/div><\/pre>\n<p>&nbsp;<\/p>\n<div class=\"takeaways card-glow p-4 my-4\"><h3 class=\"w-100 d-block\">Step 3<\/h3>\n<p>Each model inference step modifies the inputs in-place, replacing the prompt with a batch and sample state used to track the intermediate steps and generated images.<\/p>\n<\/div>\n<pre class=\"code-shortcode dark-theme window- collapse-false \" style=\"--height:falsepx\"><code class=\"language-python\">\n\ninputs = {<br \/>\n    \"ID_0\": {\"img\": ..., \"step\": ...}<br \/>\n    \"ID_1\": {\"img\": ..., \"step\": ...}<br \/>\n    ...<br \/>\n    \"global_state\": 
{\"batch_img\": ..., \"batch_steps\": ...}\n\n}\n\n<\/code><div class=\"copy-button\"><button class=\"expand-button\">Expand<\/button><button class=\"copy\">Copy<\/button><\/div><\/pre>\n<p>&nbsp;<\/p>\n<div class=\"takeaways card-glow p-4 my-4\"><h3 class=\"w-100 d-block\">Step 4<\/h3>\n<p>The states above are stored with each request for the next generation step, and they are updated after every sub-inference. If a result is ready, i.e. an input has finished its diffusion process, it is attached to the response future, unblocking the server response.<\/p>\n<\/div>\n<pre class=\"code-shortcode dark-theme window- collapse-false \" style=\"--height:falsepx\"><code class=\"language-python\">\n\nasync def predict_fn(self):<br \/>\n    while True:<br \/>\n        async with self._lock:<br \/>\n            keys = list(self._requests)\n\n        if len(keys) == 0:<br \/>\n            await asyncio.sleep(0.0001)<br \/>\n            continue\n\n        # Prepare the prompts for the model<br \/>\n        inputs = {<br \/>\n            key: self.sanetize_data(self._requests[key])<br \/>\n            for key in keys<br \/>\n        }<br \/>\n        # Apply the model<br \/>\n        results = self.apply_model(inputs)\n\n        # Keep track of the state of each request<br \/>\n        for key, state in inputs.items():<br \/>\n            if key == \"global_state\":<br \/>\n                self._requests['global_state'] = {\"state\": state}<br \/>\n            else:<br \/>\n                self._requests[key]['state'] = state\n\n        # If any result is available, make the response ready.<br \/>\n        if results:<br \/>\n            for key in results:<br \/>\n                self._requests[key]['response'].set_result(<br \/>\n                    self.sanetize_results(results[key])<br \/>\n                )<br \/>\n                del self._requests[key]\n\n        # Yield control so newly received requests can be registered.<br \/>\n        await asyncio.sleep(0.0001)\n\n<\/code><div class=\"copy-button\"><button class=\"expand-button\">Expand<\/button><button class=\"copy\">Copy<\/button><\/div><\/pre>\n<p>You can explore the source code <a class=\"notion-link-token notion-enable-hover\" 
href=\"https:\/\/github.com\/Lightning-AI\/DiffusionWithAutoscaler\/blob\/main\/app_looping.py#L85\" target=\"_blank\" rel=\"noopener noreferrer\" data-token-index=\"1\"><span class=\"link-annotation-unknown-block-id-80269320\">here<\/span><\/a>. Additionally, the model inference step tracks each element as it progresses through the diffusion steps within the batch. You can explore the new model inference step source code <a class=\"notion-link-token notion-enable-hover\" href=\"https:\/\/github.com\/Lightning-AI\/stablediffusion\/blob\/lit\/ldm\/lightning.py#L136\" target=\"_blank\" rel=\"noopener noreferrer\" data-token-index=\"3\"><span class=\"link-annotation-unknown-block-id--1529254727\">here<\/span><\/a>.<\/p>\n<p>&nbsp;<\/p>\n<h3>Benchmarking Serving Strategies<\/h3>\n<p>To benchmark our new serving strategy, we deployed both versions on <a href=\"https:\/\/lightning.ai\/\">lightning.ai<\/a>, covering both T4 and A10 GPU machines. You can find the scripts <a href=\"https:\/\/github.com\/Lightning-AI\/DiffusionWithAutoscaler\/blob\/main\/app.py\">here<\/a> and <a href=\"https:\/\/github.com\/Lightning-AI\/DiffusionWithAutoscaler\/blob\/main\/app_looping.py\">here<\/a>, respectively.<\/p>\n<p>We then deployed a <a href=\"https:\/\/github.com\/locustio\/locust\">Locust<\/a> server that creates multiple HTTP users to load test the servers and collect benchmarks. 
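As a rough stand-in for the Locust setup, a load test boils down to firing many concurrent HTTP requests at the serving endpoint. Here is a stdlib-only sketch; the URL, route, and payload shape are assumptions for illustration, not the actual benchmark code:

```python
import json
import threading
import urllib.request

def fire_request(url, prompt):
    # POST one prompt to a (hypothetical) prediction route; return the HTTP status.
    payload = json.dumps({"text": prompt}).encode()
    req = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status

def load_test(url, prompts, concurrency=8):
    # Fire up to `concurrency` requests at once, roughly like concurrent Locust users.
    threads = [
        threading.Thread(target=fire_request, args=(url, p))
        for p in prompts[:concurrency]
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

Locust additionally collects latency percentiles and throughput over time, which is why we used it for the published numbers.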
The code for that is <a href=\"https:\/\/github.com\/Lightning-AI\/DiffusionWithAutoscaler\/blob\/main\/loadtest\/app.py\">here<\/a>.<\/p>\n<p>The new approach resulted in speedups ranging from 3-12.8% on A10 and 2-18.5% on T4.<\/p>\n<p>Here are the benchmarks with an A10 GPU:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-5647240\" src=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/02\/Screenshot-2023-01-24-at-17.20.19.png\" alt=\"\" width=\"1556\" height=\"638\" srcset=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/02\/Screenshot-2023-01-24-at-17.20.19.png 1556w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/02\/Screenshot-2023-01-24-at-17.20.19-300x123.png 300w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/02\/Screenshot-2023-01-24-at-17.20.19-1024x420.png 1024w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/02\/Screenshot-2023-01-24-at-17.20.19-1536x630.png 1536w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/02\/Screenshot-2023-01-24-at-17.20.19-300x123@2x.png 600w\" sizes=\"(max-width: 1556px) 100vw, 1556px\" \/><\/p>\n<p>Here are the benchmarks with a T4 GPU:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-5647241\" src=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/02\/Screenshot-2023-01-24-at-17.18.16.png\" alt=\"\" width=\"1550\" height=\"380\" srcset=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/02\/Screenshot-2023-01-24-at-17.18.16.png 1550w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/02\/Screenshot-2023-01-24-at-17.18.16-300x74.png 300w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/02\/Screenshot-2023-01-24-at-17.18.16-1024x251.png 1024w, https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/02\/Screenshot-2023-01-24-at-17.18.16-1536x377.png 1536w, 
https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/02\/Screenshot-2023-01-24-at-17.18.16-300x74@2x.png 600w\" sizes=\"(max-width: 1550px) 100vw, 1550px\" \/><\/p>\n<p>&nbsp;<\/p>\n<h3>Benchmark it yourself for free<\/h3>\n<ol>\n<li>Create a Lightning account and get $30 USD worth of <span class=\"mui_tooltip wrapped\"><span class=\"tooltip_wrap\">credits<img decoding=\"async\" class=\"ml-1\" width=\"12.5\" height=\"12.5\" alt=\"tooltip icon\" src=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/themes\/lightning-wp\/assets\/images\/tooltip.svg\"><span class=\"tooltip_content\">Lightning Credits are used to pay for cloud compute<\/span><\/span><\/span> for free.<\/li>\n<li>Duplicate the <a href=\"https:\/\/lightning.ai\/app\/fcUubSZ99Q\"><strong>Autoscaled Stable Diffusion Server<\/strong><\/a> Recipe on your Lightning account.<\/li>\n<li>Use the <a class=\"notion-link-token notion-enable-hover\" href=\"https:\/\/github.com\/Lightning-AI\/DiffusionWithAutoscaler\" target=\"_blank\" rel=\"noopener noreferrer\" data-token-index=\"1\"><span class=\"link-annotation-unknown-block-id-1357210152\">DiffusionWithAutoscaler<\/span><\/a> GitHub repository to replicate the benchmark.<\/li>\n<\/ol>\n<p>&nbsp;<\/p>\n<div class=\"takeaways card-glow p-4 my-4\"><h3 class=\"w-100 d-block\">We want to hear from you!<\/h3> We&#8217;re always looking to improve Lightning alongside the people using it every single day to build ML. If you have questions, feedback, or want to connect with our team, reach out via talktous@lightning.ai or on <a href=\"https:\/\/discord.gg\/XncpTy7DSt\">our Discord<\/a>. <\/div>\n","protected":false},"excerpt":{"rendered":"<p>In this blog post, we demonstrate how we accelerated our serving of diffusion models by up to 18% for higher batch sizes and cover how to leverage expressive Lightning systems to design a new serving strategy for diffusion models. &nbsp; What are diffusion models? 
In Denoising Diffusion Probabilistic Models (2020), Jonathan Ho et al. introduce<a class=\"excerpt-read-more\" href=\"https:\/\/lightning.ai\/pages\/community\/optimize-inference-scheduler\/\" title=\"ReadAccelerate Serving Stable Diffusion by Optimizing the Inference Scheduler\">&#8230; Read more &raquo;<\/a><\/p>\n","protected":false},"author":38,"featured_media":5647248,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"inline_featured_image":false,"footnotes":"","_links_to":"","_links_to_target":""},"categories":[106,41],"tags":[96,97,141,114],"glossary":[],"acf":{"additional_authors":false,"hide_from_archive":false,"content_type":"Blog Post","custom_styles":"","mathjax":false,"default_editor":true,"show_table_of_contents":false,"sticky":false},"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v24.5 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Accelerate Serving Stable Diffusion by Optimizing the Inference Scheduler<\/title>\n<meta name=\"description\" content=\"In this blog post, we demonstrate how we accelerated our serving of diffusion models by up to 18% for higher batch sizes.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/lightning.ai\/pages\/community\/optimize-inference-scheduler\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"Accelerate Serving Stable Diffusion by Optimizing the Inference Scheduler\" \/>\n<meta property=\"og:description\" content=\"In this blog post, we demonstrate how we accelerated our serving of diffusion models by up to 18% for higher batch sizes.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/lightning.ai\/pages\/community\/optimize-inference-scheduler\/\" \/>\n<meta property=\"og:site_name\" content=\"Lightning AI\" 
\/>\n<meta property=\"article:published_time\" content=\"2023-02-08T22:56:47+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2023-07-28T15:26:50+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/02\/Sequential-featured.png\" \/>\n\t<meta property=\"og:image:width\" content=\"1595\" \/>\n\t<meta property=\"og:image:height\" content=\"825\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"Thomas Chaton\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:creator\" content=\"@LightningAI\" \/>\n<meta name=\"twitter:site\" content=\"@LightningAI\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Thomas Chaton\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"10 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/lightning.ai\/pages\/community\/optimize-inference-scheduler\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/lightning.ai\/pages\/community\/optimize-inference-scheduler\/\"},\"author\":{\"name\":\"Thomas Chaton\",\"@id\":\"https:\/\/lightning.ai\/pages\/#\/schema\/person\/a5c2133ac25a788147b115979a5fc2bf\"},\"headline\":\"Accelerate Serving Stable Diffusion by Optimizing the Inference 
Scheduler\",\"datePublished\":\"2023-02-08T22:56:47+00:00\",\"dateModified\":\"2023-07-28T15:26:50+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/lightning.ai\/pages\/community\/optimize-inference-scheduler\/\"},\"wordCount\":1575,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\/\/lightning.ai\/pages\/#organization\"},\"image\":{\"@id\":\"https:\/\/lightning.ai\/pages\/community\/optimize-inference-scheduler\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/02\/Sequential-featured.png\",\"keywords\":[\"ai\",\"ml\",\"model serving\",\"stable diffusion\"],\"articleSection\":[\"Community\",\"Tutorials\"],\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\/\/lightning.ai\/pages\/community\/optimize-inference-scheduler\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/lightning.ai\/pages\/community\/optimize-inference-scheduler\/\",\"url\":\"https:\/\/lightning.ai\/pages\/community\/optimize-inference-scheduler\/\",\"name\":\"Accelerate Serving Stable Diffusion by Optimizing the Inference Scheduler\",\"isPartOf\":{\"@id\":\"https:\/\/lightning.ai\/pages\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/lightning.ai\/pages\/community\/optimize-inference-scheduler\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/lightning.ai\/pages\/community\/optimize-inference-scheduler\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/02\/Sequential-featured.png\",\"datePublished\":\"2023-02-08T22:56:47+00:00\",\"dateModified\":\"2023-07-28T15:26:50+00:00\",\"description\":\"In this blog post, we demonstrate how we accelerated our serving of diffusion models by up to 18% for higher batch 
sizes.\",\"breadcrumb\":{\"@id\":\"https:\/\/lightning.ai\/pages\/community\/optimize-inference-scheduler\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/lightning.ai\/pages\/community\/optimize-inference-scheduler\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/lightning.ai\/pages\/community\/optimize-inference-scheduler\/#primaryimage\",\"url\":\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/02\/Sequential-featured.png\",\"contentUrl\":\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/02\/Sequential-featured.png\",\"width\":1595,\"height\":825},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/lightning.ai\/pages\/community\/optimize-inference-scheduler\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/lightning.ai\/pages\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"Accelerate Serving Stable Diffusion by Optimizing the Inference Scheduler\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/lightning.ai\/pages\/#website\",\"url\":\"https:\/\/lightning.ai\/pages\/\",\"name\":\"Lightning AI\",\"description\":\"The platform for teams to build AI.\",\"publisher\":{\"@id\":\"https:\/\/lightning.ai\/pages\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/lightning.ai\/pages\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/lightning.ai\/pages\/#organization\",\"name\":\"Lightning 
AI\",\"url\":\"https:\/\/lightning.ai\/pages\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/lightning.ai\/pages\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/02\/image-17.png\",\"contentUrl\":\"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/02\/image-17.png\",\"width\":1744,\"height\":856,\"caption\":\"Lightning AI\"},\"image\":{\"@id\":\"https:\/\/lightning.ai\/pages\/#\/schema\/logo\/image\/\"},\"sameAs\":[\"https:\/\/x.com\/LightningAI\"]},{\"@type\":\"Person\",\"@id\":\"https:\/\/lightning.ai\/pages\/#\/schema\/person\/a5c2133ac25a788147b115979a5fc2bf\",\"name\":\"Thomas Chaton\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/lightning.ai\/pages\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/e8a8ea2ae1fd0f2d476f8bc75e195b3d?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/e8a8ea2ae1fd0f2d476f8bc75e195b3d?s=96&d=mm&r=g\",\"caption\":\"Thomas Chaton\"},\"url\":\"https:\/\/lightning.ai\/pages\/author\/thomaschaton\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"Accelerate Serving Stable Diffusion by Optimizing the Inference Scheduler","description":"In this blog post, we demonstrate how we accelerated our serving of diffusion models by up to 18% for higher batch sizes.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/lightning.ai\/pages\/community\/optimize-inference-scheduler\/","og_locale":"en_US","og_type":"article","og_title":"Accelerate Serving Stable Diffusion by Optimizing the Inference Scheduler","og_description":"In this blog post, we demonstrate how we accelerated our serving of diffusion models by up to 18% for higher batch sizes.","og_url":"https:\/\/lightning.ai\/pages\/community\/optimize-inference-scheduler\/","og_site_name":"Lightning AI","article_published_time":"2023-02-08T22:56:47+00:00","article_modified_time":"2023-07-28T15:26:50+00:00","og_image":[{"width":1595,"height":825,"url":"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/02\/Sequential-featured.png","type":"image\/png"}],"author":"Thomas Chaton","twitter_card":"summary_large_image","twitter_creator":"@LightningAI","twitter_site":"@LightningAI","twitter_misc":{"Written by":"Thomas Chaton","Est. 
reading time":"10 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/lightning.ai\/pages\/community\/optimize-inference-scheduler\/#article","isPartOf":{"@id":"https:\/\/lightning.ai\/pages\/community\/optimize-inference-scheduler\/"},"author":{"name":"Thomas Chaton","@id":"https:\/\/lightning.ai\/pages\/#\/schema\/person\/a5c2133ac25a788147b115979a5fc2bf"},"headline":"Accelerate Serving Stable Diffusion by Optimizing the Inference Scheduler","datePublished":"2023-02-08T22:56:47+00:00","dateModified":"2023-07-28T15:26:50+00:00","mainEntityOfPage":{"@id":"https:\/\/lightning.ai\/pages\/community\/optimize-inference-scheduler\/"},"wordCount":1575,"commentCount":0,"publisher":{"@id":"https:\/\/lightning.ai\/pages\/#organization"},"image":{"@id":"https:\/\/lightning.ai\/pages\/community\/optimize-inference-scheduler\/#primaryimage"},"thumbnailUrl":"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/02\/Sequential-featured.png","keywords":["ai","ml","model serving","stable diffusion"],"articleSection":["Community","Tutorials"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/lightning.ai\/pages\/community\/optimize-inference-scheduler\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/lightning.ai\/pages\/community\/optimize-inference-scheduler\/","url":"https:\/\/lightning.ai\/pages\/community\/optimize-inference-scheduler\/","name":"Accelerate Serving Stable Diffusion by Optimizing the Inference 
Scheduler","isPartOf":{"@id":"https:\/\/lightning.ai\/pages\/#website"},"primaryImageOfPage":{"@id":"https:\/\/lightning.ai\/pages\/community\/optimize-inference-scheduler\/#primaryimage"},"image":{"@id":"https:\/\/lightning.ai\/pages\/community\/optimize-inference-scheduler\/#primaryimage"},"thumbnailUrl":"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/02\/Sequential-featured.png","datePublished":"2023-02-08T22:56:47+00:00","dateModified":"2023-07-28T15:26:50+00:00","description":"In this blog post, we demonstrate how we accelerated our serving of diffusion models by up to 18% for higher batch sizes.","breadcrumb":{"@id":"https:\/\/lightning.ai\/pages\/community\/optimize-inference-scheduler\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/lightning.ai\/pages\/community\/optimize-inference-scheduler\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/lightning.ai\/pages\/community\/optimize-inference-scheduler\/#primaryimage","url":"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/02\/Sequential-featured.png","contentUrl":"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/02\/Sequential-featured.png","width":1595,"height":825},{"@type":"BreadcrumbList","@id":"https:\/\/lightning.ai\/pages\/community\/optimize-inference-scheduler\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/lightning.ai\/pages\/"},{"@type":"ListItem","position":2,"name":"Accelerate Serving Stable Diffusion by Optimizing the Inference Scheduler"}]},{"@type":"WebSite","@id":"https:\/\/lightning.ai\/pages\/#website","url":"https:\/\/lightning.ai\/pages\/","name":"Lightning AI","description":"The platform for teams to build 
AI.","publisher":{"@id":"https:\/\/lightning.ai\/pages\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/lightning.ai\/pages\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/lightning.ai\/pages\/#organization","name":"Lightning AI","url":"https:\/\/lightning.ai\/pages\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/lightning.ai\/pages\/#\/schema\/logo\/image\/","url":"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/02\/image-17.png","contentUrl":"https:\/\/lightningaidev.wpengine.com\/wp-content\/uploads\/2023\/02\/image-17.png","width":1744,"height":856,"caption":"Lightning AI"},"image":{"@id":"https:\/\/lightning.ai\/pages\/#\/schema\/logo\/image\/"},"sameAs":["https:\/\/x.com\/LightningAI"]},{"@type":"Person","@id":"https:\/\/lightning.ai\/pages\/#\/schema\/person\/a5c2133ac25a788147b115979a5fc2bf","name":"Thomas Chaton","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/lightning.ai\/pages\/#\/schema\/person\/image\/","url":"https:\/\/secure.gravatar.com\/avatar\/e8a8ea2ae1fd0f2d476f8bc75e195b3d?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/e8a8ea2ae1fd0f2d476f8bc75e195b3d?s=96&d=mm&r=g","caption":"Thomas 
Chaton"},"url":"https:\/\/lightning.ai\/pages\/author\/thomaschaton\/"}]}},"_links":{"self":[{"href":"https:\/\/lightning.ai\/pages\/wp-json\/wp\/v2\/posts\/5647232"}],"collection":[{"href":"https:\/\/lightning.ai\/pages\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/lightning.ai\/pages\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/lightning.ai\/pages\/wp-json\/wp\/v2\/users\/38"}],"replies":[{"embeddable":true,"href":"https:\/\/lightning.ai\/pages\/wp-json\/wp\/v2\/comments?post=5647232"}],"version-history":[{"count":0,"href":"https:\/\/lightning.ai\/pages\/wp-json\/wp\/v2\/posts\/5647232\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/lightning.ai\/pages\/wp-json\/wp\/v2\/media\/5647248"}],"wp:attachment":[{"href":"https:\/\/lightning.ai\/pages\/wp-json\/wp\/v2\/media?parent=5647232"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/lightning.ai\/pages\/wp-json\/wp\/v2\/categories?post=5647232"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/lightning.ai\/pages\/wp-json\/wp\/v2\/tags?post=5647232"},{"taxonomy":"glossary","embeddable":true,"href":"https:\/\/lightning.ai\/pages\/wp-json\/wp\/v2\/glossary?post=5647232"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}