diff --git a/src/content/Blog/running-vllm-on-akash/index.md b/src/content/Blog/running-vllm-on-akash/index.md
index 982d0090..9ef47879 100644
--- a/src/content/Blog/running-vllm-on-akash/index.md
+++ b/src/content/Blog/running-vllm-on-akash/index.md
@@ -26,7 +26,7 @@ contributors:
 bannerImage: ./banner-image2.png
 ---
-*By [Logan Cerkovnik](https://www.twitter.com/ThumperAI) & [Anil Murty](https://twitter.com/_Anil_Murty_)*
+By [Logan Cerkovnik](https://www.twitter.com/ThumperAI) & [Anil Murty](https://twitter.com/_Anil_Murty_)
 
 There has been a proliferation of LLM services over the last several months, and it’s great to see some of them made available as open source. Ollama is one of the early solutions that gained significant popularity and has helped many developers accelerate their AI application development using open source AI models. A more recent solution is vLLM, which aims to overcome some of the limitations of Ollama. This post delves into what vLLM is and when and why developers should consider using it. Lastly, it demonstrates how you can run vLLM easily on Akash Network.
@@ -35,7 +35,7 @@ There has been a proliferation of LLM services over the last several months and
 ## Delving into vLLM
 vLLM is an LLM server implementation first introduced in a paper last year. Its primary objective was to make LLM inference faster for multiuser services, and in achieving that it overcomes some of the limitations of Ollama. vLLM serves more than one user at a time natively, without having to proxy user requests between multiple GPUs, and it delivers somewhere between 2-4x the throughput of Ollama for concurrent requests.
-The main change vLLM makes is adding Paged Attention to an LLM model by swapping out all the transformer attention modules for Paged Attention which implements attention more efficiently. 
The authors of vLLM describe Page Attention as, “Paged Attention’s memory sharing greatly reduces the memory overhead of complex sampling algorithms, such as parallel sampling and beam search, cutting their memory usage by up to 55%. This can translate into up to 2.2x improvement in throughput”. You can read more about the technical details of paged attention on the vLLM blog at https://blog.vLLM.ai/2023/06/20/vLLM.html. The current server implementation has gone beyond just Paged Attention and will soon support speculative encoding approaches. Other open source alternatives to vLLM include HuggingFace’s TGI and the sglang engine with its Radix Attention implementation. The only drawback to using vLLM is that it doesn’t support all of the super low quantization methods and file formats such as GGUF. If you haven’t used GGUF before in llama.cpp-based tools like ollama then you should note that most people actively try to avoid using models with quantization lower than the 4bit (Q4) quantization due to performance issues. The good news is that most models are available in GPTQ or AWQ quantization formats that are supported by vLLM.
+The main change vLLM makes is adding Paged Attention to an LLM model by swapping out all of the transformer attention modules for Paged Attention, which implements attention more efficiently. The authors of vLLM describe Paged Attention as, “Paged Attention’s memory sharing greatly reduces the memory overhead of complex sampling algorithms, such as parallel sampling and beam search, cutting their memory usage by up to 55%. This can translate into up to 2.2x improvement in throughput”. You can read more about the technical details of paged attention in the [official vLLM blog post](https://blog.vllm.ai/2023/06/20/vllm.html). The current server implementation has gone beyond just Paged Attention and will soon support speculative decoding approaches. 
Other open source alternatives to vLLM include Hugging Face’s TGI and the SGLang engine with its RadixAttention implementation. The only drawback to using vLLM is that it doesn’t support all of the very low-bit quantization methods and file formats, such as GGUF. If you haven’t used GGUF before in llama.cpp-based tools like Ollama, you should note that most people actively avoid models quantized below 4-bit (Q4) due to performance issues. The good news is that most models are available in the GPTQ or AWQ quantization formats, which vLLM supports.
 
 ![Benchmarks](./imgs/benchmarks.png)
 
@@ -43,17 +43,17 @@ At the time of the original paper (June 2023), vLLM dramatically outperformed TG
 ![Compare](./imgs/compare.png)
 
-## Links
+### Learn more about vLLM
 * [vLLM Docs](https://docs.vllm.ai)
 * [vLLM repo](https://github.com/vllm-project/vllm)
-* [In depth comparison study](https://pages.run.ai/hubfs/PDFs/Serving-Large-Language-Models-Run-ai-Benchmarking-Study.pdf)
+* [In-depth comparison study](https://pages.run.ai/hubfs/PDFs/Serving-Large-Language-Models-Run-ai-Benchmarking-Study.pdf)
 
 ## Preparation
-* Create an Akash account and ensure you have AKT tokens.
-* Login to console.akash.network with your wallet to launch an instance with an SDL (YAML) found in vLLM folder of the awesome akash [repo](https://github.com/akash-network/awesome-akash/vLLM/)
+1. Create an Akash account and ensure you have AKT tokens.
+2. Log in to console.akash.network with your wallet to launch an instance with an SDL (YAML) found in the vLLM folder of the [Awesome-Akash repository](https://github.com/akash-network/awesome-akash/vLLM/).
 
 ## Containerization
-We are going to use the latest official vLLM container image: `vLLM/vLLM-openai:v0.4.0.post1` .
+We are going to use the latest official vLLM container image: `vllm/vllm-openai:v0.4.0.post1`.
 You can also build your own image using the Dockerfile in the root of the vLLM repo. 
@@ -62,13 +62,12 @@ Note: you should never use latest as a tag for your containers in Akash SDL and
 ## Deployment
 1. **Create a Deployment Configuration**: Create a YAML file for your vLLM deployment, including Docker configurations, resource requirements, and port exposures. See the example below, which you should be able to copy and paste into Akash Console.
 2. **Deploy**: Use Akash Console to deploy your application; it matches you with a suitable provider based on your deployment specifications.
-3. **Use LLM UI** : After deployment, utilize the Akash Console field to find the IP address of the service and you should be good to go.
-4. **Use LLM API** : After deployment, utilize the Akash Console field to find the IP address of the vLLM service and add the URI and API key variables to whichever client you are using.
- E.g. "http://localhost:8000/v1"
+3. **Use LLM UI**: After deployment, use Akash Console to find the IP address of the service, and you should be good to go.
+4. **Use LLM API**: After deployment, use Akash Console to find the IP address of the vLLM service and add the URI and API key variables to whichever client you are using (e.g. "http://localhost:8000/v1").
 
-You can find an example of using CrewAI in the vLLM_crew_notebook_deployment.yml
+You can find an example of using CrewAI in `vLLM_crew_notebook_deployment.yml`.
 
-Below is a code snippet using the llm with Langchain in Python. Tool calling should be supported pretty well by any model that is as performant as WizardLM2 7B or better.
+Below is a code snippet using the LLM with LangChain in Python. Tool calling should be supported reasonably well by any model that is at least as performant as WizardLM2-7B.
 
 ```python
 import os
@@ -91,12 +90,9 @@ print(llm.invoke("Rome is"))
 ```
 
- The vLLM server is designed to be compatible with the OpenAI API, allowing you to use it as a drop-in replacement for applications using the OpenAI API.
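Because the server is a drop-in replacement for the OpenAI API, any OpenAI-style client can talk to it by pointing the base URL at your deployment. Below is a minimal sketch using only the Python standard library; the base URL, model id, and API key are placeholders for whatever your Akash lease actually exposes, not values from this guide:

```python
import json
import urllib.request

def build_chat_request(base_url: str, model: str, prompt: str):
    """Build the URL and JSON payload for an OpenAI-style chat completion call."""
    url = base_url.rstrip("/") + "/chat/completions"
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 64,
    }
    return url, payload

# Placeholder values: use the URI shown in Akash Console for your lease,
# and the model id your deployment actually serves.
url, payload = build_chat_request("http://localhost:8000/v1", "microsoft/wizardlm-2-7b", "Rome is")
print(url)  # http://localhost:8000/v1/chat/completions

# To call a live deployment, uncomment:
# req = urllib.request.Request(
#     url,
#     data=json.dumps(payload).encode(),
#     headers={"Content-Type": "application/json", "Authorization": "Bearer YOUR-API-KEY"},
# )
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

The same base URL works with the official `openai` client or LangChain, as in the snippet above, by overriding the client's base URL and API key.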
-This Repo contains 4 example vLLM YAMLs
-One example without a user interface and 3 with the awesome openwebui tool
-
+This repository contains 4 example vLLM YAMLs, one without a user interface and three with the OpenWebUI tool:
 * vLLM_no_ui_deployment.yml, a basic example without a user interface
 * vLLM_with_openwebui_dolphin2-9-llama3-70b.yml
 * vLLM_with_openwebui_wizardlm2-7b.yml
@@ -112,7 +108,6 @@ The vLLM server supports the following OpenAI API endpoints:
 Sizing LLM server resources for a particular application can be challenging because of the combined impact of model choice, quantization of that model, GPU hardware, and usage pattern (human vs. agent). Anyscale (the company behind Ray) has released a great LLM benchmarking tool called llmperf that is worth using to benchmark your use case against your specific application requirements. Aside from using this tool, it has been reported that a single Nvidia A100 GPU can support between 10-20 concurrent users for a 7B-parameter Mistral model with AWQ on vLLM, with lower throughput for other server options. Larger models will have lower throughput. There are also significant performance improvements going from 1 to 2 GPUs in a vLLM server, but this effect diminishes rapidly.
 ## Troubleshooting
-
 If you don't see the model in the OpenWebUI dropdown, it usually means you chose a model that is too large for the GPU, too large for the disk space, or has a bad Hugging Face repo name entered into the deployment. You can usually tell which issue it is from the logs. You will usually have to redeploy the deployment to change these parameters.
 Steps to Troubleshoot
@@ -128,7 +123,6 @@ Steps to Troubleshoot
 10. If you have checked all of these and still have problems, then open an issue in the awesome-akash repo and tag @rakataprime. In the issue, please provide your logs and the deployment used, with the Hugging Face token and other secrets set to XXXXXXXXXXXX.
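When troubleshooting a missing model, one quick check is to query the server's OpenAI-style `/v1/models` endpoint, which lists only the models the server actually loaded. The helper below parses such a response; the `example` payload is an assumed illustration of the response shape, and the model id in it is a placeholder:

```python
def served_model_ids(models_response: dict) -> list:
    """Extract model ids from an OpenAI-style GET /v1/models response."""
    return [entry["id"] for entry in models_response.get("data", [])]

# Illustrative response shape (assumed, not captured from a real server):
example = {
    "object": "list",
    "data": [{"id": "microsoft/wizardlm-2-7b", "object": "model"}],
}
print(served_model_ids(example))  # ['microsoft/wizardlm-2-7b']

# Against a live deployment (substitute your lease URI):
# import json, urllib.request
# with urllib.request.urlopen("http://localhost:8000/v1/models") as resp:
#     print(served_model_ids(json.load(resp)))
```

If the list comes back empty while the container is running, the model most likely failed to load, and the deployment logs should say why.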
 ## Choosing The Right GPU for An LLM Model
-
 - vLLM-supported file formats include GPTQ, AWQ, GGML (SqueezeLLM), and PyTorch .pth/.bin and safetensors files
 - vLLM DOES NOT SUPPORT GGUF files
 - Check to make sure the base foundation model is supported [here](https://docs.vllm.ai/en/latest/models/supported_models.html)
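As a rough rule of thumb when matching a model to a GPU, the weights alone need about parameter count × bits per parameter ÷ 8 bytes, before any KV cache or activation memory. The sketch below is a back-of-the-envelope estimator, not a vLLM utility:

```python
def weight_memory_gb(n_params_billion: float, bits_per_param: int) -> float:
    """Approximate GPU memory needed just to hold the model weights."""
    bytes_total = n_params_billion * 1e9 * bits_per_param / 8
    return bytes_total / 1e9

# A 7B model at fp16 needs ~14 GB for weights alone -- too big for a 12 GB card.
print(round(weight_memory_gb(7, 16), 1))  # 14.0
# The same model with 4-bit AWQ/GPTQ quantization needs ~3.5 GB for weights.
print(round(weight_memory_gb(7, 4), 1))   # 3.5
```

Leave headroom beyond the weights: vLLM preallocates GPU memory for the Paged Attention KV cache, so a card whose VRAM only just fits the weights will fail to load the model.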