LMDeploy offers features such as model quantization, offline batch inference, and online serving. Each can be used with just a few lines of code or a few commands.
Install lmdeploy with pip (Python 3.8+) or from source:
pip install lmdeploy
The default prebuilt package is compiled against CUDA 12. If you need a CUDA 11 build instead, install the cu118 (CUDA 11.8) wheel as follows:
export LMDEPLOY_VERSION=0.3.0
export PYTHON_VERSION=38
pip install https://github.com/InternLM/lmdeploy/releases/download/v${LMDEPLOY_VERSION}/lmdeploy-${LMDEPLOY_VERSION}+cu118-cp${PYTHON_VERSION}-cp${PYTHON_VERSION}-manylinux2014_x86_64.whl --extra-index-url https://download.pytorch.org/whl/cu118
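The wheel URL above follows a fixed pattern built from the two exported variables. As a minimal sketch, the same URL can be assembled in Python. The helper names `cu118_wheel_url` and `current_python_tag` are hypothetical and not part of lmdeploy, and a wheel is not necessarily published for every version and Python combination.

```python
import sys


def cu118_wheel_url(lmdeploy_version: str, python_tag: str) -> str:
    """Assemble the cu118 wheel URL using the pattern from the pip command above.

    Hypothetical helper: it only formats a string and does not check
    that the release asset actually exists.
    """
    base = "https://github.com/InternLM/lmdeploy/releases/download"
    wheel = (
        f"lmdeploy-{lmdeploy_version}+cu118-"
        f"cp{python_tag}-cp{python_tag}-manylinux2014_x86_64.whl"
    )
    return f"{base}/v{lmdeploy_version}/{wheel}"


def current_python_tag() -> str:
    """Return the running interpreter's tag, e.g. '38' for Python 3.8."""
    return f"{sys.version_info.major}{sys.version_info.minor}"


if __name__ == "__main__":
    # Print the URL for lmdeploy 0.3.0 and the current interpreter.
    print(cu118_wheel_url("0.3.0", current_python_tag()))
```

The printed URL can be passed to `pip install` together with `--extra-index-url https://download.pytorch.org/whl/cu118`, as in the command above.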
import lmdeploy
pipe = lmdeploy.pipeline("internlm/internlm-chat-7b")
response = pipe(["Hi, pls intro yourself", "Shanghai is"])
print(response)
For more information on inference pipeline parameters, please refer to the pipeline documentation.
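As a sketch of how such parameters might be passed, the example below wraps the pipeline call in a function and supplies sampling settings through `GenerationConfig`. The wrapper name `run_batch` is hypothetical and the values are illustrative, not recommended defaults. It also assumes that each returned response exposes a `text` attribute. Running it requires a GPU environment with lmdeploy installed.

```python
from typing import List


def run_batch(prompts: List[str], model: str = "internlm/internlm-chat-7b") -> List[str]:
    """Run a batch of prompts through an lmdeploy pipeline and return the texts.

    Hypothetical wrapper for illustration only.
    """
    # Imported lazily so this module can be loaded without lmdeploy installed.
    from lmdeploy import GenerationConfig, pipeline

    # Illustrative sampling settings (assumed values, tune for your use case).
    gen_config = GenerationConfig(max_new_tokens=256, top_p=0.8, temperature=0.7)

    pipe = pipeline(model)
    responses = pipe(prompts, gen_config=gen_config)

    # Assumes each response object carries the generated string in `.text`.
    return [response.text for response in responses]


# Example call (needs a GPU and downloads the model on first use):
# texts = run_batch(["Hi, pls intro yourself", "Shanghai is"])
```

Keeping the import inside the function means the definition loads on machines without lmdeploy, which is convenient when the same module also holds CPU-only utilities.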
LMDeploy offers various serving methods. Choose the one that best meets your requirements.
LMDeploy provides the following quantization methods. Please visit the following links for the detailed guides.
The LMDeploy CLI offers the following utilities to help users try out LLM features conveniently:
lmdeploy chat internlm/internlm-chat-7b
LMDeploy uses Gradio to build its online demo.
# install dependencies
pip install lmdeploy[serve]
# launch gradio server
lmdeploy serve gradio internlm/internlm-chat-7b