Support ollama
qinguoyi committed Nov 3, 2024
1 parent 828ac74 commit 05078e4
Showing 13 changed files with 166 additions and 19 deletions.
2 changes: 1 addition & 1 deletion README.md
@@ -27,7 +27,7 @@ Easy, advanced inference platform for large language models on Kubernetes
## Features Overview

- **Ease of Use**: People can quickly deploy an LLM service with minimal configurations.
- **Broad Backend Support**: llmaz supports a wide range of advanced inference backends for different scenarios, like [vLLM](https://github.com/vllm-project/vllm), [Text-Generation-Inference](https://github.com/huggingface/text-generation-inference), [SGLang](https://github.com/sgl-project/sglang), [llama.cpp](https://github.com/ggerganov/llama.cpp). Find the full list of supported backends [here](./docs/support-backends.md).
- **Broad Backend Support**: llmaz supports a wide range of advanced inference backends for different scenarios, like [vLLM](https://github.com/vllm-project/vllm), [Text-Generation-Inference](https://github.com/huggingface/text-generation-inference), [SGLang](https://github.com/sgl-project/sglang), [llama.cpp](https://github.com/ggerganov/llama.cpp), [ollama](https://github.com/ollama/ollama). Find the full list of supported backends [here](./docs/support-backends.md).
- **Scaling Efficiency (WIP)**: llmaz works smoothly with autoscaling components like [Cluster-Autoscaler](https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler) or [Karpenter](https://github.com/kubernetes-sigs/karpenter) to support elastic scenarios.
- **Accelerator Fungibility (WIP)**: llmaz supports serving the same LLM with various accelerators to optimize cost and performance.
- **SOTA Inference**: llmaz supports the latest cutting-edge researches like [Speculative Decoding](https://arxiv.org/abs/2211.17192) or [Splitwise](https://arxiv.org/abs/2311.18677)(WIP) to run on Kubernetes.
36 changes: 36 additions & 0 deletions chart/templates/backends/ollama.yaml
@@ -0,0 +1,36 @@
{{- if .Values.backendRuntime.install -}}
apiVersion: inference.llmaz.io/v1alpha1
kind: BackendRuntime
metadata:
  labels:
    app.kubernetes.io/name: backendruntime
    app.kubernetes.io/part-of: llmaz
    app.kubernetes.io/created-by: llmaz
  name: ollama
spec:
  commands:
    - sh
    - -c
  image: ollama/ollama
  version: latest
  # Do not edit the preset argument name unless you know what you're doing.
  # Feel free to add more arguments to meet your requirements.
  args:
    - name: default
      flags:
        - "ollama serve &
          while true; do output=$(ollama list 2>&1);
          if ! echo $output | grep -q 'could not connect to ollama app' && echo $output | grep -q 'NAME';then echo 'ollama is running';break; else echo 'Waiting for the ollama to be running...';sleep 1;fi;done;
          ollama run {{`{{ .ModelName }}`}};
          while true;do sleep 60;done"
  envs:
    - name: OLLAMA_HOST
      value: 0.0.0.0:8080
  resources:
    requests:
      cpu: 2
      memory: 4Gi
    limits:
      cpu: 2
      memory: 4Gi
{{- end }}
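A note on the escaped placeholder above: because this chart file is itself a Helm template, the flag wraps `{{ .ModelName }}` in backticks so that Helm emits the literal placeholder, which llmaz then substitutes at runtime. The following minimal Go sketch only illustrates that escaping behavior (Helm templates are Go text/template; the template string here is illustrative, not taken from the chart):

package main

import (
    "os"
    "text/template"
)

func main() {
    // {{`...`}} is a raw-string action: the engine prints the string verbatim,
    // so the inner {{ .ModelName }} survives Helm rendering untouched.
    t := template.Must(template.New("flag").Parse("ollama run {{`{{ .ModelName }}`}}"))
    if err := t.Execute(os.Stdout, nil); err != nil { // prints: ollama run {{ .ModelName }}
        panic(err)
    }
}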
7 changes: 6 additions & 1 deletion docs/examples/README.md
@@ -10,6 +10,7 @@ We provide a set of examples to help you serve large language models, by default
- [Deploy models via SGLang](#deploy-models-via-sglang)
- [Deploy models via llama.cpp](#deploy-models-via-llamacpp)
- [Deploy models via text-generation-inference](#deploy-models-via-tgi)
- [Deploy models via ollama](#deploy-models-via-ollama)
- [Speculative Decoding with vLLM](#speculative-decoding-with-vllm)

### Deploy models from Huggingface
@@ -32,7 +33,7 @@ In theory, if we want to load the `Qwen2-7B` model, which occupies about 14.2 GB

- Alibaba Cloud OSS, see [example](./objstore-oss/) here

> Note: you should set OSS_ACCESS_KEY_ID and OSS_ACCESS_kEY_SECRET first by running `kubectl create secret generic oss-access-secret --from-literal=OSS_ACCESS_KEY_ID=<your ID> --from-literal=OSS_ACCESS_kEY_SECRET=<your secret>`
### Deploy models via SGLang

@@ -46,6 +47,10 @@ By default, we use [vLLM](https://github.com/vllm-project/vllm) as the inference

[text-generation-inference](https://github.com/huggingface/text-generation-inference) is used in production at Hugging Face to power Hugging Chat, the Inference API and Inference Endpoint. See [example](./tgi/) here.

### Deploy models via ollama

[ollama](https://github.com/ollama/ollama) is built on top of llama.cpp and aims at local deployment. See [example](./ollama/) here.

### Speculative Decoding with vLLM

[Speculative Decoding](https://arxiv.org/abs/2211.17192) can improve inference performance efficiently, see [example](./speculative-decoding/vllm/) here.
8 changes: 8 additions & 0 deletions docs/examples/ollama/model.yaml
@@ -0,0 +1,8 @@
apiVersion: llmaz.io/v1alpha1
kind: OpenModel
metadata:
  name: qwen2-0--5b
spec:
  familyName: qwen2
  source:
    uri: OLLAMA://qwen2:0.5b
10 changes: 10 additions & 0 deletions docs/examples/ollama/playground.yaml
@@ -0,0 +1,10 @@
apiVersion: inference.llmaz.io/v1alpha1
kind: Playground
metadata:
  name: qwen2-0--5b
spec:
  replicas: 1
  modelClaim:
    modelName: qwen2-0--5b
  backendRuntimeConfig:
    name: ollama
4 changes: 4 additions & 0 deletions docs/support-backends.md
@@ -14,6 +14,10 @@ If you want to integrate more backends into llmaz, please refer to this [PR](htt

[text-generation-inference](https://github.com/huggingface/text-generation-inference) is a Rust, Python and gRPC server for text generation inference. Used in production at Hugging Face to power Hugging Chat, the Inference API and Inference Endpoint.

## ollama

[ollama](https://github.com/ollama/ollama) runs Llama 3.2, Mistral, Gemma 2, and other large language models. It is built on top of llama.cpp and aims at local deployment.

## vLLM

[vLLM](https://github.com/vllm-project/vllm) is a high-throughput and memory-efficient inference and serving engine for LLMs.
26 changes: 17 additions & 9 deletions pkg/controller_helper/backendruntime.go
@@ -19,6 +19,7 @@ package helper
import (
"fmt"
"regexp"
"strings"

corev1 "k8s.io/api/core/v1"

@@ -94,18 +95,25 @@ func (p *BackendRuntimeParser) Resources() inferenceapi.ResourceRequirements {
func renderFlags(flags []string, modelInfo map[string]string) ([]string, error) {
    // Capture the word.
    re := regexp.MustCompile(`\{\{\s*\.(\w+)\s*\}\}`)

    res := []string{}
    var value string

    for _, flag := range flags {
        value = flag
        match := re.FindStringSubmatch(flag)
        if len(match) > 1 {
            // Return the matched word.
            value = modelInfo[match[1]]

            if value == "" {
                return nil, fmt.Errorf("missing flag or the flag has format error: %s", flag)
        value := flag
        matches := re.FindAllStringSubmatch(flag, -1)

        if len(matches) > 0 {
            for _, match := range matches {
                if len(match) > 1 {
                    // get key
                    key := match[1]
                    replacement, exists := modelInfo[key]
                    if !exists {
                        return nil, fmt.Errorf("missing flag or the flag has format error: %s", flag)
                    }
                    // replace
                    value = strings.Replace(value, match[0], replacement, -1)
                }
            }
        }

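For context, here is a self-contained sketch of the substitution logic this hunk introduces (`FindAllStringSubmatch` plus a per-match `strings.Replace`), which is why a single long flag such as the ollama startup script can now carry a placeholder anywhere inside it. The `main` wrapper and sample inputs are illustrative only:

package main

import (
    "fmt"
    "regexp"
    "strings"
)

// renderFlag replaces every {{ .Key }} placeholder inside a single flag,
// mirroring the loop added above: collect all matches, then substitute each.
func renderFlag(flag string, modelInfo map[string]string) (string, error) {
    re := regexp.MustCompile(`\{\{\s*\.(\w+)\s*\}\}`)
    value := flag
    for _, match := range re.FindAllStringSubmatch(flag, -1) {
        replacement, exists := modelInfo[match[1]]
        if !exists {
            return "", fmt.Errorf("missing flag or the flag has format error: %s", flag)
        }
        value = strings.Replace(value, match[0], replacement, -1)
    }
    return value, nil
}

func main() {
    flag := "ollama run {{ .ModelName }}; while true; do sleep 60; done"
    rendered, err := renderFlag(flag, map[string]string{"ModelName": "qwen2:0.5b"})
    if err != nil {
        panic(err)
    }
    fmt.Println(rendered) // ollama run qwen2:0.5b; while true; do sleep 60; done
}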
9 changes: 9 additions & 0 deletions pkg/controller_helper/backendruntime_test.go
@@ -30,6 +30,15 @@ func TestRenderFlags(t *testing.T) {
        wantFlags []string
        wantError bool
    }{
        {
            name:  "normal parse long args",
            flags: []string{"run {{ .ModelPath }};sleep 5", "--host", "0.0.0.0"},
            modelInfo: map[string]string{
                "ModelPath": "path/to/model",
                "ModelName": "foo",
            },
            wantFlags: []string{"run path/to/model;sleep 5", "--host", "0.0.0.0"},
        },
        {
            name:  "normal parse",
            flags: []string{"-m", "{{ .ModelPath }}", "--served-model-name", "{{ .ModelName }}", "--host", "0.0.0.0"},
3 changes: 2 additions & 1 deletion pkg/controller_helper/model_source/modelsource.go
@@ -72,11 +72,12 @@ func NewModelSourceProvider(model *coreapi.OpenModel) ModelSourceProvider {
    if model.Spec.Source.URI != nil {
        // We'll validate the format in the webhook, so generally no error should happen here.
        protocol, address, _ := util.ParseURI(string(*model.Spec.Source.URI))
        provider := &URIProvider{modelName: model.Name, protocol: protocol}
        provider := &URIProvider{modelName: model.Name, protocol: protocol, modelAddress: address}

        switch protocol {
        case OSS:
            provider.endpoint, provider.bucket, provider.modelPath, _ = util.ParseOSS(address)
        case OLLAMA:
        default:
            // This should be validated at webhooks.
            panic("protocol not supported")
20 changes: 14 additions & 6 deletions pkg/controller_helper/model_source/uri.go
@@ -26,18 +26,23 @@ import (
var _ ModelSourceProvider = &URIProvider{}

const (
    OSS = "OSS"
    OSS    = "OSS"
    OLLAMA = "OLLAMA"
)

type URIProvider struct {
    modelName string
    protocol  string
    bucket    string
    endpoint  string
    modelPath string
    modelName    string
    protocol     string
    bucket       string
    endpoint     string
    modelPath    string
    modelAddress string
}

func (p *URIProvider) ModelName() string {
    if p.protocol == OLLAMA {
        return p.modelAddress
    }
    return p.modelName
}

@@ -58,6 +63,9 @@ func (p *URIProvider) ModelPath() string {
}

func (p *URIProvider) InjectModelLoader(template *corev1.PodTemplateSpec, index int) {
    if p.protocol == OLLAMA {
        return
    }
    initContainerName := MODEL_LOADER_CONTAINER_NAME
    if index != 0 {
        initContainerName += "-" + strconv.Itoa(index)
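To make the OLLAMA branch above concrete: for a model URI like `OLLAMA://qwen2:0.5b`, the address part becomes `modelAddress`, `ModelName()` returns it directly (so the BackendRuntime flag renders to `ollama run qwen2:0.5b`), and `InjectModelLoader` skips the model-loader initContainer because ollama pulls the model itself. The sketch below stands in for `util.ParseURI`, which is assumed here to simply split on `://`; the real helper may differ:

package main

import (
    "fmt"
    "strings"
)

// parseURI is an assumed stand-in for util.ParseURI: split "OLLAMA://qwen2:0.5b"
// into protocol "OLLAMA" and address "qwen2:0.5b".
func parseURI(uri string) (protocol, address string, err error) {
    parts := strings.SplitN(uri, "://", 2)
    if len(parts) != 2 {
        return "", "", fmt.Errorf("invalid uri: %s", uri)
    }
    return strings.ToUpper(parts[0]), parts[1], nil
}

func main() {
    protocol, address, _ := parseURI("OLLAMA://qwen2:0.5b")
    fmt.Println(protocol, address) // OLLAMA qwen2:0.5b
    // For OLLAMA, the provider returns the address as the model name,
    // which the BackendRuntime flag renders into "ollama run qwen2:0.5b".
}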
3 changes: 2 additions & 1 deletion pkg/webhook/openmodel_webhook.go
@@ -47,7 +47,8 @@ func SetupOpenModelWebhook(mgr ctrl.Manager) error {
var _ webhook.CustomDefaulter = &OpenModelWebhook{}

var SUPPORTED_OBJ_STORES = map[string]struct{}{
    modelSource.OSS: {},
    modelSource.OSS:    {},
    modelSource.OLLAMA: {},
}

// Default implements webhook.Defaulter so a webhook will be registered for the type
29 changes: 29 additions & 0 deletions test/config/backends/ollama.yaml
@@ -0,0 +1,29 @@
apiVersion: inference.llmaz.io/v1alpha1
kind: BackendRuntime
metadata:
  labels:
    app.kubernetes.io/name: backendruntime
    app.kubernetes.io/part-of: llmaz
    app.kubernetes.io/created-by: llmaz
  name: ollama
spec:
  commands:
    - sh
    - -c
  image: ollama/ollama
  version: latest
  args:
    - name: default
      flags:
        - "ollama serve &
          while true; do output=$(ollama list 2>&1);
          if ! echo $output | grep -q 'could not connect to ollama app' && echo $output | grep -q 'NAME';then echo 'ollama is running';break; else echo 'Waiting for the ollama to be running...';sleep 1;fi;done;
          ollama run {{`{{ .ModelName }}`}};
          while true;do sleep 60;done"
  resources:
    requests:
      cpu: 2
      memory: 4Gi
    limits:
      cpu: 2
      memory: 4Gi
28 changes: 28 additions & 0 deletions test/integration/controller/inference/playground_test.go
@@ -264,6 +264,34 @@ var _ = ginkgo.Describe("playground controller test", func() {
                },
            },
        }),
        ginkgo.Entry("advance configured Playground with ollama", &testValidatingCase{
            makePlayground: func() *inferenceapi.Playground {
                return wrapper.MakePlayground("playground", ns.Name).ModelClaim(model.Name).Label(coreapi.ModelNameLabelKey, model.Name).
                    BackendRuntime("ollama").BackendRuntimeVersion("main").BackendRuntimeArgs([]string{"--foo", "bar"}).BackendRuntimeEnv("FOO", "BAR").
                    BackendRuntimeRequest("cpu", "1").BackendRuntimeLimit("cpu", "10").
                    Obj()
            },
            updates: []*update{
                {
                    updateFunc: func(playground *inferenceapi.Playground) {
                        gomega.Expect(k8sClient.Create(ctx, playground)).To(gomega.Succeed())
                    },
                    checkFunc: func(ctx context.Context, k8sClient client.Client, playground *inferenceapi.Playground) {
                        validation.ValidatePlayground(ctx, k8sClient, playground)
                        validation.ValidatePlaygroundStatusEqualTo(ctx, k8sClient, playground, inferenceapi.PlaygroundProgressing, "Pending", metav1.ConditionTrue)
                    },
                },
                {
                    updateFunc: func(playground *inferenceapi.Playground) {
                        util.UpdateLwsToReady(ctx, k8sClient, playground.Name, playground.Namespace)
                    },
                    checkFunc: func(ctx context.Context, k8sClient client.Client, playground *inferenceapi.Playground) {
                        validation.ValidatePlayground(ctx, k8sClient, playground)
                        validation.ValidatePlaygroundStatusEqualTo(ctx, k8sClient, playground, inferenceapi.PlaygroundAvailable, "PlaygroundReady", metav1.ConditionTrue)
                    },
                },
            },
        }),
        ginkgo.Entry("playground is created when service exists with the same name", &testValidatingCase{
            makePlayground: func() *inferenceapi.Playground {
                return util.MockASamplePlayground(ns.Name)
