-
This sounds like a llama.cpp problem, not an LMQL one. Make sure you install llama-cpp-python with the following command:
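The command itself is missing above; a plausible reconstruction, assuming the standard CUDA-enabled llama-cpp-python build that was documented at the time, would be:

```bash
# Assumption: rebuild llama-cpp-python with CUDA (cuBLAS) support,
# so that n_gpu_layers actually offloads layers to the GPU.
CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install --force-reinstall --no-cache-dir llama-cpp-python
```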
-
Yes, it turns out it is a llama-cpp-python problem, because it works with llama.cpp directly. Maybe they will fix it in a future release; they are always far behind llama.cpp.
-
Hello, when trying to use the model via serve-model, the model always gets reloaded on CPU instead of using the model already loaded on GPU.
I am able to load my model locally, using all 3 GPUs, with this code:
```python
import lmql

query_string = '''
"Q: In one word, what is the capital of {country}? \n"
"A: [CAPITAL] \n"
"Q: What is the main sight in {CAPITAL}? \n"
"A: [ANSWER]" where (len(TOKENS(CAPITAL)) < 10)
    and (len(TOKENS(ANSWER)) < 200) and STOPS_AT(CAPITAL, '\n')
    and STOPS_AT(ANSWER, '\n')
'''

print(lmql.run_sync(query_string,
                    country="united kingdom",
                    model=lmql.model("local:llama.cpp:/home/ebudmada/llama.cpp/phind-codellama-34b-v2.Q8_0.gguf",
                                     cuda=True,
                                     n_ctx=512,
                                     n_gpu_layers=-1,
                                     tokenizer='Phind/Phind-CodeLlama-34B-v2')).variables)
```
But I am unable to use the serve-model instance. I start it in a terminal with:

```
lmql serve-model llama.cpp:/home/ebudmada/llama.cpp/phind-codellama-34b-v2.Q8_0.gguf --cuda --n_ctx=512 --n_gpu_layers=-1 --trust_remote_code True
[Serving LMTP endpoint on ws://localhost:8080/]
```
I see a little GPU RAM being used (276 MB on each of the 3 GPUs), but when I try to use the model from my Python script, it always loads a new model on CPU. I tried a lot of things, but essentially I am using:
```python
print(lmql.run_sync(query_string,
                    country="united kingdom",
                    model=lmql.model("llama.cpp:/home/ebudmada/llama.cpp/phind-codellama-34b-v2.Q8_0.gguf",
                                     cuda=True,
                                     n_ctx=512,
                                     n_gpu_layers=-1,
                                     endpoint="localhost:8080",
                                     tokenizer='Phind/Phind-CodeLlama-34B-v2')).variables)
```
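For what it's worth, here is a minimal variant I would expect to work if the loading options belong on the server side (just a sketch, assuming the client only needs the same model string, the endpoint, and the tokenizer):

```python
import lmql

# Sketch (assumption): all loading options (cuda, n_ctx, n_gpu_layers) are
# given to `lmql serve-model` at startup; the client call only names the
# same model string, the endpoint, and the tokenizer.
m = lmql.model(
    "llama.cpp:/home/ebudmada/llama.cpp/phind-codellama-34b-v2.Q8_0.gguf",
    endpoint="localhost:8080",
    tokenizer="Phind/Phind-CodeLlama-34B-v2",
)
print(lmql.run_sync(query_string, country="united kingdom", model=m).variables)
```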
Can anyone help me? Thank you!