Inference speed exl2 vs gguf - are my results typical? #471
Closed
LlamaEnjoyer started this conversation in General
Hi folks!
I've been toying around with LLMs for the past few weeks, and it has become my new hobby :) I started out with LM Studio, but recently I installed ExUI to see for myself whether exl2 is really that awesome. Putting aside the hurdle-hopping to get it up and running on my Windows PC, I decided to run a quick speed test using the Llama 3 8B Instruct Q8_0 quants in both LM Studio and ExUI.
I tried to match the parameters between the two to keep the comparison fair and unbiased: flash attention on, context set to 8192, FP16 cache and no speculative decoding in ExUI, and the gguf fully offloaded to the GPU.
I used the following prompt:
"List the first 30 elements of the periodic table, stating their atomic masses in brackets. Do it as a numbered list."
LM Studio reported ~56 t/s while ExUI reported ~64 t/s, which makes exl2 roughly 14% faster than gguf in this specific test (64/56 ≈ 1.14).
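For anyone who wants to reproduce the gguf side of this outside LM Studio, here's a minimal sketch using llama-cpp-python; the model path and max_tokens are placeholders, not the exact settings from my test:

```python
import time

from llama_cpp import Llama

# Load the gguf with settings mirroring the test above; the model path
# below is a hypothetical local file, not taken from the original post.
llm = Llama(
    model_path="Meta-Llama-3-8B-Instruct-Q8_0.gguf",  # hypothetical path
    n_gpu_layers=-1,   # fully offload to the GPU
    n_ctx=8192,        # match the 8192 context used in both UIs
    flash_attn=True,   # flash attention on
    verbose=False,
)

prompt = ("List the first 30 elements of the periodic table, stating their "
          "atomic masses in brackets. Do it as a numbered list.")

start = time.perf_counter()
out = llm(prompt, max_tokens=512)
elapsed = time.perf_counter() - start

# llama-cpp-python returns an OpenAI-style completion dict with token usage.
n_tokens = out["usage"]["completion_tokens"]
print(f"{n_tokens} tokens in {elapsed:.2f}s -> {n_tokens / elapsed:.1f} t/s")
```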
Is this about in line with what should be expected?
My specs:
i7-14700K, 64GB of DDR4-4300 RAM, RTX 4070 Ti Super 16GB VRAM, Windows 11 Pro.
Thanks!
Replies: 1 comment

Got my answers in a reddit thread. Closing.