Inference speed exl2 vs gguf - are my results typical? #471
Closed
LlamaEnjoyer started this conversation in General
Hi folks!
I've been toying around with LLMs for the past few weeks, and it has become my new hobby :) I started out with LM Studio, but recently I installed ExUI to see for myself whether exl2 is really that awesome. Putting aside the hurdle-hopping to get it up and running on my Windows PC, I decided to run a quick speed test using the Llama 3 8B Instruct Q8_0 quants in both LM Studio and ExUI.
I tried to match the parameters between the two to keep the comparison fair and unbiased: flash attention on, context set to 8192, FP16 cache and no speculative decoding in ExUI, and the gguf fully offloaded to the GPU.
I used the following prompt:
"List the first 30 elements of the periodic table, stating their atomic masses in brackets. Do it as a numbered list."
LM Studio reported ~56 t/s while ExUI reported ~64 t/s, which makes exl2 roughly 14% faster than gguf in this specific test (64/56 ≈ 1.14).
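For anyone who wants to reproduce the gguf side of this outside LM Studio, here's a minimal sketch using llama-cpp-python; the model path and max_tokens are placeholders, not the exact settings from my test:

```python
import time

from llama_cpp import Llama

# Load the gguf with settings mirroring the test above; the model path
# below is a hypothetical local file, not taken from the original post.
llm = Llama(
    model_path="Meta-Llama-3-8B-Instruct-Q8_0.gguf",  # hypothetical path
    n_gpu_layers=-1,   # fully offload to the GPU
    n_ctx=8192,        # match the 8192 context used in both UIs
    flash_attn=True,   # flash attention on
    verbose=False,
)

prompt = ("List the first 30 elements of the periodic table, stating their "
          "atomic masses in brackets. Do it as a numbered list.")

start = time.perf_counter()
out = llm(prompt, max_tokens=512)
elapsed = time.perf_counter() - start

# llama-cpp-python returns an OpenAI-style completion dict with token usage.
n_tokens = out["usage"]["completion_tokens"]
print(f"{n_tokens} tokens in {elapsed:.2f}s -> {n_tokens / elapsed:.1f} t/s")
```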
Is this about in line with what should be expected?
My specs:
i7-14700K, 64GB of DDR4-4300 RAM, RTX 4070 Ti Super 16GB VRAM, Windows 11 Pro.
Thanks!
Replies: 1 comment

Got my answers in a reddit thread. Closing.