Kobold.CPP_Frankenstein_v1.59c_b2249_Gemma
KoboldCPP Frankenstein v1.59 source and .exe for Windows, built with OpenBLAS/CLBlast/Vulkan (small .exe), and the same plus CuBLAS (big .exe):
-
Based on LlamaCPP b2241 + 8 commits ("b2249") and LostRuins's KoboldCPP Experimental version 1.59 beta (my third internal build, hence 1.59c).
-
Includes experimental KCPP commits up to 22/02/2024, 22:00 GMT+1.
-
With SOTA 1.5 bpw (IQ1_S), 2 bpw (IQ2_XS & IQ2_XXS) and 3 bpw (IQ3_XXS) GGUF models working as in LlamaCPP "b2249".
-
With IQ4_NL quants working as in LlamaCPP "b2249".
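For readers who want to produce these IQ quants themselves, a minimal sketch using the upstream LlamaCPP `quantize` tool is below. The file names are placeholders; the sub-2 bpw types (IQ1_S, IQ2_XS/XXS) are designed to be built with an importance matrix (`--imatrix`), generated beforehand with LlamaCPP's `imatrix` tool against a calibration text file.

```shell
# Generate an importance matrix from a calibration text (paths are examples).
./imatrix -m model-f16.gguf -f calibration.txt -o model.imatrix

# Quantize to a SOTA low-bpw type using that matrix.
./quantize --imatrix model.imatrix model-f16.gguf model-IQ2_XXS.gguf IQ2_XXS

# IQ4_NL does not strictly require an imatrix, though one can still help.
./quantize model-f16.gguf model-IQ4_NL.gguf IQ4_NL
```

The resulting .gguf files then load in this KoboldCPP build just like any other quant.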
-
With Google Gemma compatibility as in LlamaCPP "b2249".
-
With LostRuins's work on the CUDA speed regression sorted out for my needs, i.e. for AMPERE CARDS on Windows 10/11 (untested on other cards and OSes), giving the fastest PP & TG speeds available.
Note about CUDA speed: see the thread about CUDA speed regressions (and optimizations :) : LostRuins#642 (comment)
Also with (untested):
- Vulkan support implemented by the devs (improving constantly, version after version).
And with, as always :
- unlocked context size
- custom rope settings
- no KCPP fragmentation cache
- benchmark feature pushed from 2k to 8k tokens
The CuBLAS version is compiled with CuBLAS 12.3, and it is very fast on Ampere.
All credit goes to LostRuins, who tirelessly develops KoboldCPP, to the other devs who brought features to KCPP, and to the devs of LlamaCPP.
For more information on the features of KoboldCPP 1.58, see: https://github.com/LostRuins/koboldcpp/releases/tag/v1.58
The Frankenstein versions of KoboldCPP released here are not supported by LostRuins, nor is the unlocked context size; they are provided for testing and amusement only.
What's Changed
- b2249 by @Nexesenex in #89
Full Changelog: v1.58.b2167_IQ1_S_V3...v1.59c_b2249