Releases: Nexesenex/croco.cpp
Kobold.CPP_Frankenstein_v1.62.1a_b2637_fastMOE
Update since yesterday, and since LostRuins' official KCPP 1.62.1.
- MOE boost PR by Slaren included (6 commits).
That's it.
What's Changed
- Sl/moe rework 2 bis by @Nexesenex in #106
Full Changelog: v1.62b_b2628...v1.62.1a_b2637
Kobold.CPP_Frankenstein_v1.62b_b2628_IQ1M_fastMOE
_Long time no see!
LostRuins has been back to work, and major updates have landed in LlamaCPP over the last few weeks.
Before our benefactor publishes his new KoboldCPP official release, here's my leechy one!_
Kobold.CPP Frankenstein v1.62's beta source and .exe for Windows built with Openblas/Clblast/Vulkan (small .exe), and the same + Cublas (big .exe) :
- based on GGerganov's LlamaCPP b2628 & LostRuins' KoboldCPP Experimental version 1.62 beta.
- experimental KCPP commits up to 08/04/2024, 20h GMT+1
- With SOTA 1.5 bpw (IQ1_S, IQ1_M), 2 bpw (IQ2_XXS, IQ2_XS, IQ2_S, IQ2_M), 3 bpw (IQ3_XXS, IQ3_XS, IQ3_S, IQ3_M), and 4 bpw (IQ4_XS) GGUF models working as in LlamaCPP b2628 (see the size sketch after this list).
- With SOTA IQ4_NL quant (for non-standard models with weird tensor shapes) working as in LlamaCPP b2628.
- With Google Gemma compatibility as in LlamaCPP b2628.
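As a rough guide to what those bits-per-weight figures mean on disk, here's a minimal back-of-envelope sketch (my own illustration, not a KoboldCPP or LlamaCPP tool; the bpw values are approximate and real GGUF files come out a bit larger because some tensors stay at higher precision):

```python
# Back-of-envelope GGUF size estimate from parameter count and bits per weight.
# The bpw values below are approximate; real files are somewhat larger because
# embedding/output tensors are usually kept at higher precision.
QUANT_BPW = {"IQ1_S": 1.56, "IQ2_XXS": 2.06, "IQ3_XXS": 3.06, "IQ4_XS": 4.25}

def estimate_size_gib(n_params: float, bpw: float) -> float:
    """Approximate file size in GiB for n_params weights at the given bpw."""
    return n_params * bpw / 8 / 1024**3

for name, bpw in QUANT_BPW.items():
    print(f"70b at {name} (~{bpw} bpw): ~{estimate_size_gib(70e9, bpw):.1f} GiB")
```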
Also with (untested) :
- Vulkan support implemented by the devs (constantly improving version after version).
- MOE speed bump by Slaren (PR ggerganov#6505)
And with, as always :
- unlocked context size (now standard in KCPP)
- custom rope settings (see the rope-scaling sketch after this list)
- no KCPP fragmentation cache
- LostRuins seems to have sorted out the CUDA speed well, and I didn't mess with anything this time: it works as intended, both with & without MMQ.
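On the custom rope settings mentioned above: with linear RoPE scaling, the usual rule of thumb is that the frequency scale shrinks in proportion to how far you stretch the context past the model's trained length. A minimal sketch of that arithmetic (my own illustration; the trained and target context sizes are hypothetical examples, not values baked into KCPP):

```python
# Rule-of-thumb linear RoPE scaling: to stretch context by a factor k,
# use freq_scale = trained_ctx / target_ctx (i.e. 0.5 to double the context).
# Example numbers are hypothetical; check your model's trained context length.
def linear_rope_freq_scale(trained_ctx: int, target_ctx: int) -> float:
    return trained_ctx / target_ctx

trained = 8192  # hypothetical native context of the model
for target in (16384, 32768):
    print(f"{trained} -> {target}: freq_scale = {linear_rope_freq_scale(trained, target):.3f}")
```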
The Cublas version is compiled with Cublas 12.4.
All credits go to LostRuins, who tirelessly develops KoboldCPP, to the other devs who brought features to KCPP, and to the devs of LlamaCPP.
For more information on the features of KoboldCPP 1.61.2, see: https://github.com/LostRuins/koboldcpp/releases/tag/v1.61.2
The Frankenstein versions of KoboldCPP released here are not supported by LostRuins, nor is the unlocked context size available here and there via the command line: this is for testing and amusement only.
What's Changed
- Sl/moe rework 2 by @Nexesenex in #104
Full Changelog: v1.59d_b2254...v1.62b_b2628
Kobold.CPP_Frankenstein_v1.60b_1.62_20240316
For https://github.com/brokofankone and others who have CUDA speed troubles with the April releases of official and Frankenstein KCPP.
Here's an internal Frankenstein release of mine from 16 March 2024, to test and possibly update your February releases while keeping good performance.
The *of.exe versions (from the same date in my backups) are probably the official builds they are based on / competitive with.
I rebuilt my dual-GPU setup, and my 16 March 2024 version gives me good performance.
Kobold.CPP_Frankenstein_v1.59d_b2254_4x3bits_SOTA
Kobold.CPP Frankenstein v1.59's source and .exe for Windows built with Openblas/Clblast/Vulkan (small .exe), and the same + Cublas (big .exe) :
- based on LlamaCPP b2254 & LostRuins' KoboldCPP Experimental version 1.59 beta (my fourth internal compilation, hence 1.59d)
- experimental KCPP commits up to 24/02/2024, 15h GMT+1
- With SOTA 1.5 bpw (IQ1_S), 2 bpw (IQ2_XS & XXS) and 3 bpw (IQ3_XXS, Q3_K_XS_v2, IQ3_S, IQ3_M) GGUF models working as in LlamaCPP b2254.
- With IQ4_NL quants working as in LlamaCPP b2254.
- With Google Gemma compatibility as in LlamaCPP b2254.
- With LostRuins's work on the CUDA speed regression sorted out for my needs, so for AMPERE CARDS on Windows 10/11 (untested on other cards and OSes), with the fastest PP & TG speed available.
Note about CUDA speed: see the thread on CUDA speed regressions (and optimizations :) at LostRuins#642
Also with (untested) :
- Vulkan support implemented by the devs (constantly improving version after version).
And with, as always :
- unlocked context size
- custom rope settings
- no KCPP fragmentation cache
- benchmark feature pushed from 2k to 8k tokens.
The Cublas version is compiled with Cublas 12.3. And it's real fast on Ampere.
All credits go to LostRuins, who tirelessly develops KoboldCPP, to the other devs who brought features to KCPP, and to the devs of LlamaCPP.
For more information on the features of KoboldCPP 1.58, see: https://github.com/LostRuins/koboldcpp/releases/tag/v1.58
The Frankenstein versions of KoboldCPP released here are not supported by LostRuins, nor is the unlocked context size; it is provided here for testing and amusement only.
What's Changed
- b2254 by @Nexesenex in #90
Full Changelog: v1.59c_b2249...v1.59d_b2254
b2252
server: init functional tests (#5566)
* server: tests: init scenarios
  - health and slots endpoints
  - completion endpoint
  - OAI compatible chat completion requests w/ and without streaming
  - completion multi users scenario
  - multi users scenario on OAI compatible endpoint with streaming
  - multi users with total number of tokens to predict exceeds the KV Cache size
  - server wrong usage scenario, like in Infinite loop of "context shift" #3969
  - slots shifting
  - continuous batching
  - embeddings endpoint
  - multi users embedding endpoint: Segmentation fault #5655
  - OpenAI-compatible embeddings API
  - tokenize endpoint
  - CORS and api key scenario
* server: CI GitHub workflow
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Kobold.CPP_Frankenstein_v1.59c_b2249_Gemma
Kobold.CPP Frankenstein v1.59's source and .exe for Windows built with Openblas/Clblast/Vulkan (small .exe), and the same + Cublas (big .exe) :
- based on LlamaCPP b2241 + 8 commits ("b2249") & LostRuins' KoboldCPP Experimental version 1.59 beta (my third internal compilation, hence 1.59c)
- experimental KCPP commits up to 22/02/2024, 22h GMT+1
- With SOTA 1.5 bpw (IQ1_S), 2 bpw (IQ2_XS & XXS) and 3 bpw (IQ3_XXS) GGUF models working as in LlamaCPP "b2249".
- With IQ4_NL quants working as in LlamaCPP "b2249".
- With Google Gemma compatibility as in LlamaCPP "b2249".
- With LostRuins's work on the CUDA speed regression sorted out for my needs, so for AMPERE CARDS on Windows 10/11 (untested on other cards and OSes), with the fastest PP & TG speed available.
Note about CUDA speed: see the thread on CUDA speed regressions (and optimizations :) at LostRuins#642 (comment)
Also with (untested) :
- Vulkan support implemented by the devs (constantly improving version after version).
And with, as always :
- unlocked context size
- custom rope settings
- no KCPP fragmentation cache
- benchmark feature pushed from 2k to 8k tokens.
The Cublas version is compiled with Cublas 12.3. And it's real fast on Ampere.
All credits go to LostRuins, who tirelessly develops KoboldCPP, to the other devs who brought features to KCPP, and to the devs of LlamaCPP.
For more information on the features of KoboldCPP 1.58, see: https://github.com/LostRuins/koboldcpp/releases/tag/v1.58
The Frankenstein versions of KoboldCPP released here are not supported by LostRuins, nor is the unlocked context size; it is provided here for testing and amusement only.
What's Changed
- b2249 by @Nexesenex in #89
Full Changelog: v1.58.b2167_IQ1_S_V3...v1.59c_b2249
b2234
llama : fix loading models with shared tok_embd and output (#5651) ggml-ci
Kobold.CPP_Frankenstein_v1.58_b2167_IQ1_S_V3
Kobold.CPP Frankenstein v1.58's source and .exe for Windows built with Openblas/Clblast/Vulkan (small .exe), and the same + Cublas (big .exe) :
- based on LlamaCPP b2167 & KoboldCPP Release version 1.58
- experimental KCPP commits up to 16/02/2024, 24h GMT+1
- With SOTA 2 bpw (IQ2_XS & XXS) and 3 bpw (IQ3_XXS) GGUF models working as in LlamaCPP b2167.
- BONUS: SOTA 1.65-1.7 bpw (IQ1_S, already version 3) compatibility thanks to Ikawrakow's work on a LlamaCPP demo feature! More info: ggerganov#5453
- With LostRuins's work on the CUDA speed regression sorted out for my needs, so for AMPERE CARDS on Windows 10/11 (untested on other cards and OSes), with the fastest PP & TG speed available.
Note about CUDA speed: see the thread on CUDA speed regressions (and optimizations :) at LostRuins#642 (comment)
Note about IQ1_S: perplexity jumps badly with IQ1_S (+50% minimum vs fp16, and far more for models < 70b); only 70b models might be usable to a small extent, and 33/34b models to an even lesser extent (see the small illustration after this note).
- After testing, any decent 7b will do better than a 70b model at 1.5 bpw like this Miqu, but there is still some basic coherence left, which is very impressive for such a low quant.
- As for Kyllene 34b, she's really a mess at such a quant lol, even if some basic hallucinatory coherence subsists on a 34b model. At least you can clearly see how the effect of prompt formatting differs from model to model. IQ1_S V2 & V3 improve this a lot, without changing my overall assessment.
I'm "praying" for some 1.8-1.9 bpw quants, which might be more usable for us than such a SOTA tech showcase, at least on 70b models and maybe on 34b as well.
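To make that "+50% perplexity" figure concrete, here is a tiny illustration (the fp16 baseline below is a hypothetical number, not a measurement from these builds):

```python
# What a "+50% minimum" perplexity jump means in absolute terms.
# The fp16 baseline is hypothetical, for illustration only.
fp16_ppl = 4.0                # hypothetical fp16 perplexity of a 70b model
iq1_s_ppl = fp16_ppl * 1.5    # the "+50% minimum" quoted in the note above
print(f"fp16: {fp16_ppl:.2f} -> IQ1_S: at least {iq1_s_ppl:.2f}, and much worse below 70b")
```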
Also with (untested) :
- Vulkan support implemented by the devs (constantly improving version after version).
- Benchmark feature pushed from 2k to 8k tokens.
And with, as always :
- unlocked context size
- custom rope settings
- no KCPP fragmentation cache
The Cublas version is compiled with Cublas 12.3. And it's real fast on Ampere.
All credits go to LostRuins, who tirelessly develops KoboldCPP, to the other devs who brought features to KCPP, and to the devs of LlamaCPP.
For more information on the features of KoboldCPP 1.58, see: https://github.com/LostRuins/koboldcpp/releases/tag/v1.58
The Frankenstein versions of KoboldCPP released here are not supported by LostRuins, nor is the unlocked context size; it is provided here for testing and amusement only.
Full Changelog: 1.58_b2131_IQ1_S_v3...v1.58.b2167_IQ1_S_V3
b2167
cmake : fix VULKAN and ROCm builds (#5525)
* cmake : fix VULKAN and ROCm builds
* cmake : fix (cont)
* vulkan : fix compile warnings ggml-ci
* cmake : fix ggml-ci
* cmake : minor ggml-ci
Kobold.CPP_Frankenstein_v1.57.1+1_b2116_SOTA_IQ1_S_V1 (obsolete)
Kobold.CPP Frankenstein v1.57.1's source and .exe for Windows built with Openblas/Clblast/Vulkan (small .exe), and the same + Cublas (big .exe) :
- based on LlamaCPP b2116
- experimental KCPP commits up to 10/02/2024, 23h GMT+1, commit f9bc724
- so, with the Quadratic Sampling UI commit by Alexandar Abushady, on top of Kalomaze's Quadratic Sampling feature.
- and still SOTA 2 bpw (IQ2_XS & XXS) and 3 bpw (IQ3_XXS) GGUF models working as in LlamaCPP b2116.
- With LostRuins's work on the CUDA speed regression sorted out for my needs, so for AMPERE CARDS (untested on others), which leads here, ON MY CONFIG (3090+3060), to +25% token generation speed and equivalent prompt processing speed compared to my "already fast" release 1.57b2030 (non-quadratic, and only the .exe because the source is messed up), which is now deprecated, as are all previous versions.
Thread about CUDA speed regressions (and optimizations :) at LostRuins#642 (comment)
- BONUS: SOTA 1.65-1.7 bpw (IQ1_S) compatibility thanks to Ikawrakow's work on a LlamaCPP demo feature! More info: ggerganov#5453
Anyway, 2 models quantized to 1.5 bpw by yours truly to play with the new feature:
- Miqu 70b Requant iMat : https://huggingface.co/Nexesenex/MIstral-QUantized-70b_Miqu-1-70b-iMat.GGUF
- Kyllene 34b iMat : https://huggingface.co/Nexesenex/TeeZee_Kyllene-Yi-34B-v1.1-iMat.GGUF
Note: perplexity jumps badly with IQ1_S (+50% minimum vs fp16, and far more for models < 70b); only 70b models might be usable to a small extent, and 33/34b models to an even lesser extent. After testing, any decent 7b will do better than a 70b model at 1.5 bpw like this Miqu, but there is still some basic coherence left, which is very impressive for such a low quant. As for Kyllene, she's really a mess at such a quant lol, even if some basic hallucinatory coherence subsists on a 34b model. At least you can clearly see how the effect of prompt formatting differs from model to model.
I'm waiting for some 1.85-1.90 bpw quants, which might be more usable for us than such a SOTA tech showcase, at least on 70b models and maybe on 34b as well.
Also with (untested) :
- Early Vulkan support (improving steadily compared to previous versions).
- Benchmark feature pushed to 8k tokens.
And with, as always :
- unlocked context size
- custom rope settings
- no KCPP fragmentation cache
The Cublas version is compiled with Cublas 12.3. And it's real fast on Ampere.
All credits go to LostRuins, who tirelessly develops KoboldCPP, to the other devs who brought features to KCPP, and to the devs of LlamaCPP.
For more information on the features of KoboldCPP 1.57.1, see: https://github.com/LostRuins/koboldcpp/releases/tag/v1.57.1
The Frankenstein versions of KoboldCPP released here are not supported by LostRuins, nor is the unlocked context size; it is provided here for testing and amusement only.
What's Changed
- Ik/iq1 s by @Nexesenex in #84
Full Changelog: v1.57.1_b2106...v1.57.1_b2116+1_SOTA_IQ1_S