Releases: Nexesenex/croco.cpp
Kobold.CPP_Frankenstein_v1.62.1a_b2637_fastMOE
Update since yesterday, and since LostRuins' official KCPP 1.62.1.
- MOE boost PR by Slaren included (6 commits).
That's it.
What's Changed
- Sl/moe rework 2 bis by @Nexesenex in #106
Full Changelog: v1.62b_b2628...v1.62.1a_b2637
Kobold.CPP_Frankenstein_v1.62b_b2628_IQ1M_fastMOE
_Long time no see!
LostRuins has been back to work, and major updates have landed in LlamaCPP over the last few weeks.
Before our benefactor publishes his new KoboldCPP official release, here's my leechy one!_
Kobold.CPP Frankenstein v1.62's beta source and .exe for Windows built with Openblas/Clblast/Vulkan (small .exe), and the same + Cublas (big .exe) :
- based on GGerganov's LlamaCPP b2628 & LostRuins' KoboldCPP Experimental version 1.62 beta.
- experimental KCPP commits up to 08/04/2024, 20h GMT+1
- With SOTA 1.5 bpw (IQ1_S, IQ1_M), 2 bpw (IQ2_XXS, IQ2_XS, IQ2_S, IQ2_M), 3 bpw (IQ3_XXS, IQ3_XS, IQ3_S, IQ3_M), and 4 bpw (IQ4_XS) GGUF models working as in LlamaCPP b2628 (see the size sketch after this list).
- With SOTA IQ4_NL quant (for non-standard models with weird tensor shapes) working as in LlamaCPP b2628.
- With Google Gemma compatibility as in LlamaCPP b2628.
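As a rough guide to what those bits-per-weight figures mean on disk, here's a minimal back-of-envelope sketch (my own illustration, not a KoboldCPP or LlamaCPP tool; the bpw values are approximate and real GGUF files come out a bit larger because some tensors stay at higher precision):

```python
# Back-of-envelope GGUF size estimate from parameter count and bits per weight.
# The bpw values below are approximate; real files are somewhat larger because
# embedding/output tensors are usually kept at higher precision.
QUANT_BPW = {"IQ1_S": 1.56, "IQ2_XXS": 2.06, "IQ3_XXS": 3.06, "IQ4_XS": 4.25}

def estimate_size_gib(n_params: float, bpw: float) -> float:
    """Approximate file size in GiB for n_params weights at the given bpw."""
    return n_params * bpw / 8 / 1024**3

for name, bpw in QUANT_BPW.items():
    print(f"70b at {name} (~{bpw} bpw): ~{estimate_size_gib(70e9, bpw):.1f} GiB")
```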
Also with (untested) :
- Vulkan support implemented by the devs (constantly improving version after version).
- MOE speed bump by Slaren (PR ggerganov#6505)
And with, as always :
- unlocked context size (now standard in KCPP)
- custom rope settings (see the rope-scaling sketch after this list)
- no KCPP fragmentation cache
- LostRuins seems to have sorted out the CUDA speed well, and I didn't mess with anything this time: it works as intended, both with & without MMQ.
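On the custom rope settings mentioned above: with linear RoPE scaling, the usual rule of thumb is that the frequency scale shrinks in proportion to how far you stretch the context past the model's trained length. A minimal sketch of that arithmetic (my own illustration; the trained and target context sizes are hypothetical examples, not values baked into KCPP):

```python
# Rule-of-thumb linear RoPE scaling: to stretch context by a factor k,
# use freq_scale = trained_ctx / target_ctx (i.e. 0.5 to double the context).
# Example numbers are hypothetical; check your model's trained context length.
def linear_rope_freq_scale(trained_ctx: int, target_ctx: int) -> float:
    return trained_ctx / target_ctx

trained = 8192  # hypothetical native context of the model
for target in (16384, 32768):
    print(f"{trained} -> {target}: freq_scale = {linear_rope_freq_scale(trained, target):.3f}")
```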
The Cublas version is compiled with Cublas 12.4.
All credits go to LostRuins, who tirelessly develops KoboldCPP, to the other devs who brought features to KCPP, and to the devs of LlamaCPP.
For more information on the features of KoboldCPP 1.61.2, see: https://github.com/LostRuins/koboldcpp/releases/tag/v1.61.2
The Frankenstein versions of KoboldCPP released here are not supported by LostRuins, nor is the unlocked context size available here and there via the command line: this is for testing and amusement only.
What's Changed
- Sl/moe rework 2 by @Nexesenex in #104
Full Changelog: v1.59d_b2254...v1.62b_b2628
Kobold.CPP_Frankenstein_v1.60b_1.62_20240316
For https://github.com/brokofankone and others who have CUDA speed troubles with the April releases of official and Frankenstein KCPP.
Here's an internal Frankenstein release of mine from 16 March 2024, to test and possibly update your February releases while keeping good performance.
The *of.exe versions (from the same date in my backups) are probably the official builds they are based on / competitive with.
I rebuilt my dual-GPU setup, and my 16 March 2024 version gives me good performance.
Kobold.CPP_Frankenstein_v1.59d_b2254_4x3bits_SOTA
Kobold.CPP Frankenstein v1.59's source and .exe for Windows built with Openblas/Clblast/Vulkan (small .exe), and the same + Cublas (big .exe) :
- based on LlamaCPP b2254 & LostRuins' KoboldCPP Experimental version 1.59 beta (my fourth internal compilation, hence 1.59d)
- experimental KCPP commits up to 24/02/2024, 15h GMT+1
- With SOTA 1.5 bpw (IQ1_S), 2 bpw (IQ2_XS & XXS) and 3 bpw (IQ3_XXS, Q3_K_XS_v2, IQ3_S, IQ3_M) GGUF models working as in LlamaCPP b2254.
- With IQ4_NL quants working as in LlamaCPP b2254.
- With Google Gemma compatibility as in LlamaCPP b2254.
- With LostRuins's work on the CUDA speed regression sorted out for my needs, so for AMPERE CARDS on Windows 10/11 (untested on other cards and OSes), with the fastest PP & TG speed available.
Note about CUDA speed: see the thread on CUDA speed regressions (and optimizations :) at LostRuins#642
Also with (untested) :
- Vulkan support implemented by the devs (constantly improving version after version).
And with, as always :
- unlocked context size
- custom rope settings
- no KCPP fragmentation cache
- benchmark feature pushed from 2k to 8k tokens.
The Cublas version is compiled with Cublas 12.3. And it's real fast on Ampere.
All credits go to LostRuins, who tirelessly develops KoboldCPP, to the other devs who brought features to KCPP, and to the devs of LlamaCPP.
For more information on the features of KoboldCPP 1.58, see: https://github.com/LostRuins/koboldcpp/releases/tag/v1.58
The Frankenstein versions of KoboldCPP released here are not supported by LostRuins, nor is the unlocked context size; it is provided here for testing and amusement only.
What's Changed
- b2254 by @Nexesenex in #90
Full Changelog: v1.59c_b2249...v1.59d_b2254
b2252
server: init functional tests (#5566)
* server: tests: init scenarios
  - health and slots endpoints
  - completion endpoint
  - OAI compatible chat completion requests w/ and without streaming
  - completion multi users scenario
  - multi users scenario on OAI compatible endpoint with streaming
  - multi users with total number of tokens to predict exceeds the KV Cache size
  - server wrong usage scenario, like in Infinite loop of "context shift" #3969
  - slots shifting
  - continuous batching
  - embeddings endpoint
  - multi users embedding endpoint: Segmentation fault #5655
  - OpenAI-compatible embeddings API
  - tokenize endpoint
  - CORS and api key scenario
* server: CI GitHub workflow
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Kobold.CPP_Frankenstein_v1.59c_b2249_Gemma
Kobold.CPP Frankenstein v1.59's source and .exe for Windows built with Openblas/Clblast/Vulkan (small .exe), and the same + Cublas (big .exe) :
- based on LlamaCPP b2241 + 8 commits ("b2249") & LostRuins' KoboldCPP Experimental version 1.59 beta (my third internal compilation, hence 1.59c)
- experimental KCPP commits up to 22/02/2024, 22h GMT+1
- With SOTA 1.5 bpw (IQ1_S), 2 bpw (IQ2_XS & XXS) and 3 bpw (IQ3_XXS) GGUF models working as in LlamaCPP "b2249".
- With IQ4_NL quants working as in LlamaCPP "b2249".
- With Google Gemma compatibility as in LlamaCPP "b2249".
- With LostRuins's work on the CUDA speed regression sorted out for my needs, so for AMPERE CARDS on Windows 10/11 (untested on other cards and OSes), with the fastest PP & TG speed available.
Note about CUDA speed: see the thread on CUDA speed regressions (and optimizations :) at LostRuins#642 (comment)
Also with (untested) :
- Vulkan support implemented by the devs (constantly improving version after version).
And with, as always :
- unlocked context size
- custom rope settings
- no KCPP fragmentation cache
- benchmark feature pushed from 2k to 8k tokens.
The Cublas version is compiled with Cublas 12.3. And it's real fast on Ampere.
All credits go to LostRuins, who tirelessly develops KoboldCPP, to the other devs who brought features to KCPP, and to the devs of LlamaCPP.
For more information on the features of KoboldCPP 1.58, see: https://github.com/LostRuins/koboldcpp/releases/tag/v1.58
The Frankenstein versions of KoboldCPP released here are not supported by LostRuins, nor is the unlocked context size; it is provided here for testing and amusement only.
What's Changed
- b2249 by @Nexesenex in #89
Full Changelog: v1.58.b2167_IQ1_S_V3...v1.59c_b2249
b2234
llama : fix loading models with shared tok_embd and output (#5651) ggml-ci
Kobold.CPP_Frankenstein_v1.58_b2167_IQ1_S_V3
Kobold.CPP Frankenstein v1.58's source and .exe for Windows built with Openblas/Clblast/Vulkan (small .exe), and the same + Cublas (big .exe) :
- based on LlamaCPP b2167 & KoboldCPP Release version 1.58
- experimental KCPP commits up to 16/02/2024, 24h GMT+1
- With SOTA 2 bpw (IQ2_XS & XXS) and 3 bpw (IQ3_XXS) GGUF models working as in LlamaCPP b2167.
- BONUS: SOTA 1.65-1.7 bpw (IQ1_S, already version 3) compatibility thanks to Ikawrakow's work on a LlamaCPP demo feature! More info: ggerganov#5453
- With LostRuins's work on the CUDA speed regression sorted out for my needs, so for AMPERE CARDS on Windows 10/11 (untested on other cards and OSes), with the fastest PP & TG speed available.
Note about CUDA speed: see the thread on CUDA speed regressions (and optimizations :) at LostRuins#642 (comment)
Note about IQ1_S: perplexity jumps badly with IQ1_S (+50% minimum vs fp16, and far more for models < 70b); only 70b models might be usable to a small extent, and 33/34b models to an even lesser extent (see the small illustration after this note).
- After testing, any decent 7b will do better than a 70b model at 1.5 bpw like this Miqu, but there is still some basic coherence left, which is very impressive for such a low quant.
- As for Kyllene 34b, she's really a mess at such a quant lol, even if some basic hallucinatory coherence subsists on a 34b model. At least you can clearly see how the effect of prompt formatting differs from model to model. IQ1_S V2 & V3 improve this a lot, without changing my overall assessment.
I'm "praying" for some 1.8-1.9 bpw quants, which might be more usable for us than such a SOTA tech showcase, at least on 70b models and maybe on 34b as well.
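To make that "+50% perplexity" figure concrete, here is a tiny illustration (the fp16 baseline below is a hypothetical number, not a measurement from these builds):

```python
# What a "+50% minimum" perplexity jump means in absolute terms.
# The fp16 baseline is hypothetical, for illustration only.
fp16_ppl = 4.0                # hypothetical fp16 perplexity of a 70b model
iq1_s_ppl = fp16_ppl * 1.5    # the "+50% minimum" quoted in the note above
print(f"fp16: {fp16_ppl:.2f} -> IQ1_S: at least {iq1_s_ppl:.2f}, and much worse below 70b")
```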
Also with (untested) :
- Vulkan support implemented by the devs (constantly improving version after version).
- Benchmark feature pushed from 2k to 8k tokens.
And with, as always :
- unlocked context size
- custom rope settings
- no KCPP fragmentation cache
The Cublas version is compiled with Cublas 12.3. And it's real fast on Ampere.
All credits go to LostRuins, who tirelessly develops KoboldCPP, to the other devs who brought features to KCPP, and to the devs of LlamaCPP.
For more information on the features of KoboldCPP 1.58, see: https://github.com/LostRuins/koboldcpp/releases/tag/v1.58
The Frankenstein versions of KoboldCPP released here are not supported by LostRuins, nor is the unlocked context size; it is provided here for testing and amusement only.
Full Changelog: 1.58_b2131_IQ1_S_v3...v1.58.b2167_IQ1_S_V3
b2167
cmake : fix VULKAN and ROCm builds (#5525)
* cmake : fix VULKAN and ROCm builds
* cmake : fix (cont)
* vulkan : fix compile warnings ggml-ci
* cmake : fix ggml-ci
* cmake : minor ggml-ci
Kobold.CPP_Frankenstein_v1.57.1+1_b2116_SOTA_IQ1_S_V1 (obsolete)
Kobold.CPP Frankenstein v1.57.1's source and .exe for Windows built with Openblas/Clblast/Vulkan (small .exe), and the same + Cublas (big .exe) :
- based on LlamaCPP b2116
- experimental KCPP commits up to 10/02/2024, 23h GMT+1, commit f9bc724
- so, with the Quadratic Sampling UI commit by Alexandar Abushady, on top of Kalomaze's Quadratic Sampling feature.
- and still SOTA 2 bpw (IQ2_XS & XXS) and 3 bpw (IQ3_XXS) GGUF models working as in LlamaCPP b2116.
- With LostRuins's work on the CUDA speed regression sorted out for my needs, so for AMPERE CARDS (untested on others), which leads here, ON MY CONFIG (3090+3060), to +25% token generation speed and equivalent prompt processing speed compared to my "already fast" release 1.57b2030 (non-quadratic, and only the .exe because the source is messed up), which is now deprecated, as are all previous versions.
Thread about CUDA speed regressions (and optimizations :) at LostRuins#642 (comment)
- BONUS: SOTA 1.65-1.7 bpw (IQ1_S) compatibility thanks to Ikawrakow's work on a LlamaCPP demo feature! More info: ggerganov#5453
Anyway, 2 models quantized to 1.5 bpw by yours truly to play with the new feature:
- Miqu 70b Requant iMat : https://huggingface.co/Nexesenex/MIstral-QUantized-70b_Miqu-1-70b-iMat.GGUF
- Kyllene 34b iMat : https://huggingface.co/Nexesenex/TeeZee_Kyllene-Yi-34B-v1.1-iMat.GGUF
Note: perplexity jumps badly with IQ1_S (+50% minimum vs fp16, and far more for models < 70b); only 70b models might be usable to a small extent, and 33/34b models to an even lesser extent. After testing, any decent 7b will do better than a 70b model at 1.5 bpw like this Miqu, but there is still some basic coherence left, which is very impressive for such a low quant. As for Kyllene, she's really a mess at such a quant lol, even if some basic hallucinatory coherence subsists on a 34b model. At least you can clearly see how the effect of prompt formatting differs from model to model.
I'm waiting for some 1.85-1.90 bpw quants, which might be more usable for us than such a SOTA tech showcase, at least on 70b models and maybe on 34b as well.
Also with (untested) :
- Early Vulkan support (improving steadily compared to previous versions).
- Benchmark feature pushed to 8k tokens.
And with, as always :
- unlocked context size
- custom rope settings
- no KCPP fragmentation cache
The Cublas version is compiled with Cublas 12.3. And it's real fast on Ampere.
All credits go to LostRuins, who tirelessly develops KoboldCPP, to the other devs who brought features to KCPP, and to the devs of LlamaCPP.
For more information on the features of KoboldCPP 1.57.1, see: https://github.com/LostRuins/koboldcpp/releases/tag/v1.57.1
The Frankenstein versions of KoboldCPP released here are not supported by LostRuins, nor is the unlocked context size; it is provided here for testing and amusement only.
What's Changed
- Ik/iq1 s by @Nexesenex in #84
Full Changelog: v1.57.1_b2106...v1.57.1_b2116+1_SOTA_IQ1_S