
Releases: Nexesenex/croco.cpp

Kobold.CPP_Frankenstein_v1.62.1a_b2637_fastMOE

09 Apr 21:28
b1be1c4

Update since yesterday, and since LostRuins' official KCPP 1.62.1.

  • MoE boost PR by Slaren included (6 commits); a routing sketch follows below.

That's it.
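For context, here's what MoE routing looks like in miniature: a generic top-k gating sketch in Python/NumPy. This is illustrative only, not Slaren's actual change (which optimizes the backend-side expert matmuls); all names here are made up:

```python
import numpy as np

def moe_forward(x, gate_w, experts, top_k=2):
    """Generic top-k MoE routing sketch (illustrative, not the PR's code)."""
    logits = gate_w @ x                  # router score per expert
    top = np.argsort(logits)[-top_k:]    # keep only the k best experts
    weights = np.exp(logits[top])
    weights /= weights.sum()             # softmax over the selected experts
    # Only the chosen experts run; batching their matmuls efficiently
    # is where MoE inference speedups typically come from.
    return sum(w * experts[i](x) for w, i in zip(weights, top))

# Tiny usage example with random weights
rng = np.random.default_rng(0)
d, n_experts = 8, 4
gate_w = rng.normal(size=(n_experts, d))
experts = [lambda v, W=rng.normal(size=(d, d)): W @ v for _ in range(n_experts)]
print(moe_forward(rng.normal(size=d), gate_w, experts))
```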

What's Changed

Full Changelog: v1.62b_b2628...v1.62.1a_b2637

Kobold.CPP_Frankenstein_v1.62b_b2628_IQ1M_fastMOE

08 Apr 20:32

_Long time no see!

LostRuins has been back at work, and major updates have landed in LlamaCPP over the last few weeks.
Before our benefactor publishes his next official KoboldCPP release, here's my leechy one!_

Kobold.CPP Frankenstein v1.62's beta source and .exe for Windows, built with OpenBLAS/CLBlast/Vulkan (small .exe), and the same + CuBLAS (big .exe):

  • based on Georgi Gerganov's LlamaCPP b2628 & LostRuins' KoboldCPP Experimental version 1.62 beta.
  • experimental KCPP commits up to 08/04/2024, 20:00 GMT+1.
  • With SOTA 1.5 bpw (IQ1_S, IQ1_M), 2 bpw (IQ2_XXS, XS, S, M), 3 bpw (IQ3_XXS, IQ3_XS, IQ3_S, IQ3_M), and 4 bpw (IQ4_XS) GGUF models working as in LlamaCPP b2628 (rough on-disk sizes sketched after this list).
  • With the SOTA IQ4_NL quant (for non-standard models with odd tensor shapes) working as in LlamaCPP b2628.
  • With Google Gemma compatibility as in LlamaCPP b2628.
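As a rough feel for what those bpw figures mean on disk, here's a back-of-the-envelope estimate in Python. It's a sketch: it ignores GGUF metadata and the tensors that quant mixes keep at higher precision, so real files run somewhat larger.

```python
def approx_gguf_size_gb(n_params_billion: float, bpw: float) -> float:
    """Rough GGUF file size: parameters * bits-per-weight / 8 bits-per-byte."""
    return n_params_billion * bpw / 8

# Size tiers for a 70B model, at roughly the quant levels listed above
for bpw in (1.5, 2.0, 3.0, 4.25):
    print(f"70B at ~{bpw} bpw: ~{approx_gguf_size_gb(70, bpw):.1f} GB")
```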

Also with (untested):

  • Vulkan support implemented by the devs (constantly improving version after version).
  • MoE speed bump by Slaren (PR ggerganov#6505)

And with, as always:

  • unlocked context size (now standard in KCPP)
  • custom rope settings (a launch sketch follows this list)
  • no KCPP fragmentation cache
  • CUDA speed: LostRuins seems to have sorted it out well, and I didn't mess with anything this time: it works as intended, both with and without MMQ.
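A minimal launch sketch for the unlocked context and custom rope settings listed above, using Python's subprocess. The --contextsize and --ropeconfig flags are the ones recent KCPP builds expose, but check --help on your build; the model path and values are placeholders.

```python
import subprocess

# Placeholder paths and values; adjust to your build and model.
cmd = [
    "koboldcpp.exe",                 # the big CuBLAS .exe from this release
    "--model", "model-IQ3_S.gguf",   # hypothetical GGUF file
    "--contextsize", "16384",        # extended context size
    "--ropeconfig", "0.5", "10000",  # rope scale and rope base (see --help)
    "--usecublas",                   # NVIDIA GPU acceleration
]
subprocess.run(cmd, check=True)
```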

The CuBLAS version is compiled with CUDA 12.4.

All credits go to LostRuins, who tirelessly develops KoboldCPP, to the other devs who brought features to KCPP, and to the devs of LlamaCPP.

For more information on the features of KoboldCPP 1.61.2, see: https://github.com/LostRuins/koboldcpp/releases/tag/v1.61.2

The Frankenstein versions of KoboldCPP released here are not supported by LostRuins, nor is the unlocked context size exposed here and there on the command line: this is for testing and amusement only.

What's Changed

Full Changelog: v1.59d_b2254...v1.62b_b2628

Kobold.CPP_Frankenstein_v1.60b_1.62_20240316

14 Apr 03:00

For https://github.com/brokofankone and others who have CUDA speed troubles with the April releases of KCPP, official and Frankenstein alike.

Here's an internal Frankenstein release of mine from 16 March 2024, to test and possibly replace your February releases while retaining good performance.

The *of.exe versions (from the same date in my backups) are probably the official versions they are based on / competitive with.

I rebuilt my dual-GPU setup, and my version from 16 March 2024 gives me good performance.

Kobold.CPP_Frankenstein_v1.59d_b2254_4x3bits_SOTA

25 Feb 02:14

Kobold.CPP Frankenstein v1.59's source and .exe for Windows, built with OpenBLAS/CLBlast/Vulkan (small .exe), and the same + CuBLAS (big .exe):

  • based on LlamaCPP b2254 & LostRuins' KoboldCPP Experimental version 1.59 beta (my fourth internal compilation, hence 1.59d)

  • experimental KCPP commits up to 24/02/2024, 15:00 GMT+1

  • With SOTA 1.5 bpw (IQ1_S), 2 bpw (IQ2_XS & XXS) and 3 bpw (IQ3_XXS, Q3_K_XS_v2, IQ3_S, IQ3_M) GGUF models working as in LlamaCPP b2254.

  • With IQ4_NL quants working as in LlamaCPP b2254.

  • With Google Gemma compatibility as in LlamaCPP b2254.

  • With LostRuins's work on the CUDA speed regression sorted out for my needs, i.e. for AMPERE CARDS on Windows 10/11 (untested on other cards and OSes), with the fastest PP & TG speeds available.

Note about CUDA speed: thread on CUDA speed regressions (and optimizations :) at LostRuins#642

Also with (untested):

  • Vulkan support implemented by the devs (constantly improving version after version).

And with, as always:

  • unlocked context size
  • custom rope settings (a rope-scaling sketch follows this list)
  • no KCPP fragmentation cache
  • benchmark feature pushed from 2k to 8k tokens.
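For the custom rope settings, a common community rule of thumb is NTK-aware scaling of the rope frequency base. The formula below is my illustration, not something this release prescribes, and the default head_dim and base values are assumptions:

```python
def ntk_rope_base(orig_base: float, orig_ctx: int, target_ctx: int,
                  head_dim: int = 128) -> float:
    """NTK-aware rope base: base' = base * s**(d / (d - 2)), s = context stretch."""
    s = target_ctx / orig_ctx
    return orig_base * s ** (head_dim / (head_dim - 2))

# e.g. stretching a 4k-trained model to 16k of context:
print(ntk_rope_base(10000.0, 4096, 16384))  # ~40900
```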

The CuBLAS version is compiled with CUDA 12.3, and it's really fast on Ampere.

All credits go to LostRuins, who tirelessly develops KoboldCPP, to the other devs who brought features to KCPP, and to the devs of LlamaCPP.

For more information on the features of KoboldCPP 1.58, see: https://github.com/LostRuins/koboldcpp/releases/tag/v1.58

The Frankenstein versions of KoboldCPP released here are not supported by LostRuins, nor is the unlocked context size; they are provided here for testing and amusement only.

What's Changed

Full Changelog: v1.59c_b2249...v1.59d_b2254

b2252

25 Feb 00:24
525213d
server: init functional tests (#5566)

* server: tests: init scenarios
 - health and slots endpoints
 - completion endpoint
 - OAI compatible chat completion requests w/ and without streaming
 - completion multi users scenario
 - multi users scenario on OAI compatible endpoint with streaming
 - multi users with total number of tokens to predict exceeds the KV Cache size
 - server wrong usage scenario, like in Infinite loop of "context shift" #3969
 - slots shifting
 - continuous batching
 - embeddings endpoint
 - multi users embedding endpoint: Segmentation fault #5655
 - OpenAI-compatible embeddings API
 - tokenize endpoint
 - CORS and api key scenario

* server: CI GitHub workflow


---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

Kobold.CPP_Frankenstein_v1.59c_b2249_Gemma

23 Feb 03:21

Kobold.CPP Frankenstein v1.59's source and .exe for Windows, built with OpenBLAS/CLBlast/Vulkan (small .exe), and the same + CuBLAS (big .exe):

  • based on LlamaCPP b2241 + 8 commits ("b2249") & LostRuins' KoboldCPP Experimental version 1.59 beta (my third internal compilation, hence 1.59c)

  • experimental KCPP commits up to 22/02/2024, 22:00 GMT+1

  • With SOTA 1.5 bpw (IQ1_S), 2 bpw (IQ2_XS & XXS) and 3 bpw (IQ3_XXS) GGUF models working as in LlamaCPP "b2249".

  • With IQ4_NL quants working as in LlamaCPP "b2249".

  • With Google Gemma compatibility as in LlamaCPP "b2249".

  • With LostRuins's work on the CUDA speed regression sorted out for my needs, i.e. for AMPERE CARDS on Windows 10/11 (untested on other cards and OSes), with the fastest PP & TG speeds available.

Note about CUDA speed: thread on CUDA speed regressions (and optimizations :) at LostRuins#642 (comment)

Also with (untested):

  • Vulkan support implemented by the devs (constantly improving version after version).

And with, as always:

  • unlocked context size
  • custom rope settings
  • no KCPP fragmentation cache
  • benchmark feature pushed from 2k to 8k tokens.

The CuBLAS version is compiled with CUDA 12.3, and it's really fast on Ampere.

All credits go to LostRuins, who tirelessly develops KoboldCPP, to the other devs who brought features to KCPP, and to the devs of LlamaCPP.

For more information on the features of KoboldCPP 1.58, see: https://github.com/LostRuins/koboldcpp/releases/tag/v1.58

The Frankenstein versions of KoboldCPP released here are not supported by LostRuins, nor is the unlocked context size; they are provided here for testing and amusement only.

What's Changed

Full Changelog: v1.58.b2167_IQ1_S_V3...v1.59c_b2249

b2234

23 Feb 02:51
973053d
llama : fix loading models with shared tok_embd and output (#5651)

ggml-ci

Kobold.CPP_Frankenstein_v1.58_b2167_IQ1_S_V3

17 Feb 19:27

Kobold.CPP Frankenstein v1.58's source and .exe for Windows, built with OpenBLAS/CLBlast/Vulkan (small .exe), and the same + CuBLAS (big .exe):

  • based on LlamaCPP b2167 & KoboldCPP Release version 1.58

  • experimental KCPP commits up to 16/02/2024, 24:00 GMT+1

  • With SOTA 2 bpw (IQ2_XS & XXS) and 3 bpw (IQ3_XXS) GGUF models working as in LlamaCPP b2167.

    • BONUS: SOTA 1.65-1.7 bpw (IQ1_S, already in its third version) compatibility thanks to Ikawrakow's work on a LlamaCPP demo feature! More info: ggerganov#5453
  • With LostRuins's work on the CUDA speed regression sorted out for my needs, i.e. for AMPERE CARDS on Windows 10/11 (untested on other cards and OSes), with the fastest PP & TG speeds available.

Note about CUDA speed: thread on CUDA speed regressions (and optimizations :) at LostRuins#642 (comment)

Note about IQ1_S: perplexity jumps badly with IQ1_S (+50% minimum vs fp16, and far more for models < 70b); only the 70b models might be usable to a small extent, and the 33/34b to an even lesser extent.

  • After testing, any decent 7b will do better than a 70b model at 1.5 bpw like this Miqu; nevertheless, some basic coherence remains, which is very impressive for such a low quant.
  • As for Kyllene 34b, she's really a mess at such a quant lol, even if some basic hallucinatory coherence persists on such a 34b model. At least you can clearly see how prompt formatting affects each model. IQ1_S V2 & V3 improve this considerably, without changing my overall verdict.

I'm "praying" for some 1.8-1.9bpw quants, which might be more usable for us than such a SOTA tech showcase, at least on 70b models and maybe on 34b as well..

Also with (untested):

  • Vulkan support implemented by the devs (constantly improving version after version).
  • Benchmark feature pushed from 2k to 8k tokens.

And with, as always:

  • unlocked context size
  • custom rope settings
  • no KCPP fragmentation cache

The CuBLAS version is compiled with CUDA 12.3, and it's really fast on Ampere.

All credits go to LostRuins, who tirelessly develops KoboldCPP, to the other devs who brought features to KCPP, and to the devs of LlamaCPP.

For more information on the features of KoboldCPP 1.58, see: https://github.com/LostRuins/koboldcpp/releases/tag/v1.58

The Frankenstein versions of KoboldCPP released here are not supported by LostRuins, nor is the unlocked context size; they are provided here for testing and amusement only.

Full Changelog: 1.58_b2131_IQ1_S_v3...v1.58.b2167_IQ1_S_V3

b2167

17 Feb 18:39
5bf2b94
cmake : fix VULKAN and ROCm builds (#5525)

* cmake : fix VULKAN and ROCm builds

* cmake : fix (cont)

* vulkan : fix compile warnings

ggml-ci

* cmake : fix

ggml-ci

* cmake : minor

ggml-ci

Kobold.CPP_Frankenstein_v1.57.1+1_b2116_SOTA_IQ1_S_V1 (obsolete)

12 Feb 03:43

Kobold.CPP Frankenstein v1.57.1's source and .exe for Windows, built with OpenBLAS/CLBlast/Vulkan (small .exe), and the same + CuBLAS (big .exe):

  • based on LlamaCPP b2116

  • experimental KCPP commits up to 10/02/2024, 23:00 GMT+1, commit f9bc724

  • so, with the Quadratic Sampling UI commit by Alexandar Abushady, on top of Kalomaze's Quadratic Sampling feature.

  • and still SOTA 2 bpw (IQ2_XS & XXS) and 3 bpw (IQ3_XXS) GGUF models working as in LlamaCPP b2116.

  • With LostRuins's work on the CUDA speed regression sorted out for my needs, i.e. for AMPERE CARDS (untested on others), which leads here, ON MY CONFIG (3090+3060), to +25% token generation speed and equivalent prompt processing speed compared to my "already fast" release 1.57b2030 (non-quadratic, and .exe only, because the source is messed up), which is now deprecated, as are all previous versions.

Thread about CUDA speed regressions (and optimizations :) at LostRuins#642 (comment)

  • BONUS: SOTA 1.65-1.7 bpw (IQ1_S) compatibility thanks to Ikawrakow's work on a LlamaCPP demo feature! More info: ggerganov#5453

Anyway, two models quantized in 1.5 bits by yours truly to play with the new feature.

Note: perplexity jumps badly with IQ1_S (+50% minimum vs fp16, and far more for models < 70b); only the 70b models might be usable to a small extent, and the 33/34b to an even lesser extent. After testing, any decent 7b will do better than a 70b model at 1.5 bpw like this Miqu; nevertheless, some basic coherence remains, which is very impressive for such a low quant. As for Kyllene, she's really a mess at such a quant lol, even if some basic hallucinatory coherence persists on such a 34b model. At least you can clearly see how prompt formatting affects each model.

I'm waiting for some 1.85-1.90 bpw quants, which might be more usable for us than such a SOTA tech showcase, at least on 70b models and maybe on 34b as well.

Also with (untested):

  • Early Vulkan support (improving more and more with each version).
  • Benchmark feature pushed to 8k tokens.

And with, as always:

  • unlocked context size
  • custom rope settings
  • no KCPP fragmentation cache

The CuBLAS version is compiled with CUDA 12.3, and it's really fast on Ampere.

All credits go to LostRuins, who tirelessly develops KoboldCPP, to the other devs who brought features to KCPP, and to the devs of LlamaCPP.

For more information on the features of KoboldCPP 1.57.1, see: https://github.com/LostRuins/koboldcpp/releases/tag/v1.57.1

The Frankenstein versions of KoboldCPP released here are not supported by LostRuins, nor is the unlocked context size; they are provided here for testing and amusement only.

What's Changed

Full Changelog: v1.57.1_b2106...v1.57.1_b2116+1_SOTA_IQ1_S