
Releases: Nexesenex/croco.cpp

Kobold.CPP_FrankenFork_v1.70102_b3398+4

15 Jul 19:14

Frankenstein 1.70102 "Fork" of the official KoboldCPP experimental branch as of the 15/07/2024, 20h GMT+2, with KLite 1.56 as of the same date.
Based on Llama.CPP b3398, plus 4 pertinent LCPP commits/PRs.
I'm only attentive to the CUDA side of things. The rest might work, or not.

DISCLAIMER:

The KoboldCPP-Frankenstein builds are not supported by the KoboldCPP team, GitHub repository, or Discord channel. They are for eager testing and amusement only.
My KCPP-Frankenstein version number bumps as soon as the version number in the official experimental branch bumps, in the following way: x.xxx (KCPP) followed by two more digits (KCPP-F).
They are not "upgrades" over the official version, and they might be buggy at times: only the official KCPP releases should be considered correctly numbered, reliable and "fixed".
The LlamaCPP version plus the additional PRs integrated follow my KCPP-Frankenstein versioning in the title, so everybody knows which version they are dealing with.

The official KCPP releases are here: https://github.com/LostRuins/koboldcpp/releases

FRANKENSTEIN TRICKS:

  • Vast choice of context sizes in the GUI slider, in steps of 512 up to 262k context, as well as BlasBatchSize from 1 to 4096.
  • Enhanced benchmark (reflecting a maximum of indicators, including the KV cache option), now integrated in a slightly revamped form in the official version, aligned on LCPP and fixed in its calculations.
  • A better Autorope, thanks to askmyteapot's PR on the official KoboldCPP GitHub, with an additional negative offset for Llama models to slightly lower the L1/L2 rope, and a positive offset for SOLAR models, in order to improve perplexity (L1, L2, SOLAR) or avoid degrading the reasoning abilities too much (L3, not implemented yet) at equal context (see the sketch after this list).
  • More chat adapters, on top of those provided in the official version.
  • A slight reduction of the pipeline parallelization, from 4 to 2: 0.5-1% VRAM saved, and less stress on the graphics cards.
  • Full chat-window width for the text when you zoom out in the Corpo theme.
  • 8 chat saving slots instead of 6.
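For the curious, here is a minimal illustrative sketch (Python, not the fork's actual code) of the generic NTK-aware rule on which automatic rope-base adjustment is built; the autorope from askmyteapot's PR plus the per-family offsets described above are more elaborate, and the helper name here is mine.

```python
# Illustrative sketch only: the generic NTK-aware rule for raising the RoPE base
# frequency when a model is run past its trained context. Not the fork's code.

def ntk_aware_rope_base(trained_ctx: int, target_ctx: int,
                        head_dim: int = 128, base: float = 10000.0) -> float:
    """Scale the RoPE base so the rotary frequencies stretch to cover target_ctx."""
    if target_ctx <= trained_ctx:
        return base
    scale = target_ctx / trained_ctx
    return base * scale ** (head_dim / (head_dim - 2))

# Example: a 4k-trained Llama-2 style model pushed to 16k context.
print(round(ntk_aware_rope_base(4096, 16384)))  # ~40890 under these assumptions
```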
The 26 KV cache options (all should be considered experimental except F16, KV Q8_0, and KV Q4_0):

With Flash Attention :

  • F16 -> Foolproof (the usual KV quant since the beginning of LCPP/KCPP)
  • K F16 with : V Q8_0, Q5_1, Q5_0, Q4_1, Q4_0
  • K Q8_0 with : V F16, Q8_0 (stable, my current main, part of the LCPP/KCPP main triplet), Q5_1 (maybe unstable), Q5_0 (maybe unstable), Q4_1 (maybe stable), Q4_0 (maybe stable); the rest is untested beyond benches
  • K Q5_1 with : V Q5_1, Q5_0, Q4_1, Q4_0
  • K Q5_0 with : V Q5_0, Q4_1, Q4_0
  • K Q4_1 with : V Q4_1 (stable), Q4_0 (maybe stable)
  • KV Q4_0 (quite stable, if we consider that it's part of the LCPP/KCPP main triplet)
    Works in the command line, normally also via the GUI, and normally saves to .KCPPS config files.

Without Flash Attention or MMQ (for models like Gemma) :

  • V F16 with K Q8_0, Q5_1, Q5_0, Q4_1, and Q4_0.

FRANKENSTEIN integrates looted PRs :

ARGUMENTS 👍(to be edited, check them in CLI or use the GUI)

Note : I had to use a simple 0-26 numbering scheme to allow the GUI and the kcpps preset saving to work properly with KVQ26. The problems with the previous four-digit quant scheme are fixed.

--quantkv",
help="Sets the KV cache data type quantization.

The options to set KV quants :

0 = 1616/F16 (16 BPW), - present on KV3modes releases as well

1 = 1680/Kf16-Vq8_0 (12.25BPW),
2 = 1651/Kf16-Vq5_1 (11BPW),
3 = 1650/Kf16-Vq5_0 (10.75BPW),
4 = 1641/Kf16-Vq4_1 (10.5BPW),
5 = 1640/Kf16-Vq4_0 (10.25BPW),

6 = 8080/KVq8_0 (8.5 BPW), - present on KV3modes releases as well
7 = 8051/Kq8_0-Vq5_1 (7.25BPW),
8 = 8050/Kq8_0-Vq5_0 (7BPW),
9 = 8041/Kq8_0-Vq4_1 (6.75BPW),
10 = 8040/Kq8_0-Vq4_0 (6.5BPW),
11 = 5151/KVq5_1 (6BPW),
12 = 5150/Kq5_1-Vq5_0 (5.75BPW),
13 = 5141/Kq5_1-Vq4_1 (5.5BPW),
14 = 5140/Kq5_1-Vq4_0 (5.25BPW),
15 = 5050/Kq5_0-Vq5_0 (5.5BPW),
16 = 5041/Kq5_0-Vq4_1 (5.25BPW),
17 = 5040/Kq5_0-Vq4_0 (5BPW),
18 = 4141/Kq4_1-Vq4_1 (5BPW),
19 = 4140/Kq4_1-Vq4_0 (4.75BPW),
20 = 4040/KVq4_0 (4.5BPW) - present on KV3modes releases as well

21 = 1616/F16 (16 BPW), - present on KV3modes releases as well (same as 0, I just used it for the GUI slider).

22 = 8016/Kq8_0-Vf16 (12.25BPW), FA and no-FA both

23 = 5116/Kq5_1-Vf16 (11BPW), no-FA
24 = 5016/Kq5_0-Vf16 (10.75BPW), no-FA
25 = 4116/Kq4_1-Vf16 (10.50BPW), no-FA
26 = 4016/Kq4_0-Vf16 (10.25BPW), no-FA

choices=[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26], default=0)
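As a cross-check of the figures above, here is a small illustrative sketch (not the fork's code; the bit widths are the standard llama.cpp block sizes, and the helper name is mine) showing how the four-digit codes decode and where the BPW values come from:

```python
# Illustrative only: decode the four-digit mode codes above ("1680" = K f16, V q8_0)
# and recompute the BPW figures quoted in the list (average of the K and V halves).
BITS_PER_ELEMENT = {  # standard llama.cpp quant block sizes, in bits per element
    "f16": 16.0, "q8_0": 8.5, "q5_1": 6.0, "q5_0": 5.5, "q4_1": 5.0, "q4_0": 4.5,
}

def kv_bpw(k_type: str, v_type: str) -> float:
    """Average bits per cached element across the K and V caches."""
    return (BITS_PER_ELEMENT[k_type] + BITS_PER_ELEMENT[v_type]) / 2

print(kv_bpw("f16", "q8_0"))   # 12.25 -> mode 1  ("1680")
print(kv_bpw("q8_0", "q8_0"))  # 8.5   -> mode 6  ("8080")
print(kv_bpw("q4_0", "q4_0"))  # 4.5   -> mode 20 ("4040")
```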

Note : the Lowvram option's speed is (logically) boosted by the smaller KV cache in RAM: from 25%+ in KV Q8_0 to 50%+ in KV Q4_0.
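To put numbers on that, a rough back-of-the-envelope sketch (assumed Llama-2-7B-like shapes: 32 layers, 32 KV heads, head dim 128; not measured on this build):

```python
# Rough KV cache size estimate, to illustrate why a smaller cache helps --lowvram.
def kv_cache_gib(n_ctx: int, n_layer: int, n_kv_heads: int, head_dim: int, bpw: float) -> float:
    elems_per_token = 2 * n_layer * n_kv_heads * head_dim  # K and V halves
    return n_ctx * elems_per_token * bpw / 8 / 1024**3

print(kv_cache_gib(4096, 32, 32, 128, 16.0))  # 2.0 GiB at F16
print(kv_cache_gib(4096, 32, 32, 128, 4.5))   # ~0.56 GiB at KV Q4_0
```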

Note : context shift doesn't seem to work with a quantized K cache without FA either, but Smartcontext does!

REMARKS :

You MUST use Flash Attention for anything other than QuantKV=0 (F16), apart from the no-FA modes listed above
(flag : --flashattention in the CLI, or in the GUI)

Contextshift doesn't work with anything other than KV F16, but Smartcontext does.

BlasBatchSize 512 is still optimal and 256 still the best compromise, but 128 is a savvy compromise and is now used by default.
64 is perfectly usable and optimal for VRAM-limited scenarios. 32 and 16 also work, but slower. 8 (MMVQ) is worth 16 in Cublas mode; 4, 2 and 1 are MMVQ as well, but very slow as you can imagine.
In MMQ, BBS 128 is as fast as Cublas at 512, and this kind of delta also applies to every value below. It will become the default very soon.
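As an illustration of how these options combine (flag names as exposed by this fork's CLI; the model path and values are placeholders): something like python koboldcpp.py --model your-model.gguf --usecublas mmq --flashattention --quantkv 6 --blasbatchsize 128 --contextsize 8192 would run MMQ with Flash Attention, a KV Q8_0 cache, and BBS 128.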

CREDITS :
Of course, all credits go to Concedo/LostRuins and the other contributors to KoboldCPP, and to Georgi Gerganov and all the other contributors to LlamaCPP. Special big-up to Johannes Gaessler for the quantized KV cache!
I'm just poking, merging, and building around their work.

ARCHS and BUILDS

Archs :
# 37 == CUDA 11 standard for Kepler
# 52 == lowest CUDA 12 standard, for Maxwell
# 60 == f16 CUDA intrinsics
# 61 == integer CUDA intrinsics
# 70 == (assumed) compute capability at which unrolling a loop in mul_mat_q kernels is faster
# 75 == int8 tensor cores

Builds :

  • Cublas 12.2 Win (arch 60 61 70 75) : works on Ada Lovelace, Ampere and Turing. It can also work with Pascal or more recent, and has CUDA F16 activated at compile time.
  • Cublas 12.2 Win (arch 52 61 70 75) and 12.1 Linux : works for Maxwell v2 up to Ada, uses integer CUDA intrinsics.
  • Cublas 11.4.4 Win / 11.5 Linux ("KepMax", arch 35 37 50 52 for Windows, 37 52 61 70 75 for Linux) : needed for Kepler, and should also work on Maxwell (experimental; tell me whether that combo works or not, and if not, I'll go back to 37 52 61 70 75 for Cuda 11).
  • The standard build includes only the OpenBLAS, CLBLAST, and Vulkan support provided by the devs.

Full Changelog: v1.70001_b3389+3...v1.70102_b3398+4

Kobold.CPP_FrankenFork_v1.68t_b3230-3+2

26 Jun 04:29

Frankenstein 1.68t "Fork" of KoboldCPP Experimental up to the 25/06/2024, 23h GMT+0.
Based on Llama.CPP b3230, and aimed mainly at Ampere and Ada GPU users.

Rebased on internal version 1.68o due to conflicts and problems arising from some recent refactors in LCPP.
In its Cuda 12 version, probably the fastest series of KCPP-F ever released in terms of prompt processing.

DISCLAIMER:

The KoboldCPP-Frankenstein builds are not supported by the KoboldCPP team, GitHub repository, or Discord channel. They are for eager testing and amusement only.
My KCPP-Frankenstein version number bumps as soon as the version number in the official experimental branch bumps. They are not "upgrades" over the official version, and they might be buggy at times: only the official KCPP releases should be considered correctly numbered, reliable and "fixed".
The LlamaCPP version plus the additional PRs integrated follow my KCPP-Frankenstein versioning in the title, so everybody knows which version they are dealing with.

The official KCPP releases are here: https://github.com/LostRuins/koboldcpp/releases

FRANKENSTEIN FEATURES:

  • Enhanced benchmark (reflecting a maximum of indicators, including the KV cache option), now integrated in a slightly revamped form in the official version.

  • 21 KV cache options (all should be considered experimental except F16 and KV Q8_0) 👍

F16 -> Foolproof (the usual KV quant since the beginning of LCPP/KCPP)
K F16 with : V Q8_0, Q5_1, Q5_0, Q4_1, Q4_0
K Q8_0 with : V Q8_0 (stable, my current main, part of the LCPP/KCPP main triplet), Q5_1 (maybe unstable), Q5_0 (maybe unstable), Q4_1 (maybe stable), Q4_0 (maybe stable); the rest is untested beyond benches
K Q5_1 with : V Q5_1, Q5_0, Q4_1, Q4_0
K Q5_0 with : V Q5_0, Q4_1, Q4_0
K Q4_1 with : V Q4_1 (stable), Q4_0 (maybe stable)
KV Q4_0 (quite stable, if we consider that it's part of the LCPP/KCPP main triplet)
Works in the command line, normally also via the GUI, and normally saves to .KCPPS config files.

  • A better Autorope, thanks to askmyteapot's PR on the official KoboldCPP GitHub, with an additional negative offset for Llama models to slightly lower the L1/L2 rope, and a positive offset for SOLAR models, in order to improve perplexity (L1, L2, SOLAR) or avoid degrading the reasoning abilities too much (L3, not implemented yet) at equal context.

  • Faster PP, in both Cublas and MMQ, thanks to Johannes Gaessler's work on these kernels, and compilation with CUDA arch 75 on top of the already present 60, 61, 70 combo. BlasBatchSize 512 is still optimal, but 256 is not far from it and is now used by default. 64 is perfectly usable and optimal for VRAM-limited scenarios.

  • A slight reduction of the pipeline parallelization, from 4 to 2: 0.5-1% VRAM saved, and less stress on the graphics cards.

  • Bitnet PR integrated; works in CPU mode (only recent, properly converted Bitnet models work, older ones do not).
    Example : https://huggingface.co/BoscoTheDog/bitnet_b1_58-xl_q8_0_gguf/tree/main
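For instance (illustrative, not an official recipe): grab the q8_0 GGUF from that repository and launch it without any --usecublas/GPU offload flags, e.g. python koboldcpp.py --model <downloaded bitnet .gguf> --threads 8, to stay in CPU mode.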

ARGUMENTS 👍(to be edited, check them in CLI or use the GUI)

Note : I had to use a simple 0-20 numbering scheme to allow the GUI and the kcpps preset saving to work properly. The problems with the previous four-digit quant scheme are fixed.

--quantkv",
help="Sets the KV cache data type quantization.

0 = 1616/F16 (16 BPW),

1 = 1680/Kf16-Vq8_0 (12.25BPW),
2 = 1651/Kf16-Vq5_1 (11BPW),
3 = 1650/Kf16-Vq5_0 (10.75BPW),
4 = 1641/Kf16-Vq4_1 (10.5BPW),
5 = 1640/Kf16-Vq4_0 (10.25BPW),

6 = 8080/KVq8_0 (8.5 BPW),
7 = 8051/Kq8_0-Vq5_1 (7.25BPW),
8 = 8050/Kq8_0-Vq5_0 (7BPW),
9 = 8041/Kq8_0-Vq4_1 (6.75BPW),
10 = 8040/Kq8_0-Vq4_0 (6.5BPW),

11 = 5151/KVq5_1 (6BPW),
12 = 5150/Kq5_1-Vq5_0 (5.75BPW),
13 = 5141/Kq5_1-Vq4_1 (5.5BPW),
14 = 5140/Kq5_1-Vq4_0 (5.25BPW),

15 = 5050/Kq5_0-Vq5_0 (5.5BPW),
16 = 5041/Kq5_0-Vq4_1 (5.25BPW),
17 = 5040/Kq5_0-Vq4_0 (5BPW),

18 = 4141/Kq4_1-Vq4_1 (5BPW),
19 = 4140/Kq4_1-Vq4_0 (4.75BPW),
20 = 4040/KVq4_0 (4.5BPW)

choices=[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20], default=0)

Lowvram option's speed is (logically) boosted by the smaller KV cache in RAM: from 25%+ in KV Q8_0 to 50%+ in KV Q4_0.

REMARKS :
You MUST use Flash Attention for anything other than QuantKV=0 (F16)
(flag : --flashattention in the CLI, or in the GUI)
Contextshift doesn't work with anything other than KV F16, but Smartcontext does.

CREDITS :
Of course, all credits go to Concedo/LostRuins and the other contributors to KoboldCPP, and to Georgi Gerganov and all the other contributors to LlamaCPP. Special big-up to Johannes Gaessler for the quantized KV cache!
I'm just poking, merging, and building around their work.

BUILDS :
All builds, i.e. Cublas 12.1/12.2 (recommended for Ada Lovelace and Ampere for additional PP speed), Cublas 11.4.4/11.7 (more pertinent than 12.2 for Pascal and Maxwell, and 11.4.4 for Kepler), and the standard one, include the OpenBLAS, CLBLAST, and Vulkan support provided by the devs.


Full Changelog: v1.68p_b3203+2...v1.68t_b3230-3+2

Kobold.CPP_FrankenFork_v1.68s_b3207+3

24 Jun 18:30

Frankenstein 1.68s "Fork" of KoboldCPP Experimental up to the 24/06/2024, 18h GMT+2.
Based on Llama.CPP b3207, and aimed mainly at Ampere and Ada GPU users.

Based on milestone version 1.68r : in its Cuda version, probably the fastest KCPP-F ever released in terms of prompt processing.

DISCLAIMER:

The KoboldCPP-Frankenstein builds are not supported by the KoboldCPP team, GitHub repository, or Discord channel. They are for eager testing and amusement only.
My KCPP-Frankenstein version number bumps as soon as the version number in the official experimental branch bumps. They are not "upgrades" over the official version, and they might be buggy at times: only the official KCPP releases should be considered correctly numbered, reliable and "fixed".
The LlamaCPP version plus the additional PRs integrated follow my KCPP-Frankenstein versioning in the title, so everybody knows which version they are dealing with.

The official KCPP releases are here: https://github.com/LostRuins/koboldcpp/releases

FRANKENSTEIN FEATURES:

  • Enhanced benchmark (reflecting a maximum of indicators, including the KV cache option), now integrated in a slightly revamped form in the official version.

  • 21 KV cache options (all should be considered experimental except F16 and KV Q8_0) 👍

F16 -> Foolproof (the usual KV quant since the beginning of LCPP/KCPP)
K F16 with : V Q8_0, Q5_1, Q5_0, Q4_1, Q4_0
K Q8_0 with : V Q8_0 (stable, my current main, part of the LCPP/KCPP main triplet), Q5_1 (maybe unstable), Q5_0 (maybe unstable), Q4_1 (maybe stable), Q4_0 (maybe stable); the rest is untested beyond benches
K Q5_1 with : V Q5_1, Q5_0, Q4_1, Q4_0
K Q5_0 with : V Q5_0, Q4_1, Q4_0
K Q4_1 with : V Q4_1 (stable), Q4_0 (maybe stable)
KV Q4_0 (quite stable, if we consider that it's part of the LCPP/KCPP main triplet)
Works in the command line, normally also via the GUI, and normally saves to .KCPPS config files.

  • A better Autorope, thanks to askmyteapot's PR on the official KoboldCPP GitHub, with an additional negative offset for Llama models to slightly lower the L1/L2 rope, and a positive offset for SOLAR models, in order to improve perplexity (L1, L2, SOLAR) or avoid degrading the reasoning abilities too much (L3, not implemented yet) at equal context.

  • Faster PP, in both Cublas and MMQ, thanks to Johannes Gaessler's work on these kernels, and compilation with CUDA arch 75 on top of the already present 60, 61, 70 combo. BlasBatchSize 512 is still optimal, but 256 is not far from it and is now used by default. 64 is perfectly usable and optimal for VRAM-limited scenarios.

  • A slight reduction of the pipeline parallelization, from 4 to 2: 0.5-1% VRAM saved, and less stress on the graphics cards.

  • Bitnet PR integrated; works in CPU mode (only recent, properly converted Bitnet models work, older ones do not).
    Example : https://huggingface.co/BoscoTheDog/bitnet_b1_58-xl_q8_0_gguf/tree/main

ARGUMENTS 👍(to be edited, check them in CLI or use the GUI)

Note : I had to use a simple 0-20 numbering scheme to allow the GUI and the kcpps preset saving to work properly. The problems with the previous four-digit quant scheme are fixed.

--quantkv",
help="Sets the KV cache data type quantization.

0 = 1616/F16 (16 BPW),

1 = 1680/Kf16-Vq8_0 (12.25BPW),
2 = 1651/Kf16-Vq5_1 (11BPW),
3 = 1650/Kf16-Vq5_0 (10.75BPW),
4 = 1641/Kf16-Vq4_1 (10.5BPW),
5 = 1640/Kf16-Vq4_0 (10.25BPW),

6 = 8080/KVq8_0 (8.5 BPW),
7 = 8051/Kq8_0-Vq5_1 (7.25BPW),
8 = 8050/Kq8_0-Vq5_0 (7BPW),
9 = 8041/Kq8_0-Vq4_1 (6.75BPW),
10 = 8040/Kq8_0-Vq4_0 (6.5BPW),

11 = 5151/KVq5_1 (6BPW),
12 = 5150/Kq5_1-Vq5_0 (5.75BPW),
13 = 5141/Kq5_1-Vq4_1 (5.5BPW),
14 = 5140/Kq5_1-Vq4_0 (5.25BPW),

15 = 5050/Kq5_0-Vq5_0 (5.5BPW),
16 = 5041/Kq5_0-Vq4_1 (5.25BPW),
17 = 5040/Kq5_0-Vq4_0 (5BPW),

18 = 4141/Kq4_1-Vq4_1 (5BPW),
19 = 4140/Kq4_1-Vq4_0 (4.75BPW),
20 = 4040/KVq4_0 (4.5BPW)

choices=[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20], default=0)

Lowvram option's speed is (logically) boosted by the smaller KV cache in RAM: from 25%+ in KV Q8_0 to 50%+ in KV Q4_0.

REMARKS :
You MUST use Flash Attention for anything other than QuantKV=0 (F16)
(flag : --flashattention in the CLI, or in the GUI)
Contextshift doesn't work with anything other than KV F16, but Smartcontext does.

CREDITS :
Of course, all credits go to Concedo/LostRuins and the other contributors to KoboldCPP, and to Georgi Gerganov and all the other contributors to LlamaCPP. Special big-up to Johannes Gaessler for the quantized KV cache!
I'm just poking, merging, and building around their work.

BUILDS :
All builds, i.e. Cublas 12.1/12.2 (recommended for Ada Lovelace and Ampere for additional PP speed), Cublas 11.4.4/11.7 (more pertinent than 12.2 for Pascal and Maxwell, and 11.4.4 for Kepler), and the standard one, include the OpenBLAS, CLBLAST, and Vulkan support provided by the devs.

Full Changelog: v1.68r_3207+2...v1.68s_3207+3

Kobold.CPP_FrankenFork_v1.68r_b3207+2

23 Jun 22:20

Frankenstein 1.68r "Fork" of KoboldCPP Experimental up to the 23/06/2024, 22h GMT+2.
Based on Llama.CPP b3207, and aimed mainly at Ampere and Ada GPU users.

Milestone version : in its Cuda version, probably the fastest KCPP-F ever released in terms of prompt processing.

DISCLAIMER:

The KoboldCPP-Frankenstein builds are not supported by the KoboldCPP team, GitHub repository, or Discord channel. They are for eager testing and amusement only.
My KCPP-Frankenstein version number bumps as soon as the version number in the official experimental branch bumps. They are not "upgrades" over the official version, and they might be buggy at times: only the official KCPP releases should be considered correctly numbered, reliable and "fixed".
The LlamaCPP version plus the additional PRs integrated follow my KCPP-Frankenstein versioning in the title, so everybody knows which version they are dealing with.

The official KCPP releases are here: https://github.com/LostRuins/koboldcpp/releases

FRANKENSTEIN FEATURES:

  • Enhanced benchmark (reflecting a maximum of indicators, including the KV cache option), now integrated in a slightly revamped form in the official version.

  • 21 KV cache options (all should be considered experimental except F16 and KV Q8_0) 👍

F16 -> Foolproof (the usual KV quant since the beginning of LCPP/KCPP)
K F16 with : V Q8_0, Q5_1, Q5_0, Q4_1, Q4_0
K Q8_0 with : V Q8_0 (stable, my current main, part of the LCPP/KCPP main triplet), Q5_1 (maybe unstable), Q5_0 (maybe unstable), Q4_1 (maybe stable), Q4_0 (maybe stable); the rest is untested beyond benches
K Q5_1 with : V Q5_1, Q5_0, Q4_1, Q4_0
K Q5_0 with : V Q5_0, Q4_1, Q4_0
K Q4_1 with : V Q4_1 (stable), Q4_0 (maybe stable)
KV Q4_0 (quite stable, if we consider that it's part of the LCPP/KCPP main triplet)
Works in the command line, normally also via the GUI, and normally saves to .KCPPS config files.

  • A better Autorope, thanks to askmyteapot's PR on the official KoboldCPP GitHub, with an additional negative offset for Llama models to slightly lower the L1/L2 rope, and a positive offset for SOLAR models, in order to improve perplexity (L1, L2, SOLAR) or avoid degrading the reasoning abilities too much (L3, not implemented yet) at equal context.

  • Faster PP, in both Cublas and MMQ, thanks to Johannes Gaessler's work on these kernels, and compilation with CUDA arch 75 on top of the already present 60, 61, 70 combo. BlasBatchSize 512 is still optimal, but 256 is not far from it and is now used by default. 64 is perfectly usable and optimal for VRAM-limited scenarios.

  • A slight reduction of the pipeline parallelization, from 4 to 2: 0.5-1% VRAM saved, and less stress on the graphics cards.

ARGUMENTS 👍(to be edited, check them in CLI or use the GUI)

Note : I had to use a simple 0-20 numbering scheme to allow the GUI and the kcpps preset saving to work properly. The problems with the previous four-digit quant scheme are fixed.

--quantkv",
help="Sets the KV cache data type quantization.

0 = 1616/F16 (16 BPW),

1 = 1680/Kf16-Vq8_0 (12.25BPW),
2 = 1651/Kf16-Vq5_1 (11BPW),
3 = 1650/Kf16-Vq5_0 (10.75BPW),
4 = 1641/Kf16-Vq4_1 (10.5BPW),
5 = 1640/Kf16-Vq4_0 (10.25BPW),

6 = 8080/KVq8_0 (8.5 BPW),
7 = 8051/Kq8_0-Vq5_1 (7.25BPW),
8 = 8050/Kq8_0-Vq5_0 (7BPW),
9 = 8041/Kq8_0-Vq4_1 (6.75BPW),
10 = 8040/Kq8_0-Vq4_0 (6.5BPW),

11 = 5151/KVq5_1 (6BPW),
12 = 5150/Kq5_1-Vq5_0 (5.75BPW),
13 = 5141/Kq5_1-Vq4_1 (5.5BPW),
14 = 5140/Kq5_1-Vq4_0 (5.25BPW),

15 = 5050/Kq5_0-Vq5_0 (5.5BPW),
16 = 5041/Kq5_0-Vq4_1 (5.25BPW),
17 = 5040/Kq5_0-Vq4_0 (5BPW),

18 = 4141/Kq4_1-Vq4_1 (5BPW),
19 = 4140/Kq4_1-Vq4_0 (4.75BPW),
20 = 4040/KVq4_0 (4.5BPW)

choices=[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20], default=0)

Lowvram option's speed is (logically) boosted by the smaller KV cache in RAM: from 25%+ in KV Q8_0 to 50%+ in KV Q4_0.

REMARKS :
You MUST use Flash Attention for anything other than QuantKV=0 (F16)
(flag : --flashattention in the CLI, or in the GUI)
Contextshift doesn't work with anything other than KV F16, but Smartcontext does.

CREDITS :
Of course, all credits go to Concedo/LostRuins and the other contributors to KoboldCPP, and to Georgi Gerganov and all the other contributors to LlamaCPP. Special big-up to Johannes Gaessler for the quantized KV cache!
I'm just poking, merging, and building around their work.

BUILDS :
All builds, i.e. Cublas 12.1/12.2 (recommended for Ada Lovelace and Ampere for additional PP speed), Cublas 11.4.4/11.7 (more pertinent than 12.2 for Pascal and Maxwell, and 11.4.4 for Kepler), and the standard one, include the OpenBLAS, CLBLAST, and Vulkan support provided by the devs.


Full Changelog: v1.68k_b3184+1...v1.68r_3207+2

Kobold.CPP_FrankenFork_v1.68p_b3203+2

22 Jun 23:11

Frankenstein 1.68p "Fork" of KoboldCPP Experimental up to the 22/06/2024, 22h GMT+2.
Based on Llama.CPP b3203.

DISCLAIMER:

The KoboldCPP-Frankenstein builds are not supported by the KoboldCPP team, GitHub repository, or Discord channel. They are for eager testing and amusement only.
My KCPP-Frankenstein version number bumps as soon as the version number in the official experimental branch bumps. They are not "upgrades" over the official version, and they might be buggy at times: only the official KCPP releases should be considered correctly numbered, reliable and "fixed".
The LlamaCPP version plus the additional PRs integrated follow my KCPP-Frankenstein versioning in the title, so everybody knows which version they are dealing with.

The official KCPP releases are here: https://github.com/LostRuins/koboldcpp/releases

FRANKENSTEIN FEATURES:

  • Enhanced benchmark (reflecting a maximum of indicators, including the KV cache option), now integrated in a slightly revamped form in the official version.

  • 18 KV cache options (all should be considered experimental except F16 and KV Q8_0) 👍

F16 -> Foolproof (the usual KV quant since the beginning of LCPP/KCPP)
K F16 with : V Q8_0, Q5_1, Q5_0, Q4_1, Q4_0
K Q8_0 with : V Q8_0 (stable, my current main, part of the LCPP/KCPP main triplet), Q5_1 (maybe unstable), Q5_0 (maybe unstable), Q4_1 (maybe stable), Q4_0; the rest is untested beyond benches
K Q5_1 with : V Q5_1, Q5_0, Q4_1, Q4_0
K Q5_0 with : V Q4_0
K Q4_1 with : V Q4_0
KV Q4_0 (quite stable, if we consider that it's part of the LCPP/KCPP main triplet)
Works in the command line, normally also via the GUI, and normally saves to .KCPPS config files.

A better Autorope, thanks to askmyteapot's PR on the official KoboldCPP GitHub, with an additional negative offset for Llama models to slightly lower the rope, and a positive offset for SOLAR models, in order to improve perplexity (L1, L2) or avoid degrading the reasoning abilities too much (L3) at equal context.

ARGUMENTS 👍

Note : I had to use a simple 0-17 numbering scheme to allow the GUI and the kcpps preset saving to work properly. The problems with the previous four-digit quant scheme are fixed.

--quantkv",
help="Sets the KV cache data type quantization.

0 = 1616/F16 (16 BPW),

1 = 1680/Kf16-Vq8_0 (12.25BPW),
2 = 1651/Kf16-Vq5_1 (11BPW),
3 = 1650/Kf16-Vq5_0 (10.75BPW),
4 = 1641/Kf16-Vq4_1 (10.5BPW),
5 = 1640/Kf16-Vq4_0 (10.25BPW),

6 = 8080/KVq8_0 (8.5 BPW),
7 = 8051/Kq8_0-Vq5_1 (7.25BPW),
8 = 8050/Kq8_0-Vq5_0 (7BPW),
9 = 8041/Kq8_0-Vq4_1 (6.75BPW),
10 = 8040/Kq8_0-Vq4_0 (6.5BPW),

11 = 5151/KVq5_1 (6BPW),
12 = 5150/Kq5_1-Vq5_0 (5.75BPW),
13 = 5141/Kq5_1-Vq4_1 (5.5BPW),
14 = 5140/Kq5_1-Vq4_0 (5.25BPW),

15 = 5040/Kq5_0-Vq4_0 (5BPW),

16 = 4140/Kq4_1-Vq4_0 (4.75BPW),

17 = 4040/KVq4_0 (4.5BPW)

choices=[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17], default=0)

Lowvram option's speed is (logically) boosted by the smaller KV cache in RAM: from 25%+ in KV Q8_0 to 50%+ in KV Q4_0.

REMARKS :
You MUST use Flash Attention for anything other than QuantKV=0 (F16)
(flag : --flashattention in the CLI, or in the GUI)
Contextshift doesn't work with anything other than KV F16, but Smartcontext does.

CREDITS :
Of course, all credits go to Concedo/LostRuins and the other contributors to KoboldCPP, and to Georgi Gerganov and all the other contributors to LlamaCPP. Special big-up to Johannes Gaessler for the quantized KV cache!
I'm just poking, merging, and building around their work.

BUILDS :
All builds, i.e. Cublas 12.1/12.2 (recommended for Ada Lovelace and Ampere for additional PP speed), Cublas 11.4.4/11.7 (more pertinent than 12.2 for Pascal and Maxwell, and 11.4.4 for Kepler), and the standard one, include the OpenBLAS, CLBLAST, and Vulkan support provided by the devs.

What's Changed beyond b_3203

  • Last MMQ Int8 PR of Johannes Gaessler

Full Changelog: v1.68k_b3184+1...v1.68p_b3203+2

Kobold.CPP_FrankenFork_v1.68k_b3184+1

19 Jun 17:26

Frankenstein 1.68k "Fork" of KoboldCPP Experimental up to the 19/06/2024, 16h GMT+2.
Based on Llama.CPP b3184.

DISCLAIMER:

The KoboldCPP-Frankenstein builds are not supported by the KoboldCPP team, GitHub repository, or Discord channel. They are for eager testing and amusement only.
My KCPP-Frankenstein version number bumps as soon as the version number in the official experimental branch bumps. They are not "upgrades" over the official version, and they might be buggy at times: only the official KCPP releases should be considered correctly numbered, reliable and "fixed".
The LlamaCPP version plus the additional PRs integrated follow my KCPP-Frankenstein versioning in the title, so everybody knows which version they are dealing with.

The official KCPP releases are here: https://github.com/LostRuins/koboldcpp/releases

FRANKENSTEIN FEATURES:

  • Enhanced benchmark (reflecting a maximum of indicators, including the KV cache option), now integrated in a slightly revamped form in the official version.

  • 18 KV cache options (all should be considered experimental except F16 and KV Q8_0) 👍

F16 -> Foolproof (the usual KV quant since the beginning of LCPP/KCPP)
K F16 with : V Q8_0, Q5_1, Q5_0, Q4_1, Q4_0
K Q8_0 with : V Q8_0 (stable, my current main, part of the LCPP/KCPP main triplet), Q5_1 (maybe unstable), Q5_0 (maybe unstable), Q4_1 (maybe stable), Q4_0; the rest is untested beyond benches
K Q5_1 with : V Q5_1, Q5_0, Q4_1, Q4_0
K Q5_0 with : V Q4_0
K Q4_1 with : V Q4_0
KV Q4_0 (quite stable, if we consider that it's part of the LCPP/KCPP main triplet)
Works in the command line, normally also via the GUI, and normally saves to .KCPPS config files.

A better Autorope, thanks to askmyteapot's PR on the official KoboldCPP GitHub, with an additional negative offset for Llama models to slightly lower the rope, and a positive offset for SOLAR models, in order to improve perplexity (L1, L2) or avoid degrading the reasoning abilities too much (L3) at equal context.

ARGUMENTS 👍

Note : I had to use a simple 0-17 numbering scheme to allow the GUI and the kcpps preset saving to work properly. The problems with the previous four-digit quant scheme are fixed.

--quantkv",
help="Sets the KV cache data type quantization.

0 = 1616/F16 (16 BPW),

1 = 1680/Kf16-Vq8_0 (12.25BPW),
2 = 1651/Kf16-Vq5_1 (11BPW),
3 = 1650/Kf16-Vq5_0 (10.75BPW),
4 = 1641/Kf16-Vq4_1 (10.5BPW),
5 = 1640/Kf16-Vq4_0 (10.25BPW),

6 = 8080/KVq8_0 (8.5 BPW),
7 = 8051/Kq8_0-Vq5_1 (7.25BPW),
8 = 8050/Kq8_0-Vq5_0 (7BPW),
9 = 8041/Kq8_0-Vq4_1 (6.75BPW),
10 = 8040/Kq8_0-Vq4_0 (6.5BPW),

11 = 5151/KVq5_1 (6BPW),
12 = 5150/Kq5_1-Vq5_0 (5.75BPW),
13 = 5141/Kq5_1-Vq4_1 (5.5BPW),
14 = 5140/Kq5_1-Vq4_0 (5.25BPW),

15 = 5040/Kq5_0-Vq4_0 (5BPW),

16 = 4140/Kq4_1-Vq4_0 (4.75BPW),

17 = 4040/KVq4_0 (4.5BPW)

choices=[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17], default=0)

Lowvram option's speed is (logically) boosted by the smaller KV cache in RAM: from 25%+ in KV Q8_0 to 50%+ in KV Q4_0.

REMARKS :
You MUST use Flash Attention for anything other than QuantKV=0 (F16)
(flag : --flashattention in the CLI, or in the GUI)
Contextshift doesn't work with anything other than KV F16, but Smartcontext does.

CREDITS :
Of course, all credits go to Concedo/LostRuins and the other contributors to KoboldCPP, and to Georgi Gerganov and all the other contributors to LlamaCPP. Special big-up to Johannes Gaessler for the quantized KV cache!
I'm just poking, merging, and building around their work.

BUILDS :
All builds, i.e. Cublas 12.1/12.2 (recommended for Ada Lovelace and Ampere for additional PP speed), Cublas 11.4.4/11.7 (more pertinent than 12.2 for Pascal and Maxwell, and 11.4.4 for Kepler), and the standard one, include the OpenBLAS, CLBLAST, and Vulkan support provided by the devs.

What's Changed beyond b_3184

AVX IQ Quants PR 7845 by netrunnereve

Full Changelog: v1.68g_b3158+4...v1.68k_b3184+1

Kobold.CPP_FrankenFork_v1.68g_b3158+4

16 Jun 17:12

Frankenstein 1.68g "Fork" of KoboldCPP Experimental up to the 16/06/2024, 16h GMT+2.
Based on Llama.CPP b3158.

DISCLAIMER:

The KoboldCPP-Frankenstein builds are not supported by the KoboldCPP team, GitHub repository, or Discord channel. They are for eager testing and amusement only.
My KCPP-Frankenstein version number bumps as soon as the version number in the official experimental branch bumps. They are not "upgrades" over the official version, and they might be buggy at times: only the official KCPP releases should be considered correctly numbered, reliable and "fixed".
The LlamaCPP version plus the additional PRs integrated follow my KCPP-Frankenstein versioning in the title, so everybody knows which version they are dealing with.

The official KCPP releases are here: https://github.com/LostRuins/koboldcpp/releases

FRANKENSTEIN FEATURES:

Enhanced benchmark (reflecting a maximum of indicators, including the KV cache option)

18 KV cache options (all should be considered experimental except F16 and KV Q8_0) 👍

F16 -> Foolproof (the usual KV quant since the beginning of LCPP/KCPP)
K F16 with : V Q8_0, Q5_1, Q5_0, Q4_1, Q4_0
K Q8_0 with : V Q8_0 (stable, my current main, part of the LCPP/KCPP main triplet), Q5_1 (maybe unstable), Q5_0 (maybe unstable), Q4_1 (maybe stable), Q4_0; the rest is untested beyond benches
K Q5_1 with : V Q5_1, Q5_0, Q4_1, Q4_0
K Q5_0 with : V Q4_0
K Q4_1 with : V Q4_0
KV Q4_0 (quite stable, if we consider that it's part of the LCPP/KCPP main triplet)
Works in the command line, normally also via the GUI, and normally saves to .KCPPS config files.

A better Autorope, thanks to askmyteapot's PR on the official KoboldCPP GitHub, with an additional negative offset for Llama models to slightly lower the rope, and a positive offset for SOLAR models, in order to improve perplexity (L1, L2) or avoid degrading the reasoning abilities too much (L3) at equal context.

ARGUMENTS 👍

Note : I had to use a simple 0-17 numbering scheme to allow the GUI and the kcpps preset saving to work properly. The problems with the previous four-digit quant scheme are fixed.

--quantkv",
help="Sets the KV cache data type quantization.

0 = 1616/F16 (16 BPW),

1 = 1680/Kf16-Vq8_0 (12.25BPW),
2 = 1651/Kf16-Vq5_1 (11BPW),
3 = 1650/Kf16-Vq5_0 (10.75BPW),
4 = 1641/Kf16-Vq4_1 (10.5BPW),
5 = 1640/Kf16-Vq4_0 (10.25BPW),

6 = 8080/KVq8_0 (8.5 BPW),
7 = 8051/Kq8_0-Vq5_1 (7.25BPW),
8 = 8050/Kq8_0-Vq5_0 (7BPW),
9 = 8041/Kq8_0-Vq4_1 (6.75BPW),
10 = 8040/Kq8_0-Vq4_0 (6.5BPW),

11 = 5151/KVq5_1 (6BPW),
12 = 5150/Kq5_1-Vq5_0 (5.75BPW),
13 = 5141/Kq5_1-Vq4_1 (5.5BPW),
14 = 5140/Kq5_1-Vq4_0 (5.25BPW),

15 = 5040/Kq5_0-Vq4_0 (5BPW),

16 = 4140/Kq4_1-Vq4_0 (4.75BPW),

17 = 4040/KVq4_0 (4.5BPW)

choices=[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17], default=0)

Lowvram option's speed is (logically) boosted by the smaller KV cache in RAM: from 25%+ in KV Q8_0 to 50%+ in KV Q4_0.

REMARKS :
You MUST use Flash Attention for anything other than QuantKV=0 (F16)
(flag : --flashattention in the CLI, or in the GUI)
Contextshift doesn't work with anything other than KV F16, but Smartcontext does.

CREDITS :
Of course, all credits go to Concedo/LostRuins and the other contributors to KoboldCPP, and to Georgi Gerganov and all the other contributors to LlamaCPP. Special big-up to Johannes Gaessler for the quantized KV cache!
I'm just poking, merging, and building around their work.

BUILDS :
All builds, i.e. Cublas 12.1/12.2 (recommended for Ada Lovelace and Ampere for additional PP speed), Cublas 11.4.4/11.7 (more pertinent than 12.2 for Pascal and Maxwell, and 11.4.4 for Kepler), and the standard one, include the OpenBLAS, CLBLAST, and Vulkan support provided by the devs.
What's Changed

Direct I/O and Transparent HugePages by @pavelfatin in https://github.com/ggerganov/llama.cpp/pull/7420
GradientAI Auto ROPE Base calculation by @askmyteapot in https://github.com/LostRuins/koboldcpp/pull/910
b3141 + Fix unset main device by @bashbaug in https://github.com/ggerganov/llama.cpp/pull/7909
https://github.com/Nexesenex/kobold.cpp/commit/76d66ee0be91e2bec93206e821ee1db8d023cff5 by Johannes Gaessler

Kobold.CPP_FrankenFork_v1.68f_b3142+4

14 Jun 23:18

Frankenstein 1.68f "Fork" of KoboldCPP Experimental up to the 14/06/2024, 20h GMT+2.
Based on Llama.CPP b3142.

DISCLAIMER:

The KoboldCPP-Frankenstein builds are not supported by the KoboldCPP team, GitHub repository, or Discord channel. They are for eager testing and amusement only.
My KCPP-Frankenstein version number bumps as soon as the version number in the official experimental branch bumps. They are not "upgrades" over the official version, and they might be buggy at times: only the official KCPP releases should be considered correctly numbered, reliable and "fixed".
The LlamaCPP version plus the additional PRs integrated follow my KCPP-Frankenstein versioning in the title, so everybody knows which version they are dealing with.

The official KCPP releases are here: https://github.com/LostRuins/koboldcpp/releases

FRANKENSTEIN FEATURES:

Enhanced benchmark (reflecting a maximum of indicators, including the KV cache option)

18 KV cache options (all should be considered experimental except F16 and KV Q8_0) 👍

F16 -> Foolproof (the usual KV quant since the beginning of LCPP/KCPP)
K F16 with : V Q8_0, Q5_1, Q5_0, Q4_1, Q4_0
K Q8_0 with : V Q8_0 (stable, my current main, part of the LCPP/KCPP main triplet), Q5_1 (maybe unstable), Q5_0 (maybe unstable), Q4_1 (maybe stable), Q4_0; the rest is untested beyond benches
K Q5_1 with : V Q5_1, Q5_0, Q4_1, Q4_0
K Q5_0 with : V Q4_0
K Q4_1 with : V Q4_0
KV Q4_0 (quite stable, if we consider that it's part of the LCPP/KCPP main triplet)
Works in the command line, normally also via the GUI, and normally saves to .KCPPS config files.

A better Autorope, thanks to askmyteapot's PR on the official KoboldCPP GitHub, with an additional negative offset for Llama models to slightly lower the rope, and a positive offset for SOLAR models, in order to improve perplexity (L1, L2) or avoid degrading the reasoning abilities too much (L3) at equal context.

ARGUMENTS 👍

Note : I had to use a simple 0-17 numbering scheme to allow the GUI and the kcpps preset saving to work properly. The problems with the previous four-digit quant scheme are fixed.

--quantkv",
help="Sets the KV cache data type quantization.

0 = 1616/F16 (16 BPW),

1 = 1680/Kf16-Vq8_0 (12.25BPW),
2 = 1651/Kf16-Vq5_1 (11BPW),
3 = 1650/Kf16-Vq5_0 (10.75BPW),
4 = 1641/Kf16-Vq4_1 (10.5BPW),
5 = 1640/Kf16-Vq4_0 (10.25BPW),

6 = 8080/KVq8_0 (8.5 BPW),
7 = 8051/Kq8_0-Vq5_1 (7.25BPW),
8 = 8050/Kq8_0-Vq5_0 (7BPW),
9 = 8041/Kq8_0-Vq4_1 (6.75BPW),
10 = 8040/Kq8_0-Vq4_0 (6.5BPW),

11 = 5151/KVq5_1 (6BPW),
12 = 5150/Kq5_1-Vq5_0 (5.75BPW),
13 = 5141/Kq5_1-Vq4_1 (5.5BPW),
14 = 5140/Kq5_1-Vq4_0 (5.25BPW),

15 = 5040/Kq5_0-Vq4_0 (5BPW),

16 = 4140/Kq4_1-Vq4_0 (4.75BPW),

17 = 4040/KVq4_0 (4.5BPW)

choices=[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17], default=0)

Lowvram option's speed is (logically) boosted by the smaller KV cache in RAM: from 25%+ in KV Q8_0 to 50%+ in KV Q4_0.

REMARKS :
You MUST use Flash Attention for anything other than QuantKV=0 (F16)
(flag : --flashattention in the CLI, or in the GUI)
Contextshift doesn't work with anything other than KV F16, but Smartcontext does.

CREDITS :
Of course, all credits go to Concedo/LostRuins and the other contributors to KoboldCPP, and to Georgi Gerganov and all the other contributors to LlamaCPP. Special big-up to Johannes Gaessler for the quantized KV cache!
I'm just poking, merging, and building around their work.

BUILDS :
All builds, i.e. Cublas 12.1/12.2 (recommended for Ada Lovelace and Ampere for additional PP speed), Cublas 11.4.4/11.7 (more pertinent than 12.2 for Pascal and Maxwell, and 11.4.4 for Kepler), and the standard one, include the OpenBLAS, CLBLAST, and Vulkan support provided by the devs.
What's Changed

Direct I/O and Transparent HugePages by @pavelfatin in https://github.com/ggerganov/llama.cpp/pull/7420
GradientAI Auto ROPE Base calculation by @askmyteapot in https://github.com/LostRuins/koboldcpp/pull/910
b3141 + Fix unset main device by @bashbaug in https://github.com/ggerganov/llama.cpp/pull/7909
https://github.com/Nexesenex/kobold.cpp/commit/76d66ee0be91e2bec93206e821ee1db8d023cff5 by Johannes Gaessler

Full Changelog: v1.68d_b3141+3...v1.68f_b3142+4

Kobold.CPP_FrankenFork_v1.68d_b3141+3_GradientRope

13 Jun 02:40

Frankenstein 1.68d "Fork" of KoboldCPP Experimental up to the 13/06/2024, 00h GMT+2.
Based on Llama.CPP b3141.

DISCLAIMER:

The KoboldCPP-Frankenstein builds are not supported by the KoboldCPP team, GitHub repository, or Discord channel. They are for eager testing and amusement only.
My KCPP-Frankenstein version number bumps as soon as the version number in the official experimental branch bumps. They are not "upgrades" over the official version, and they might be buggy at times: only the official KCPP releases should be considered correctly numbered, reliable and "fixed".
The LlamaCPP version plus the additional PRs integrated follow my KCPP-Frankenstein versioning in the title, so everybody knows which version they are dealing with.

The official KCPP releases are here: https://github.com/LostRuins/koboldcpp/releases

FRANKENSTEIN FEATURES:

Enhanced benchmark (reflecting a maximum of indicators, including the KV cache option)

18 KV cache options (all should be considered experimental except F16 and KV Q8_0) 👍

F16 -> Foolproof (the usual KV quant since the beginning of LCPP/KCPP)
K F16 with : V Q8_0, Q5_1, Q5_0, Q4_1, Q4_0
K Q8_0 with : V Q8_0 (stable, my current main, part of the LCPP/KCPP main triplet), Q5_1 (maybe unstable), Q5_0 (maybe unstable), Q4_1 (maybe stable), Q4_0; the rest is untested beyond benches
K Q5_1 with : V Q5_1, Q5_0, Q4_1, Q4_0
K Q5_0 with : V Q4_0
K Q4_1 with : V Q4_0
KV Q4_0 (quite stable, if we consider that it's part of the LCPP/KCPP main triplet)
Works in the command line, normally also via the GUI, and normally saves to .KCPPS config files.

A better Autorope, thanks to askmyteapot's PR on the official KoboldCPP GitHub, with an additional negative offset for Llama models to slightly lower the rope, in order to improve perplexity (L1, L2) or avoid degrading the reasoning abilities too much (L3) at equal context.

ARGUMENTS 👍

Note : I had to use a simple 0-17 numbering scheme to allow the GUI and the kcpps preset saving to work properly. The problems with the previous four-digit quant scheme are fixed.

--quantkv",
help="Sets the KV cache data type quantization.

0 = 1616/F16 (16 BPW),

1 = 1680/Kf16-Vq8_0 (12.25BPW),
2 = 1651/Kf16-Vq5_1 (11BPW),
3 = 1650/Kf16-Vq5_0 (10.75BPW),
4 = 1641/Kf16-Vq4_1 (10.5BPW),
5 = 1640/Kf16-Vq4_0 (10.25BPW),

6 = 8080/KVq8_0 (8.5 BPW),
7 = 8051/Kq8_0-Vq5_1 (7.25BPW),
8 = 8050/Kq8_0-Vq5_0 (7BPW),
9 = 8041/Kq8_0-Vq4_1 (6.75BPW),
10 = 8040/Kq8_0-Vq4_0 (6.5BPW),

11 = 5151/KVq5_1 (6BPW),
12 = 5150/Kq5_1-Vq5_0 (5.75BPW),
13 = 5141/Kq5_1-Vq4_1 (5.5BPW),
14 = 5140/Kq5_1-Vq4_0 (5.25BPW),

15 = 5040/Kq5_0-Vq4_0 (5BPW),

16 = 4140/Kq4_1-Vq4_0 (4.75BPW),

17 = 4040/KVq4_0 (4.5BPW)

choices=[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17], default=0)

Lowvram option's speed is (logically) boosted by the smaller KV cache in RAM: from 25%+ in KV Q8_0 to 50%+ in KV Q4_0.

REMARKS :
You MUST use Flash Attention for anything other than QuantKV=0 (F16)
(flag : --flashattention in the CLI, or in the GUI)
Contextshift doesn't work with anything other than KV F16, but Smartcontext does.

CREDITS :
Of course, all credits go to Concedo/LostRuins and the other contributors to KoboldCPP, and to Georgi Gerganov and all the other contributors to LlamaCPP. Special big-up to Johannes Gaessler for the quantized KV cache!
I'm just poking, merging, and building around their work.

BUILDS :
All builds, i.e. Cublas 12.1/12.2 (recommended for Ada Lovelace and Ampere for additional PP speed), Cublas 11.4.4/11.7 (more pertinent than 12.2 for Pascal and Maxwell, and 11.4.4 for Kepler), and the standard one, include the OpenBLAS, CLBLAST, and Vulkan support provided by the devs.


Full Changelog: v1.68b_b3132+1...v1.68d_b3141+3

Kobold.CPP_FrankenFork_v1.68b_b3132+1_AutoRopeFix

11 Jun 16:53

Frankenstein 1.68b "Fork" of KoboldCPP Experimental up to the 11/06/2024, 16h GMT+2.
Based on Llama.CPP b3132.

DISCLAIMER:

The KoboldCPP-Frankenstein builds are not supported by the KoboldCPP team, GitHub repository, or Discord channel. They are for eager testing and amusement only.
My KCPP-Frankenstein version number bumps as soon as the version number in the official experimental branch bumps. They are not "upgrades" over the official version, and they might be buggy at times: only the official KCPP releases should be considered correctly numbered, reliable and "fixed".
The LlamaCPP version plus the additional PRs integrated follow my KCPP-Frankenstein versioning in the title, so everybody knows which version they are dealing with.

The official KCPP releases are here: https://github.com/LostRuins/koboldcpp/releases

FRANKENSTEIN FEATURES:

Enhanced benchmark (reflecting a maximum of indicators, including the KV cache option)

18 KV cache options (all should be considered experimental except F16 and KV Q8_0) 👍

F16 -> Foolproof (the usual KV quant since the beginning of LCPP/KCPP)
K F16 with : V Q8_0, Q5_1, Q5_0, Q4_1, Q4_0
K Q8_0 with : V Q8_0 (stable, my current main, part of the LCPP/KCPP main triplet), Q5_1 (maybe unstable), Q5_0 (maybe unstable), Q4_1 (maybe stable), Q4_0; the rest is untested beyond benches
K Q5_1 with : V Q5_1, Q5_0, Q4_1, Q4_0
K Q5_0 with : V Q4_0
K Q4_1 with : V Q4_0
KV Q4_0 (quite stable, if we consider that it's part of the LCPP/KCPP main triplet)
Works in the command line, normally also via the GUI, and normally saves to .KCPPS config files.

A better Autorope, thanks to askmyteapot's PR on the official KoboldCPP GitHub.

ARGUMENTS 👍

Note : I had to use a simple 0-17 numbering scheme to allow the GUI and the kcpps preset saving to work properly. The problems with the previous four-digit quant scheme are fixed.

--quantkv",
help="Sets the KV cache data type quantization.

0 = 1616/F16 (16 BPW),

1 = 1680/Kf16-Vq8_0 (12.25BPW),
2 = 1651/Kf16-Vq5_1 (11BPW),
3 = 1650/Kf16-Vq5_0 (10.75BPW),
4 = 1641/Kf16-Vq4_1 (10.5BPW),
5 = 1640/Kf16-Vq4_0 (10.25BPW),

6 = 8080/KVq8_0 (8.5 BPW),
7 = 8051/Kq8_0-Vq5_1 (7.25BPW),
8 = 8050/Kq8_0-Vq5_0 (7BPW),
9 = 8041/Kq8_0-Vq4_1 (6.75BPW),
10 = 8040/Kq8_0-Vq4_0 (6.5BPW),

11 = 5151/KVq5_1 (6BPW),
12 = 5150/Kq5_1-Vq5_0 (5.75BPW),
13 = 5141/Kq5_1-Vq4_1 (5.5BPW),
14 = 5140/Kq5_1-Vq4_0 (5.25BPW),

15 = 5040/Kq5_0-Vq4_0 (5BPW),

16 = 4140/Kq4_1-Vq4_0 (4.75BPW),

17 = 4040/KVq4_0 (4.5BPW)

choices=[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17], default=0)

Lowvram option's speed is (logically) boosted by the smaller KV cache in RAM: from 25%+ in KV Q8_0 to 50%+ in KV Q4_0.

REMARKS :
You MUST use Flash Attention for anything other than QuantKV=0 (F16)
(flag : --flashattention in the CLI, or in the GUI)
Contextshift doesn't work with anything other than KV F16, but Smartcontext does.

CREDITS :
Of course, all credits go to Concedo/LostRuins and the other contributors to KoboldCPP, and to Georgi Gerganov and all the other contributors to LlamaCPP. Special big-up to Johannes Gaessler for the quantized KV cache!
I'm just poking, merging, and building around their work.

BUILDS :
All builds, i.e. Cublas 12.1/12.2 (recommended for Ada Lovelace and Ampere for additional PP speed), Cublas 11.4.4/11.7 (more pertinent than 12.2 for Pascal and Maxwell, and 11.4.4 for Kepler), and the standard one, include the OpenBLAS, CLBLAST, and Vulkan support provided by the devs.


Full Changelog: v1.68a_b3124+1...v1.68b_b3132+1