- Changed: GPTQ settings file (`quantize_config.json`) is now prioritized over the command-line options
- Added: `words_blacklist_at_start` parameter
- Added: support for GPTQ models
- Added: support for `float4` precision
- Added: `can_stop_early` parameter
- Changed: switching to CUDA 11.8
- Changed: all layers are assigned to the first GPU by default (if `--layers` is not specified)
- Improved: `int8` precision now works on all GPUs that Neodim Server supports
- Improved: blacklist processing
- Added: `no_repeat_ngram_size` parameter
- Added: `words_blacklist` parameter
- Added: `words_whitelist` parameter
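  A minimal sketch of a request combining these three parameters; the endpoint URL, port, and every field not named in this changelog are assumptions:

  ```python
  import requests

  resp = requests.post("http://127.0.0.1:8787/generate", json={
      "prompt": "Describe the castle:",
      "generated_tokens_count": 64,
      "max_total_tokens": 1024,
      "no_repeat_ngram_size": 3,             # never repeat any 3-token sequence
      "words_blacklist": ["dungeon"],        # words the model must not produce
      "words_whitelist": ["moat", "tower"],  # assumed semantics: restrict generation to these words
  })
  print(resp.json())
  ```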
- Added: support for LLaMA models
- Added: support for safetensors
- Improved: 8-bit models can now be loaded directly
- Fixed: the playground was missing some input parameters
- Fixed: multiple encoding/decoding problems
- Changed: the server now tries to generate at least one token
- Fixed: specifying `top_k` caused a type error
- Added: contrastive search (see the `penalty_alpha` parameter)
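  In Hugging Face-style decoding, contrastive search pairs a small `top_k` with a nonzero `penalty_alpha`. A hypothetical request; the URL and all fields other than the two parameters above are assumptions:

  ```python
  import requests

  resp = requests.post("http://127.0.0.1:8787/generate", json={
      "prompt": "In a hole in the ground",
      "generated_tokens_count": 64,
      "max_total_tokens": 1024,
      "top_k": 4,            # contrastive search uses a small candidate set
      "penalty_alpha": 0.6,  # degeneration penalty; 0 disables contrastive search
  })
  print(resp.json())
  ```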
- Fixed: wrong truncation of the inference result
- Changed: special tokens (e.g. `<|endoftext|>` or `</s>`) are now always removed from the input and the output
- Added: server version validation (`required_server_version` request parameter)
- Added: the full output from the model is returned in the `output_text` response field
- Added: the preamble that was actually used is returned in the `preamble` response field
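  Taken together, a client could pin the server version and inspect the new response fields. A sketch; the version-string format and the response layout are assumptions, not taken from this changelog:

  ```python
  import requests

  resp = requests.post("http://127.0.0.1:8787/generate", json={
      "prompt": "Chapter 1.",
      "generated_tokens_count": 32,
      "max_total_tokens": 1024,
      "required_server_version": ">=0.7",  # assumed format; the server should reject the request on mismatch
  })
  data = resp.json()
  print(data.get("preamble"))     # the preamble the server actually used
  print(data.get("output_text"))  # full model output (assumed to be a top-level field)
  ```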
- Improved: silenced the "the specified maximum sequence length" and "Welcome to bitsandbytes" messages
- Changed: using CUDA 11.7
- Changed: `stop_strings` and `truncate_prompt_until` are not sorted anymore
- Added: support for GPT-NeoX, CodeGen and BLOOM models
- Added: ability to specify regular expressions as stop strings (see the `stop_strings_type` and `stop_strings_required_matches_count` request parameters and the `stop_string_match` response parameter)
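  A sketch of a request using a regular-expression stop string; everything except the parameter names listed above (URL, port, value formats) is an assumption:

  ```python
  import requests

  resp = requests.post("http://127.0.0.1:8787/generate", json={
      "prompt": "He said:",
      "generated_tokens_count": 64,
      "max_total_tokens": 1024,
      "stop_strings": [r"[.!?]"],                # stop at sentence-ending punctuation
      "stop_strings_type": "regex",              # assumed value: treat stop_strings as regexes
      "stop_strings_required_matches_count": 1,  # stop after the first match
  })
  data = resp.json()
  print(data.get("stop_string_match"))  # the exact text the regex matched (assumed location)
  ```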
- Improved: layers distribution is now supported for 8-bit precision (i.e. `layers` can be set to any supported value when `precision=int8`)
- Improved: more readable display of free VRAM and layers distribution
- Improved: repetition penalty can now be specified in `warpers_order`
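  A sketch placing the repetition penalty explicitly in the warper order; the warper value names and all other fields are assumptions:

  ```python
  import requests

  resp = requests.post("http://127.0.0.1:8787/generate", json={
      "prompt": "Once upon a time",
      "generated_tokens_count": 64,
      "max_total_tokens": 1024,
      "repetition_penalty": 1.15,
      "temperature": 0.8,
      "top_k": 40,
      # apply the repetition penalty first, then the samplers (assumed value names)
      "warpers_order": ["repetition_penalty", "top_k", "temperature"],
  })
  print(resp.json())
  ```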
- Added: ability to load the model in 32-bit and 8-bit precision (`precision` CLI param)
- Fixed: the fast tokenizer was not working for OPT models
- Added: typical sampling (`typical` request param)
- Added: top-a sampling (`top_a` request param)
- Added: the order of filters/sampling/warpers can now be set (`warpers_order` request param)
- Added: the playground settings can now be reset to their defaults
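  A combined sketch of the new sampling parameters; the values and any field not named in this changelog are assumptions:

  ```python
  import requests

  resp = requests.post("http://127.0.0.1:8787/generate", json={
      "prompt": "The experiment began",
      "generated_tokens_count": 64,
      "max_total_tokens": 1024,
      "typical": 0.95,  # typical sampling mass
      "top_a": 0.2,     # top-a threshold
      # run typical sampling before top-a (assumed value names)
      "warpers_order": ["typical", "top_a"],
  })
  print(resp.json())
  ```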
- Changed: using CUDA 11.6 (to support modern GPUs)
- Improved: loading models is now 4 times faster
- Improved: GPT2 models now support layers distribution
- Added: support for OPT models
- Added: use the `a` symbol to put all unspecified layers on a GPU
- Fixed: wrong device detection when moving the entire model to GPU
- Added: `--version` flag to show the current version
- Fixed: a crash with penalty slope when penalty range = 1
- Initial release