- Changed: GPTQ settings file (`quantize_config.json`) is now prioritized over the command-line options
- Added: `words_blacklist_at_start` parameter
- Added: support for GPTQ models
- Added: support for `float4` precision
- Added: `can_stop_early` parameter
- Changed: switching to CUDA 11.8
- Changed: all layers are assigned to the first GPU by default (if `--layers` is not specified)
- Improved: `int8` precision now works on all GPUs that Neodim Server supports
- Improved: blacklist processing
- Added: `no_repeat_ngram_size` parameter
- Added: `words_blacklist` parameter
- Added: `words_whitelist` parameter
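  A minimal sketch of a request combining these three parameters; the endpoint URL, port, and every field not named in this changelog are assumptions:

  ```python
  import requests

  resp = requests.post("http://127.0.0.1:8787/generate", json={
      "prompt": "Describe the castle:",
      "generated_tokens_count": 64,
      "max_total_tokens": 1024,
      "no_repeat_ngram_size": 3,             # never repeat any 3-token sequence
      "words_blacklist": ["dungeon"],        # words the model must not produce
      "words_whitelist": ["moat", "tower"],  # assumed semantics: restrict generation to these words
  })
  print(resp.json())
  ```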
- Added: support for LLaMA models
- Added: support for safetensors
- Improved: 8-bit models can now be loaded directly
- Fixed: the playground was missing some input parameters
- Fixed: multiple encoding/decoding problems
- Changed: the server now tries to generate at least one token
- Fixed: specifying `top_k` caused a type error
- Added: contrastive search (see the `penalty_alpha` parameter)
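  In Hugging Face-style decoding, contrastive search pairs a small `top_k` with a nonzero `penalty_alpha`. A hypothetical request; the URL and all fields other than the two parameters above are assumptions:

  ```python
  import requests

  resp = requests.post("http://127.0.0.1:8787/generate", json={
      "prompt": "In a hole in the ground",
      "generated_tokens_count": 64,
      "max_total_tokens": 1024,
      "top_k": 4,            # contrastive search uses a small candidate set
      "penalty_alpha": 0.6,  # degeneration penalty; 0 disables contrastive search
  })
  print(resp.json())
  ```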
- Fixed: wrong truncation of the inference result
- Changed: special tokens (e.g. `<|endoftext|>` or `</s>`) are now always removed from the input and the output
- Added: server version validation (`required_server_version` request parameter)
- Added: the full output from the model is returned in the `output_text` response field
- Added: the preamble that was actually used is returned in the `preamble` response field
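  Taken together, a client could pin the server version and inspect the new response fields. A sketch; the version-string format and the response layout are assumptions, not taken from this changelog:

  ```python
  import requests

  resp = requests.post("http://127.0.0.1:8787/generate", json={
      "prompt": "Chapter 1.",
      "generated_tokens_count": 32,
      "max_total_tokens": 1024,
      "required_server_version": ">=0.7",  # assumed format; the server should reject the request on mismatch
  })
  data = resp.json()
  print(data.get("preamble"))     # the preamble the server actually used
  print(data.get("output_text"))  # full model output (assumed to be a top-level field)
  ```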
- Improved: silenced the "the specified maximum sequence length" and "Welcome to bitsandbytes" messages
- Changed: using CUDA 11.7
- Changed: `stop_strings` and `truncate_prompt_until` are not sorted anymore
- Added: support for GPT-NeoX, CodeGen and BLOOM models
- Added: ability to specify regular expressions as stop strings (see the `stop_strings_type` and `stop_strings_required_matches_count` request parameters and the `stop_string_match` response parameter)
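  A sketch of a request using a regular-expression stop string; everything except the parameter names listed above (URL, port, value formats) is an assumption:

  ```python
  import requests

  resp = requests.post("http://127.0.0.1:8787/generate", json={
      "prompt": "He said:",
      "generated_tokens_count": 64,
      "max_total_tokens": 1024,
      "stop_strings": [r"[.!?]"],                # stop at sentence-ending punctuation
      "stop_strings_type": "regex",              # assumed value: treat stop_strings as regexes
      "stop_strings_required_matches_count": 1,  # stop after the first match
  })
  data = resp.json()
  print(data.get("stop_string_match"))  # the exact text the regex matched (assumed location)
  ```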
- Improved: layers distribution is now supported for 8-bit precision (i.e. `layers` can be set to any supported value when `precision=int8`)
- Improved: more readable display of free VRAM and layers distribution
- Improved: repetition penalty can now be specified in `warpers_order`
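  A sketch placing the repetition penalty explicitly in the warper order; the warper value names and all other fields are assumptions:

  ```python
  import requests

  resp = requests.post("http://127.0.0.1:8787/generate", json={
      "prompt": "Once upon a time",
      "generated_tokens_count": 64,
      "max_total_tokens": 1024,
      "repetition_penalty": 1.15,
      "temperature": 0.8,
      "top_k": 40,
      # apply the repetition penalty first, then the samplers (assumed value names)
      "warpers_order": ["repetition_penalty", "top_k", "temperature"],
  })
  print(resp.json())
  ```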
- Added: ability to load the model in 32-bit and 8-bit precision (`precision` CLI param)
- Fixed: the fast tokenizer was not working for OPT models
- Added: typical sampling (`typical` request param)
- Added: top-a sampling (`top_a` request param)
- Added: the order of filters/sampling/warpers can now be set (`warpers_order` request param)
- Added: the playground settings can now be reset to their defaults
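  A combined sketch of the new sampling parameters; the values and any field not named in this changelog are assumptions:

  ```python
  import requests

  resp = requests.post("http://127.0.0.1:8787/generate", json={
      "prompt": "The experiment began",
      "generated_tokens_count": 64,
      "max_total_tokens": 1024,
      "typical": 0.95,  # typical sampling mass
      "top_a": 0.2,     # top-a threshold
      # run typical sampling before top-a (assumed value names)
      "warpers_order": ["typical", "top_a"],
  })
  print(resp.json())
  ```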
- Changed: using CUDA 11.6 (to support modern GPUs)
- Improved: loading models is now 4 times faster
- Improved: GPT2 models now support layers distribution
- Added: support for OPT models
- Added: use the `a` symbol to put all unspecified layers on a GPU
- Fixed: wrong device detection when moving the entire model to GPU
- Added: `--version` flag to show the current version
- Fixed: a crash with penalty slope when penalty range = 1
- Initial release