Neodim Server - CHANGELOG

v0.13 (August 6, 2023)

  • Changed: GPTQ settings file (quantize_config.json) is now prioritized over the command line options
  • Added: words_blacklist_at_start parameter

v0.12 (June 11, 2023)

  • Added: support for GPTQ models
  • Added: support for float4 precision
  • Added: can_stop_early parameter
  • Changed: using CUDA 11.8
  • Changed: all layers are placed on the first GPU by default (if --layers is not specified)
  • Improved: int8 precision is now supported on all GPUs supported by Neodim Server
  • Improved: blacklist processing

v0.11 (April 23, 2023)

  • Added: no_repeat_ngram_size parameter
  • Added: words_blacklist parameter
  • Added: words_whitelist parameter
  • Added: support for LLaMA models
  • Added: support for safetensors
  • Improved: 8-bit models can now be loaded directly
  • Fixed: the playground was missing some input parameters
  • Fixed: multiple encoding/decoding problems
  • Changed: the server now tries to generate at least one token

v0.10 (March 4, 2023)

  • Fixed: specifying top_k gave a type error
  • Added: contrastive search (see penalty_alpha parameter)

v0.9 (February 19, 2023)

  • Fixed: wrong truncation of the inference result
  • Changed: special tokens (e.g. <|endoftext|> or </s>) are now always removed from the input and the output
  • Added: server version validation (required_server_version request parameter)
  • Added: full output from the model is returned in the output_text response field
  • Added: actually used preamble is returned in the preamble response field
  • Improved: silenced the "the specified maximum sequence length" and "Welcome to bitsandbytes" messages

v0.8 (December 18, 2022)

  • Changed: using CUDA 11.7
  • Changed: stop_strings and truncate_prompt_until are not sorted anymore
  • Added: support for GPT-NeoX, CodeGen and BLOOM models
  • Added: ability to specify regular expressions as stop strings (see the stop_strings_type and stop_strings_required_matches_count request parameters and the stop_string_match response parameter)
  • Improved: layers distribution is now supported for 8-bit precision (i.e. layers can be set to any supported value when precision=int8)
  • Improved: more readable display of free VRAM and layers distribution
  • Improved: repetition penalty can now be specified in warpers_order

v0.7 (September 18, 2022)

  • Added: ability to load the model in 32-bit and 8-bit precisions (precision CLI param)
  • Fixed: the fast tokenizer was not working for OPT models

v0.6 (July 16, 2022)

  • Added: typical sampling (typical request param)
  • Added: top-a sampling (top_a request param)
  • Added: the order of filters/sampling/warpers can now be set (warpers_order request param)
  • Added: the playground settings can now be reset to their defaults

v0.5 (July 2, 2022)

  • Changed: using CUDA 11.6 (to support modern GPUs)
  • Improved: loading models is now 4 times faster
  • Improved: GPT2 models now support layers distribution
  • Added: support for OPT models
  • Added: a symbol can now be used to put all unspecified layers on a GPU

v0.4 (June 26, 2022)

  • Fixed: wrong device detection when moving the entire model to GPU

v0.3 (April 10, 2022)

  • Added: --version flag to show the current version

v0.2 (April 2, 2022)

  • Fixed: a crash with penalty slope when penalty range = 1

v0.1 (March 20, 2022)

  • Initial release