All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
1.1.0 - 2023-05-01
- Remove dependency on the protobuf package by using a Rust implementation for serialization
- Tests were failing because of a breaking change in Nox
1.0.1 - 2023-04-30
- Update pre-commit hooks and dependencies
- Also allow the use of protobuf 4
1.0.0 - 2022-05-17
- The storage backends now have a method query_documents() to leverage economies of scale when querying multiple documents at once.
- The public interface of the library is declared stable, hence it is ready for version 1.0.
- char_ngrams() is now fully implemented in Rust, giving a speedup of 2x.
- minhash LSH uses the new query_documents() of the storage backends instead of running concurrent queries.
- Wrong operator precedence in minhash implementation, which led to incorrect results.
- Incorrect parsing of tokenize argument for SimilarityStore for char_ngrams without padding.
0.10.0 - 2022-05-08
- Improved performance of SimilarityStore.query_top_n().
0.9.3 - 2022-04-05
- Fixes #63 which led to Exceptions in case of empty documents.
0.9.2 - 2022-03-29
- Fixes #62 which led to TypeErrors in case of multiple identical results.
0.9.1 - 2022-03-25
- Minimum number of hash permutations for Minhash LSH set to 16 to avoid artifacts as described in #61.
0.9.0 - 2022-03-13
- ScyllaDBStore now accepts a table_prefix setting.
- The classes in narrow_down.data_types were moved to narrow_down.storage.
- The initialize() method of the storage backends can now be called multiple times without issues.
- A use of collections.Counter as typehint broke mypy checks.
0.8.0 - 2022-02-23
- Direct InMemoryStore file serialization in the Rust backend. This avoids a memory peak and also improves the performance of the operation compared to (de-)serialization via the detour of a Python bytes object.
0.7.0 - 2022-02-06
- InMemoryStore can be serialized to and deserialized from MessagePack.
- SimilarityStore.top_n_query() now allows finding a limited number of the most similar documents.
- SimilarityStore offers the option to validate the similarity score if the document is available to avoid false positives.
- SimilarityStore objects can now be created by a factory coroutine create() instead of calling first __init__() and then initialize(). This makes the usage of the class more straightforward.
- The exact_part of a document is now also stored in storage level "Document".
- The InMemoryStore no longer uses Python dictionaries as storage, but rather a class in the Rust extension, which substantially reduces the memory footprint.
- The number of partitions is now stored in the database for the SQLite backend. This way the DB is self-contained and the user doesn't have to keep the number elsewhere.
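
The factory coroutine introduced in this release follows a common asyncio pattern. The sketch below is not the library's code: `ExampleStore`, its `_backend_ready` attribute and the `ready` property are hypothetical names, used only to illustrate why a single awaitable `create()` is simpler than calling `__init__()` and then `initialize()`.

```python
import asyncio


class ExampleStore:
    """Hypothetical class illustrating the pattern; not part of the library."""

    def __init__(self) -> None:
        # __init__ cannot await, so asynchronous setup has to happen elsewhere.
        self._backend_ready = False

    async def initialize(self) -> None:
        # Stand-in for asynchronous setup work such as creating tables.
        await asyncio.sleep(0)
        self._backend_ready = True

    @classmethod
    async def create(cls) -> "ExampleStore":
        # Factory coroutine: construct and initialize in one awaitable call.
        store = cls()
        await store.initialize()
        return store

    @property
    def ready(self) -> bool:
        return self._backend_ready


async def main() -> bool:
    store = await ExampleStore.create()
    return store.ready


result = asyncio.run(main())
```

With this pattern a caller cannot forget the initialization step, because the only documented way to obtain an object already performs it.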
0.6.0 - 2022-01-29
- Storage backend for ScyllaDB, a Cassandra-like distributed key-value store.
- StoredDocument objects are now serialized with protobuf to increase speed and reduce storage consumption.
- Storage queries are done concurrently where possible
- ScyllaDB sessions are now reused, which gives a great performance benefit
- Integer overflows in the minhash calculation which reduced the quality of the permutations (hash functions). Depending on the input, effectively max_uint32 was used instead of a prime number in the modulo calculation.
- The backend AsyncSQLiteStore is removed, because it turned out that aiosqlite relies on the user to guarantee that only one coroutine at a time tries writing. Otherwise, a "Database locked" exception is thrown. Since its performance was also worse than expected, it was removed.
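
The integer overflow fixed in this release can be illustrated with a small sketch. It assumes a commonly used form of MinHash permutation, `(a * x + b) mod p` with a Mersenne prime `p`, followed by truncation to 32 bits; the function names and the exact formula are assumptions for illustration, not taken from the library's source. The second function emulates 32-bit wraparound, which makes the prime modulus ineffective.

```python
MERSENNE_PRIME = (1 << 61) - 1  # assumed prime modulus for the permutation
MAX_UINT32 = (1 << 32) - 1


def permute(x: int, a: int, b: int) -> int:
    """Intended behaviour: reduce modulo a prime, then keep the low 32 bits."""
    return ((a * x + b) % MERSENNE_PRIME) & MAX_UINT32


def permute_overflowing(x: int, a: int, b: int) -> int:
    """Emulated 32-bit wraparound: the intermediate value is truncated first,
    so the later modulo by the prime never changes anything."""
    wrapped = (a * x + b) & MAX_UINT32
    return (wrapped % MERSENNE_PRIME) & MAX_UINT32


# a * x + b == 2**61 + 7, which exceeds the prime by 8.
x, a, b = 1 << 31, 1 << 30, 7
correct = permute(x, a, b)
overflowed = permute_overflowing(x, a, b)
```

For these inputs `correct` is 8 and `overflowed` is 7: the wrapped variant only ever sees the low 32 bits, so the result is effectively reduced modulo a power of two rather than a prime.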
0.5.0 - 2022-01-17
- The SQLite backends now take an init parameter "partitions" which leads to internally partitioned tables. This reduces query time considerably.
- Parameters for SQLite were optimized in order to increase insertion speed and reduce the number of disk write operations.
0.4.0 - 2022-01-16
- Synchronous and Asynchronous SQLite storage backend
- Settings of SimilarityStore objects are now saved in the storage backend
- Deserialization of SimilarityStore from an existing storage backend is now possible
0.3.1 - 2022-01-14
- Wrong order of parameters for Minhash LSH in the SimilarityStore class
0.3.0 - 2022-01-09
- Documents can now be removed from the index again with SimilarityStore.remove_by_id()
0.2.1 - 2022-01-09
- CI only: Wrong URL of test-pypi
0.2.0 - 2022-01-09
- A Rust extension was added, and the build system was therefore moved to Maturin.
- A Minhash-LSH data structure was implemented.
- Different tokenizers for character- and word-n-grams were implemented.
0.1.1 - 2021-12-30
0.1.0 - 2021-12-30
- First release on PyPI.