-
Notifications
You must be signed in to change notification settings - Fork 1
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
- Loading branch information
Showing
1 changed file
with
96 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,96 @@ | ||
|
||
|
||
# Search Algorithms in Vector Stores | ||
In vector databases (or vector stores), search algorithms are used to efficiently retrieve the most similar vectors to a given query vector, often as part of a similarity search or nearest neighbor search. | ||
|
||
## 1. Brute Force Search (Exact Nearest Neighbor Search) | ||
- **Description**: Compares the query vector with all vectors in the database to find the exact nearest neighbors. | ||
- **Accuracy**: 100% accurate (exhaustive search). | ||
- **Speed**: Slow for large datasets. | ||
- **Use Case**: Best for small datasets where accuracy is the top priority. | ||
- **Examples**: Euclidean distance, cosine similarity, dot product calculations. | ||
|
||
## 2. Approximate Nearest Neighbor (ANN) Search | ||
- **Description**: Trades off some accuracy for faster search speeds using probabilistic techniques. | ||
- **Accuracy**: High, but not exact. | ||
- **Speed**: Much faster than brute force, especially for large datasets. | ||
- **Use Case**: Real-time applications where slight reductions in accuracy are acceptable. | ||
- **Examples**: | ||
- **HNSW**: Builds a graph where nodes are vectors and edges represent similarity. | ||
- **IVF**: Clusters vectors and searches within relevant clusters. | ||
- **FAISS**: Implements HNSW and IVF for fast similarity search. | ||
- **Annoy**: Uses binary trees for approximate nearest neighbors. | ||
- **ScaNN**: Google's fast and scalable similarity search. | ||
|
||
## 3. Tree-Based Search Algorithms | ||
- **KD-Tree (K-Dimensional Tree)**: | ||
- **Description**: Recursively partitions vector space based on dimensions. | ||
- **Speed**: Fast for lower-dimensional data, less effective for high dimensions. | ||
- **Use Case**: Suitable for small to medium datasets with low dimensions. | ||
- **Ball Tree**: | ||
- **Description**: Partitions space into hyperspheres, better suited for high-dimensional data than KD-Tree. | ||
- **Use Case**: Suitable for medium-dimensional data. | ||
|
||
## 4. Graph-Based Search Algorithms | ||
- **HNSW (Hierarchical Navigable Small World)**: | ||
- **Description**: Builds a multi-layered proximity graph, navigated during search. | ||
- **Accuracy**: High, good balance of speed and accuracy. | ||
- **Speed**: Fast for insertions and queries. | ||
- **Use Case**: Very large datasets where approximate results are acceptable. | ||
- **NSW (Navigable Small World)**: | ||
- **Description**: A simpler, less efficient version of HNSW. | ||
- **Use Case**: Moderately large datasets. | ||
|
||
## 5. Inverted File Index (IVF) | ||
- **Description**: Clusters vectors (e.g., using k-means) and searches relevant clusters to reduce the search space. | ||
- **Accuracy**: High if combined with precise clustering. | ||
- **Speed**: Fast, especially with a large number of clusters. | ||
- **Use Case**: Large-scale datasets when combined with approximate search. | ||
- **Examples**: IVF is often combined with PQ and HNSW. | ||
|
||
## 6. Product Quantization (PQ) | ||
- **Description**: Compresses vectors into smaller codes for distance calculations, accelerating search. | ||
- **Accuracy**: High, but with some loss due to quantization errors. | ||
- **Speed**: Very fast, especially when combined with other indexing methods. | ||
- **Use Case**: Large datasets with priorities on storage and retrieval speed. | ||
- **Example**: Often used in FAISS, combined with IVF for enhanced performance. | ||
|
||
## 7. LSH (Locality Sensitive Hashing) | ||
- **Description**: Hashes vectors into buckets, comparing only vectors in the same bucket during a query. | ||
- **Accuracy**: Medium (approximate method). | ||
- **Speed**: Very fast, especially for high-dimensional data. | ||
- **Use Case**: Suitable for very high-dimensional data and fast, approximate results. | ||
- **Example**: Used in image retrieval and nearest neighbor search engines. | ||
|
||
## 8. Multi-Probe LSH | ||
- **Description**: Extends LSH by probing multiple buckets to improve accuracy. | ||
- **Accuracy**: Higher than basic LSH, still approximate. | ||
- **Speed**: Slightly slower than LSH, but still fast. | ||
- **Use Case**: A compromise between LSH’s speed and accuracy. | ||
|
||
## 9. Hybrid Search Algorithms | ||
- **Description**: Combines multiple techniques to optimize speed and accuracy. | ||
- **Examples**: | ||
- **FAISS**: Combines IVF, HNSW, and PQ. | ||
- **Qdrant**: Uses HNSW and integrates with precise vector models. | ||
- **Weaviate, Pinecone, Milvus**: Use configurable mixtures of approximate and exact methods. | ||
|
||
--- | ||
|
||
# Summary of Key Algorithms by Use Case | ||
|
||
- **Brute Force**: Best for small datasets and when accuracy is crucial. | ||
- **ANN (Approximate Nearest Neighbor)**: Best for large datasets, balancing speed and accuracy. | ||
- **HNSW**: Popular ANN algorithm for high-dimensional data. | ||
- **IVF**: Works well for large-scale, high-dimensional data when combined with PQ. | ||
- **LSH**: Suitable for extremely high-dimensional data when fast, approximate searches are needed. | ||
- **PQ (Product Quantization)**: Compresses vectors and speeds up similarity calculations. | ||
|
||
--- | ||
|
||
# Choosing an Algorithm | ||
|
||
- **Dataset Size**: Large-scale datasets often require approximate algorithms for speed. | ||
- **Dimensionality**: High-dimensional data benefits from algorithms like HNSW or LSH. | ||
- **Query Latency**: If low-latency retrieval is critical, approximate methods like ANN and PQ are preferred. | ||
- **Accuracy**: Exact methods like brute force are ideal for small datasets, while ANN methods offer a trade-off between speed and accuracy for larger datasets. |