Skip to content

Commit

Permalink
updated markdown
Browse files Browse the repository at this point in the history
  • Loading branch information
S4nfs committed Sep 27, 2024
1 parent 885baa1 commit fb6bcc0
Showing 1 changed file with 96 additions and 0 deletions.
96 changes: 96 additions & 0 deletions AI/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,96 @@


# Search Algorithms in Vector Stores
In vector databases (or vector stores), search algorithms are used to efficiently retrieve the most similar vectors to a given query vector, often as part of a similarity search or nearest neighbor search.

## 1. Brute Force Search (Exact Nearest Neighbor Search)
- **Description**: Compares the query vector with all vectors in the database to find the exact nearest neighbors.
- **Accuracy**: 100% accurate (exhaustive search).
- **Speed**: Slow for large datasets.
- **Use Case**: Best for small datasets where accuracy is the top priority.
- **Examples**: Euclidean distance, cosine similarity, dot product calculations.

## 2. Approximate Nearest Neighbor (ANN) Search
- **Description**: Trades off some accuracy for faster search speeds using probabilistic techniques.
- **Accuracy**: High, but not exact.
- **Speed**: Much faster than brute force, especially for large datasets.
- **Use Case**: Real-time applications where slight reductions in accuracy are acceptable.
- **Examples**:
- **HNSW**: Builds a graph where nodes are vectors and edges represent similarity.
- **IVF**: Clusters vectors and searches within relevant clusters.
- **FAISS**: Implements HNSW and IVF for fast similarity search.
- **Annoy**: Uses binary trees for approximate nearest neighbors.
- **ScaNN**: Google's fast and scalable similarity search.

## 3. Tree-Based Search Algorithms
- **KD-Tree (K-Dimensional Tree)**:
- **Description**: Recursively partitions vector space based on dimensions.
- **Speed**: Fast for lower-dimensional data, less effective for high dimensions.
- **Use Case**: Suitable for small to medium datasets with low dimensions.
- **Ball Tree**:
- **Description**: Partitions space into hyperspheres, better suited for high-dimensional data than KD-Tree.
- **Use Case**: Suitable for medium-dimensional data.

## 4. Graph-Based Search Algorithms
- **HNSW (Hierarchical Navigable Small World)**:
- **Description**: Builds a multi-layered proximity graph, navigated during search.
- **Accuracy**: High, good balance of speed and accuracy.
- **Speed**: Fast for insertions and queries.
- **Use Case**: Very large datasets where approximate results are acceptable.
- **NSW (Navigable Small World)**:
- **Description**: A simpler, less efficient version of HNSW.
- **Use Case**: Moderately large datasets.

## 5. Inverted File Index (IVF)
- **Description**: Clusters vectors (e.g., using k-means) and searches relevant clusters to reduce the search space.
- **Accuracy**: High if combined with precise clustering.
- **Speed**: Fast, especially with a large number of clusters.
- **Use Case**: Large-scale datasets when combined with approximate search.
- **Examples**: IVF is often combined with PQ and HNSW.

## 6. Product Quantization (PQ)
- **Description**: Compresses vectors into smaller codes for distance calculations, accelerating search.
- **Accuracy**: High, but with some loss due to quantization errors.
- **Speed**: Very fast, especially when combined with other indexing methods.
- **Use Case**: Large datasets with priorities on storage and retrieval speed.
- **Example**: Often used in FAISS, combined with IVF for enhanced performance.

## 7. LSH (Locality Sensitive Hashing)
- **Description**: Hashes vectors into buckets, comparing only vectors in the same bucket during a query.
- **Accuracy**: Medium (approximate method).
- **Speed**: Very fast, especially for high-dimensional data.
- **Use Case**: Suitable for very high-dimensional data and fast, approximate results.
- **Example**: Used in image retrieval and nearest neighbor search engines.

## 8. Multi-Probe LSH
- **Description**: Extends LSH by probing multiple buckets to improve accuracy.
- **Accuracy**: Higher than basic LSH, still approximate.
- **Speed**: Slightly slower than LSH, but still fast.
- **Use Case**: A compromise between LSH’s speed and accuracy.

## 9. Hybrid Search Algorithms
- **Description**: Combines multiple techniques to optimize speed and accuracy.
- **Examples**:
- **FAISS**: Combines IVF, HNSW, and PQ.
- **Qdrant**: Uses HNSW and integrates with precise vector models.
- **Weaviate, Pinecone, Milvus**: Use configurable mixtures of approximate and exact methods.

---

# Summary of Key Algorithms by Use Case

- **Brute Force**: Best for small datasets and when accuracy is crucial.
- **ANN (Approximate Nearest Neighbor)**: Best for large datasets, balancing speed and accuracy.
- **HNSW**: Popular ANN algorithm for high-dimensional data.
- **IVF**: Works well for large-scale, high-dimensional data when combined with PQ.
- **LSH**: Suitable for extremely high-dimensional data when fast, approximate searches are needed.
- **PQ (Product Quantization)**: Compresses vectors and speeds up similarity calculations.

---

# Choosing an Algorithm

- **Dataset Size**: Large-scale datasets often require approximate algorithms for speed.
- **Dimensionality**: High-dimensional data benefits from algorithms like HNSW or LSH.
- **Query Latency**: If low-latency retrieval is critical, approximate methods like ANN and PQ are preferred.
- **Accuracy**: Exact methods like brute force are ideal for small datasets, while ANN methods offer a trade-off between speed and accuracy for larger datasets.

0 comments on commit fb6bcc0

Please sign in to comment.