updated markdown

S4nfs · Sep 27, 2024 · fb6bcc0 · fb6bcc0
1 parent 885baa1
commit fb6bcc0
Showing 1 changed file with 96 additions and 0 deletions.
diff --git a/AI/README.md b/AI/README.md
@@ -0,0 +1,96 @@
+
+
+# Search Algorithms in Vector Stores
+In vector databases (or vector stores), search algorithms are used to efficiently retrieve the most similar vectors to a given query vector, often as part of a similarity search or nearest neighbor search.
+
+## 1. Brute Force Search (Exact Nearest Neighbor Search)
+- **Description**: Compares the query vector with all vectors in the database to find the exact nearest neighbors.
+- **Accuracy**: 100% accurate (exhaustive search).
+- **Speed**: Slow for large datasets.
+- **Use Case**: Best for small datasets where accuracy is the top priority.
+- **Examples**: Euclidean distance, cosine similarity, dot product calculations.
+
+## 2. Approximate Nearest Neighbor (ANN) Search
+- **Description**: Trades off some accuracy for faster search speeds using probabilistic techniques.
+- **Accuracy**: High, but not exact.
+- **Speed**: Much faster than brute force, especially for large datasets.
+- **Use Case**: Real-time applications where slight reductions in accuracy are acceptable.
+- **Examples**:
+  - **HNSW**: Builds a graph where nodes are vectors and edges represent similarity.
+  - **IVF**: Clusters vectors and searches within relevant clusters.
+  - **FAISS**: Implements HNSW and IVF for fast similarity search.
+  - **Annoy**: Uses binary trees for approximate nearest neighbors.
+  - **ScaNN**: Google's fast and scalable similarity search.
+
+## 3. Tree-Based Search Algorithms
+- **KD-Tree (K-Dimensional Tree)**:
+  - **Description**: Recursively partitions vector space based on dimensions.
+  - **Speed**: Fast for lower-dimensional data, less effective for high dimensions.
+  - **Use Case**: Suitable for small to medium datasets with low dimensions.
+- **Ball Tree**:
+  - **Description**: Partitions space into hyperspheres, better suited for high-dimensional data than KD-Tree.
+  - **Use Case**: Suitable for medium-dimensional data.
+
+## 4. Graph-Based Search Algorithms
+- **HNSW (Hierarchical Navigable Small World)**:
+  - **Description**: Builds a multi-layered proximity graph, navigated during search.
+  - **Accuracy**: High, good balance of speed and accuracy.
+  - **Speed**: Fast for insertions and queries.
+  - **Use Case**: Very large datasets where approximate results are acceptable.
+- **NSW (Navigable Small World)**:
+  - **Description**: A simpler, less efficient version of HNSW.
+  - **Use Case**: Moderately large datasets.
+
+## 5. Inverted File Index (IVF)
+- **Description**: Clusters vectors (e.g., using k-means) and searches relevant clusters to reduce the search space.
+- **Accuracy**: High if combined with precise clustering.
+- **Speed**: Fast, especially with a large number of clusters.
+- **Use Case**: Large-scale datasets when combined with approximate search.
+- **Examples**: IVF is often combined with PQ and HNSW.
+
+## 6. Product Quantization (PQ)
+- **Description**: Compresses vectors into smaller codes for distance calculations, accelerating search.
+- **Accuracy**: High, but with some loss due to quantization errors.
+- **Speed**: Very fast, especially when combined with other indexing methods.
+- **Use Case**: Large datasets with priorities on storage and retrieval speed.
+- **Example**: Often used in FAISS, combined with IVF for enhanced performance.
+
+## 7. LSH (Locality Sensitive Hashing)
+- **Description**: Hashes vectors into buckets, comparing only vectors in the same bucket during a query.
+- **Accuracy**: Medium (approximate method).
+- **Speed**: Very fast, especially for high-dimensional data.
+- **Use Case**: Suitable for very high-dimensional data and fast, approximate results.
+- **Example**: Used in image retrieval and nearest neighbor search engines.
+
+## 8. Multi-Probe LSH
+- **Description**: Extends LSH by probing multiple buckets to improve accuracy.
+- **Accuracy**: Higher than basic LSH, still approximate.
+- **Speed**: Slightly slower than LSH, but still fast.
+- **Use Case**: A compromise between LSH’s speed and accuracy.
+
+## 9. Hybrid Search Algorithms
+- **Description**: Combines multiple techniques to optimize speed and accuracy.
+- **Examples**:
+  - **FAISS**: Combines IVF, HNSW, and PQ.
+  - **Qdrant**: Uses HNSW and integrates with precise vector models.
+  - **Weaviate, Pinecone, Milvus**: Use configurable mixtures of approximate and exact methods.
+
+---
+
+# Summary of Key Algorithms by Use Case
+
+- **Brute Force**: Best for small datasets and when accuracy is crucial.
+- **ANN (Approximate Nearest Neighbor)**: Best for large datasets, balancing speed and accuracy.
+- **HNSW**: Popular ANN algorithm for high-dimensional data.
+- **IVF**: Works well for large-scale, high-dimensional data when combined with PQ.
+- **LSH**: Suitable for extremely high-dimensional data when fast, approximate searches are needed.
+- **PQ (Product Quantization)**: Compresses vectors and speeds up similarity calculations.
+
+---
+
+# Choosing an Algorithm
+
+- **Dataset Size**: Large-scale datasets often require approximate algorithms for speed.
+- **Dimensionality**: High-dimensional data benefits from algorithms like HNSW or LSH.
+- **Query Latency**: If low-latency retrieval is critical, approximate methods like ANN and PQ are preferred.
+- **Accuracy**: Exact methods like brute force are ideal for small datasets, while ANN methods offer a trade-off between speed and accuracy for larger datasets.