added readme
marsupialtail committed Apr 18, 2024
1 parent 5f18a9f commit 096a01e
Showing 1 changed file with 22 additions and 6 deletions.
28 changes: 22 additions & 6 deletions README.md
@@ -1,6 +1,6 @@
# Rottnest : Data Lake Indices

You have your data already in Parquet format in Iceberg or Delta (or json.gz, CSV, etc.). Unfortunately, you need to do things like full text search or vector search, so you have to ETL into Clickhouse, ElasticSearch or some vector database. That is complicated and expensive. Now you have an alternative.
You don't need ElasticSearch or some vector database to do full text search or vector search. Parquet + Rottnest is all you need. Rottnest is like Postgres indices for Parquet.

## Installation

@@ -10,20 +10,36 @@ Kubernetes Operator (upcoming)

## How to use

Build indices on your Parquet files, merge them, and query them. Let's walk through a simple example.
Build indices on your Parquet files, merge them, and query them. Let's walk through a simple example in `demo.py`. It builds a BM25 index on each of two Parquet files, merges the two indices, and searches the merged index for records related to cell phones. The code is here:

### BM25
```
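import rottnest

# Build one BM25 index per Parquet file on the "body" column.
rottnest.index_file_bm25("example_data/0.parquet", "body", "index0")
rottnest.index_file_bm25("example_data/1.parquet", "body", "index1")
# Merge the two per-file indices into a single index.
rottnest.merge_index_bm25("merged_index", ["index0", "index1"])
# Search the merged index for "cell phones" with K = 10.
result = rottnest.search_index_bm25(["merged_index"], "cell phones", K = 10)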
```

This code will still work if the Parquet files are in fact **on object storage**. You can copy the data files to an S3 bucket, say `s3://example_data/`. Then the following code will work:

`rottnest.index_file_bm25(f"msmarco/chunk_{i}.parquet","body", name = f"msmarco_index/{i}")`
```
import rottnest
rottnest.index_file_bm25("s3://example_data/0.parquet", "body", "index0")
rottnest.index_file_bm25("s3://example_data/1.parquet", "body", "index1")
rottnest.merge_index_bm25("merged_index", ["index0", "index1"])
result = rottnest.search_index_bm25(["merged_index"], "cell phones", K = 10)
```

`rottnest.merge_index_bm25("msmarco_index/merged", [f"msmarco_index/{i}" for i in range(1,3)])`
The search uses the index to query the Parquet files on S3 directly. Rottnest has its own Parquet reader that makes this very efficient.
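The build-then-merge pattern above extends naturally to more than two files. The following is a minimal sketch, not part of Rottnest: `build_merged_bm25_index` and its `prefix` parameter are hypothetical names introduced here, and the helper assumes only the `index_file_bm25(path, column, name)` and `merge_index_bm25(merged_name, names)` call shapes shown in the examples above. The module is passed in as an argument so the helper itself has no import-time dependency.

```python
def build_merged_bm25_index(rottnest, paths, column, merged_name, prefix="index"):
    """Index each Parquet file, then merge the per-file indices.

    `rottnest` is the imported rottnest module (or any object exposing
    index_file_bm25 and merge_index_bm25 with the call shapes used in
    the README examples). This helper is a hypothetical convenience
    wrapper, not a Rottnest API.
    """
    names = []
    for i, path in enumerate(paths):
        # One index per input file, named index0, index1, ...
        name = f"{prefix}{i}"
        rottnest.index_file_bm25(path, column, name)
        names.append(name)
    # Merge all per-file indices into one searchable index.
    rottnest.merge_index_bm25(merged_name, names)
    return names
```

With the S3 bucket from the example, this could be called as `build_merged_bm25_index(rottnest, [f"s3://example_data/{i}.parquet" for i in range(2)], "body", "merged_index")` after `import rottnest`, and the merged index searched exactly as before.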

`result = rottnest.search_index_bm25(["msmarco_index/merged"], "politics", K = 10,query_expansion = "openai")`
Rottnest supports not only BM25 indices but also other index types, such as regex and vector search. More documentation is forthcoming.
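For readers unfamiliar with the ranking behind the BM25 examples above, here is a small in-memory sketch of the standard Okapi BM25 formula. It is illustrative only and says nothing about how Rottnest stores or scores its indices; the function names, the whitespace tokenizer, and the defaults `k1=1.5` and `b=0.75` are assumptions made for this sketch.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document in `docs` against `query` with Okapi BM25."""
    # Naive tokenizer (an assumption): lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: how many documents contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates via k1; b normalizes by length.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def top_k(query, docs, k=10):
    """Return the positions of the k highest-scoring documents."""
    scores = bm25_scores(query, docs)
    ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]
```

A document matching both query terms outranks one matching a single term, and a document matching none scores zero. An index exists to avoid this kind of full scan over every document at query time.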

### Regex

### Vector

## Architecture

![Architecture](assets/arch.png)

## Development
