Vector house - this is where the fun happens.
To see more, please read the project report.
Vector house is a vector based search engine used to search en variant of Wikipedia. It was developed in 2023 by Naďa Fučelová and Petr Laštovička during the BI-VWM (Web and multimedia db searching) course at FIT CTU.
It the site is still up, you can try it out at [https://lastope2.sh.cvut.cz/vector_house/].
Download the latest dump from wiki
https://dumps.wikimedia.org/enwiki/
and extract it in the wiki-data
folder.
Create a virtual env python -m venv .venv
, source it source .venv/bin/activate
and install requirements
pip install -r requirements.txt
.
To open the page go to use streamlit run vector_house/page.py
To view help, run python -m vector_house --help
or ./run --help
.
All the commands below use the default database wiki-index.db
unless you specify another one by using the --db path
option.
To search the database run ./run search query
.
To search for similar documents use
./run sim doc_id
where the doc_id
is returned by
the search
or sim
function.
To view a found document, run ./run show doc_id
.
To create an index run this cli command ./run index
.
If you want to limit the number of words processed in each document,
add also the flag --limit
with the number of words.
The default limit is 42069 words.
If you want for each term store only the top n documents
with the highest score use --top-docs
option.
Otherwise the count is not limited.
Index size (doc count) is set to 8000 by default. You can change it with
--size
flag in combination with the index
frag.
Run ./run info
to show db internal info.
Run ./run db-index {create|drop}
to create/drop database column indexes.
Run ./run benchmark
to start auto benchmarks.
Run ./run benchmark --create-index
once before to create more different indexes.
To run tests, run the pytest vector_house
.
Vector house is licensed under the GNU GPL v3.0
license