RAG Bootcamp

This is a collection of reference implementations for Vector Institute's RAG (Retrieval-Augmented Generation) Bootcamp, scheduled to take place from Nov 2024 to Jan 2025. It demonstrates common methodologies used in RAG workflows (data ingestion, chunking, embeddings, vector databases, sparse/dense retrieval, reranking) using the popular Python libraries LangChain and LlamaIndex.
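
To make two of the steps listed above concrete, here is a minimal sketch of chunking and sparse retrieval in plain Python. It does not use LangChain or LlamaIndex and is not taken from the notebooks; the function names (`chunk_text`, `tokenize`, `retrieve`), the word-window chunking, and the simple TF-IDF scoring are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def chunk_text(text: str, chunk_size: int = 40, overlap: int = 10) -> list[str]:
    """Split text into overlapping windows of `chunk_size` words."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def tokenize(text: str) -> list[str]:
    """Lowercase the text and keep alphanumeric runs as terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def retrieve(query: str, chunks: list[str], top_k: int = 3) -> list[tuple[float, str]]:
    """Rank chunks by a simple TF-IDF score (a sparse retrieval baseline)."""
    tokenized = [tokenize(chunk) for chunk in chunks]
    n_chunks = len(chunks)
    # Number of chunks that contain each term at least once.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    query_terms = set(tokenize(query))
    scored = []
    for chunk, tokens in zip(chunks, tokenized):
        counts = Counter(tokens)
        score = sum(
            counts[term] * math.log(1 + n_chunks / doc_freq[term])
            for term in query_terms
            if term in counts
        )
        scored.append((score, chunk))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]
```

Dense retrieval follows the same shape, with the term-based score replaced by a similarity between embedding vectors. A reranker would then reorder the `top_k` results with a more expensive model.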

Reference Implementations

This repository includes several reference implementations showing different approaches and methodologies related to Retrieval-Augmented Generation.

Web Search: Popular LLMs like OpenAI's GPT-4o and Meta's Llama-3 are very good at processing natural language, but their knowledge is limited to the data they were trained on. As of November 2024, neither model can correctly answer the question "Who won the 2024 World Series of Baseball?" on its own. Web search results can fill that gap.
Document Search: Use a collection of unstructured documents to answer domain-specific questions, like: "How many AI scholarships did Vector Institute award in 2022?"
SQL Search: Answer natural language questions with information from structured relational data. This demo uses a financial dataset from a Portuguese banking institution, available on Kaggle.
Cloud Search: Retrieve information from data in a cloud service, in this example AWS S3 storage.
PubMed QA: A full pipeline on the PubMed dataset demonstrating ingestion, embeddings, vector index/storage, retrieval, reranking, with a focus on evaluation metrics.
RAG Evaluation: RAG evaluation techniques based on the Ragas framework. Focuses on evaluation "test sets" and how to use these to determine how well a RAG pipeline is actually working.
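
As a rough illustration of how an evaluation test set can be used, the sketch below computes a hit rate: the fraction of test questions for which the expected document appears among the top results. It does not use the Ragas API and is not the metric set from the notebook; the `hit_rate` function, the test-set field names (`question`, `expected_doc_id`), and the toy retriever are hypothetical.

```python
from typing import Callable


def hit_rate(
    test_set: list[dict],
    retriever: Callable[[str], list[str]],
    k: int = 3,
) -> float:
    """Fraction of questions whose expected document ID is in the top-k results."""
    if not test_set:
        raise ValueError("test_set must not be empty")
    hits = 0
    for example in test_set:
        retrieved_ids = retriever(example["question"])[:k]
        if example["expected_doc_id"] in retrieved_ids:
            hits += 1
    return hits / len(test_set)


# Hypothetical test set and retriever, for illustration only.
toy_test_set = [
    {"question": "q1", "expected_doc_id": "doc_a"},
    {"question": "q2", "expected_doc_id": "doc_b"},
]
toy_results = {
    "q1": ["doc_a", "doc_x"],
    "q2": ["doc_x", "doc_y", "doc_z", "doc_b"],
}


def toy_retriever(question: str) -> list[str]:
    """Return a fixed ranked list of document IDs for each known question."""
    return toy_results[question]


print(hit_rate(toy_test_set, toy_retriever, k=3))
```

A retrieval-only metric like this says nothing about the quality of the generated answer, which is why a fuller evaluation also scores the final response.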

Requirements

Python 3.10+
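
If you prefer to check the requirement from inside Python (for example at the top of a notebook), a small sketch such as the following may help. The `meets_minimum` helper is hypothetical and not part of this repository.

```python
import sys

MIN_VERSION = (3, 10)  # minimum stated in the Requirements section


def meets_minimum(version_info=None, minimum=MIN_VERSION):
    """Return True if a (major, minor, ...) version is at least `minimum`."""
    if version_info is None:
        version_info = sys.version_info
    # Compare only (major, minor); tuples compare element by element.
    return tuple(version_info[:2]) >= tuple(minimum)


if meets_minimum():
    print("Python version OK:", sys.version.split()[0])
else:
    print("Python 3.10+ is required, found", sys.version.split()[0])
```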

Git Repository

Start by cloning this git repository to a local folder:

git clone https://github.com/VectorInstitute/rag_bootcamp

[Optional] Build the virtual Python environments

These instructions only apply if you are not running this code on the Vector Institute cluster. If you are working on the Vector cluster, these environments are already built and ready to use in the /ssd003/projects/aieng/public/rag_bootcamp/envs folder.

The notebooks contained in this repository depend on several different Python environments. The following table lists the environment for each notebook:

Notebooks	Environment
Web Search, Document Search, SQL Search, Cloud Search	`rag_dataloaders`
RAG Evaluation	`rag_evaluation`
PubMed QA	`rag_pubmed_qa`
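
The same mapping can be written as a small lookup, which is handy if you script environment or kernel selection. This is only a sketch: the dictionary mirrors the table above, while the helper names `environment_for` and `notebooks_by_environment` are hypothetical.

```python
# Notebook name -> environment name, copied from the table above.
NOTEBOOK_ENVIRONMENTS = {
    "Web Search": "rag_dataloaders",
    "Document Search": "rag_dataloaders",
    "SQL Search": "rag_dataloaders",
    "Cloud Search": "rag_dataloaders",
    "RAG Evaluation": "rag_evaluation",
    "PubMed QA": "rag_pubmed_qa",
}


def environment_for(notebook: str) -> str:
    """Return the environment a notebook needs, or raise KeyError if unknown."""
    try:
        return NOTEBOOK_ENVIRONMENTS[notebook]
    except KeyError:
        raise KeyError(f"Unknown notebook: {notebook!r}") from None


def notebooks_by_environment() -> dict[str, list[str]]:
    """Group notebook names by the environment they share."""
    grouped: dict[str, list[str]] = {}
    for notebook, env in NOTEBOOK_ENVIRONMENTS.items():
        grouped.setdefault(env, []).append(notebook)
    return grouped
```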

Build these environments using the following instructions:

python3 --version # Make sure this shows Python 3.10+!

# Install the dataloaders environment
python3 -m venv ./rag_dataloaders
source rag_dataloaders/bin/activate
python3 -m pip install -r ./envs/rag_dataloaders/requirements.txt
deactivate

# Install the evaluation environment
python3 -m venv ./rag_evaluation
source rag_evaluation/bin/activate
python3 -m pip install -r ./envs/rag_evaluation/requirements.txt
deactivate

# Install the pubmed_qa environment
python3 -m venv ./rag_pubmed_qa
source rag_pubmed_qa/bin/activate
python3 -m pip install -r ./envs/rag_pubmed_qa/requirements.txt
deactivate

Add the Jupyter notebook kernels

Each notebook runs in a Jupyter kernel backed by one of the environments above. You can make these kernels available to Jupyter with the following instructions:

# The following path is for use on the Vector cluster. If you are using a different environment, update this accordingly.
export RAG_BOOTCAMP_ENV="/ssd003/projects/aieng/public/rag_bootcamp/envs"

source $RAG_BOOTCAMP_ENV/rag_dataloaders/bin/activate
ipython kernel install --user --name=rag_dataloaders
deactivate

source $RAG_BOOTCAMP_ENV/rag_evaluation/bin/activate
ipython kernel install --user --name=rag_evaluation
deactivate

source $RAG_BOOTCAMP_ENV/rag_pubmed_qa/bin/activate
ipython kernel install --user --name=rag_pubmed_qa
deactivate

Lastly, start a Jupyter notebook

# The following path is for use on the Vector cluster. If you are using a different environment, update this accordingly.
export RAG_BOOTCAMP_ENV="/ssd003/projects/aieng/public/rag_bootcamp/envs"

source $RAG_BOOTCAMP_ENV/<env_to_be_used>/bin/activate
jupyter notebook --ip $(hostname --fqdn)
