RAG Guide

Antonio Fin edited this page Nov 10, 2024 · 1 revision

In the LL-Mesh platform, Retrieval-Augmented Generation (RAG) is a crucial process that enhances the capabilities of language models by providing them with access to external knowledge. RAG is divided into two main stages: Injection and Retrieval. Each stage is carefully designed to standardize and optimize the use of data, ensuring that the generated content is both relevant and accurate. The RAG services described below cover both stages.

All these services are implemented following the Factory Design Pattern. Configuration settings and details of the general service can be found in the abstract base class, while instance-specific settings and results are documented within each specific implementation file.
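The pattern can be illustrated with a minimal sketch: a create() method selects a concrete implementation based on the 'type' field of the configuration. The class and registry names here are illustrative only, not the actual LL-Mesh internals.

```python
# Sketch of the Factory Design Pattern used by the RAG services:
# create() dispatches on the 'type' field of the configuration.

class DataExtractorBase:
    """Abstract base class holding the general service contract."""
    def parse(self, file_path):
        raise NotImplementedError

class PdfExtractor(DataExtractorBase):
    def parse(self, file_path):
        return f"parsed {file_path} as PDF"

class DocxExtractor(DataExtractorBase):
    def parse(self, file_path):
        return f"parsed {file_path} as DOCX"

class DataExtractorFactory:
    # Map configuration 'type' values to concrete implementations.
    _registry = {
        'Pdf': PdfExtractor,
        'Docx': DocxExtractor,
    }

    @classmethod
    def create(cls, config):
        extractor_cls = cls._registry[config['type']]
        return extractor_cls()

extractor = DataExtractorFactory.create({'type': 'Pdf'})
print(extractor.parse('report.pdf'))  # parsed report.pdf as PDF
```

This keeps calling code decoupled from concrete classes: swapping implementations is a configuration change, not a code change.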

Injection Process

The injection process involves preparing and integrating data into a storage system where it can be efficiently retrieved during the generation process. This process is abstracted into several key steps:

Extract

  • Data Gathering: Collect data from various sources such as DOCX, PDF, or other formats.
  • Conversion: Transform the gathered data into a common format, typically JSON, to ensure consistency.

Transform

  • Clean: Remove irrelevant or redundant information from the data, focusing on essential content.
  • Enrich Metadata: Add useful metadata that can enhance the searchability and context of the data during retrieval.
  • Transform with LLM: Convert the cleaned data into useful formats such as summaries, question-and-answer pairs, or other structured outputs that facilitate easier retrieval and use.

Load

  • Storage Injection: Inject the transformed data into the selected storage solution, such as a vector database.
  • Adaptation: Further adapt the data as needed, for example by chunking it into smaller pieces. This ensures that the data is optimized for efficient retrieval and can be accessed in a standardized format during the generation process.
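The three injection stages above can be sketched in plain Python. This is a conceptual mock of Extract, Transform, and Load, not the platform's implementation; the real services delegate each stage to a configured component.

```python
# Conceptual sketch of the injection pipeline: Extract -> Transform -> Load.

def extract(raw_document):
    # Extract: convert a raw document into a common JSON-like structure.
    return [{"text": line, "metadata": {}}
            for line in raw_document.splitlines() if line.strip()]

def transform(elements):
    # Transform: clean out short sections and enrich metadata.
    cleaned = [e for e in elements if len(e["text"]) >= 10]
    for e in cleaned:
        e["metadata"]["source"] = "example"
    return cleaned

def load(storage, elements, chunk_size=2):
    # Load: chunk the elements and inject them into storage.
    for i in range(0, len(elements), chunk_size):
        storage.append(elements[i:i + chunk_size])

storage = []
doc = ("Short\n"
       "Generative AI is transforming industries.\n"
       "This document discusses the impact of AI.")
load(storage, transform(extract(doc)))
# storage now holds one chunk of two cleaned, enriched elements
```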

(Figure rag_1: overview of the injection process)

Retrieval Process

Once the data has been injected and is ready for use, the retrieval process focuses on fetching the most relevant information based on a given input query. This process ensures that the language model has access to the right data to generate accurate and contextually relevant outputs.

Search Techniques

  • Data Retrieval: Use various search techniques, such as dense or sparse retrieval methods, to fetch the most relevant data from storage.

Metadata Filtering

  • Refinement: Apply metadata filters to narrow down search results, ensuring that the retrieved data matches the specific needs of the query.

Chunk Expansion

  • Data Expansion: Expand the retrieved data chunks by section or sentences to provide a more comprehensive and contextually relevant set of information for the language model to use. This step is essential for generating outputs that accurately reflect the intended context and nuances of the input query.
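Sentence-based expansion can be sketched as follows: given the index of a retrieved sentence, return it together with its neighbours so the language model sees the surrounding context. This is a conceptual illustration of the idea behind the 'sentence_window' setting, not the platform's actual retriever code.

```python
# Expand a retrieved sentence to a window of neighbouring sentences.

def expand_by_sentences(sentences, hit_index, window=1):
    # Clamp the window to the bounds of the document.
    start = max(0, hit_index - window)
    end = min(len(sentences), hit_index + window + 1)
    return sentences[start:end]

sentences = [
    "AI adoption is accelerating.",
    "Generative AI is transforming industries.",
    "Companies are adapting their workflows.",
]
# The retrieved hit (index 1) is returned with one sentence of context
# on each side.
print(expand_by_sentences(sentences, 1, window=1))
```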

(Figure rag_2: overview of the retrieval process)

Data Extraction

Data extraction is a crucial step in the RAG process within the LL-Mesh platform, where information is gathered from various document types and prepared for further processing. The data extraction service is designed to handle a variety of formats, such as PDFs and DOCX files, and convert them into a standardized format (e.g., JSON) for use in subsequent steps like transformation and loading.

Example: Data Extraction with UnstructuredSections

Here’s an example of how you can use the UnstructuredSectionsDataExtractor class to extract data from a document.

from athon.rag import DataExtractor

# Example configuration for the Data Extractor
EXTRACTOR_CONFIG = {
    'type': 'UnstructuredSections',
    'document_type': 'Pdf',
    'cache_elements_to_file': True,
    'extract_text': True,
    'exclude_header': True,
    'exclude_footer': True,
    'extract_image': False,
    'image_output_folder': './images'
}

# Initialize the Data Extractor with the provided configuration
data_extractor = DataExtractor.create(EXTRACTOR_CONFIG)

# Parse a document file
file_path = 'example_document.pdf'
result = data_extractor.parse(file_path)

# Handle the extraction result
if result.status == "success":
    print(f"EXTRACTED ELEMENTS:\n{result.elements}")
else:
    print(f"ERROR:\n{result.error_message}")

Data Transformation

Data transformation in the RAG process within the LL-Mesh platform involves cleaning, transforming, and enriching data extracted from various sources. This stage prepares the data for loading into storage, ensuring it is optimized for retrieval and use by language models.

Example: Data Transformation with CteActionRunner

Here’s an example of how you can use the CteActionRunnerDataTransformer class to perform various data transformation actions.

from athon.rag import DataTransformer

# Example configuration for the Data Transformer
TRANSFORMER_CONFIG = {
    'type': 'CteActionRunner',
    'clean': {
        'headers_to_remove': ['Confidential', 'Draft'],
        'min_section_length': 100
    },
    'transform': {
        'llm_config': {
            'type': 'LangChainChatOpenAI',
            'api_key': 'your-api-key-here',
            'model_name': 'gpt-4o'
        },
        'system_prompt': 'Summarize the following content.',
        'transform_delimeters': ['```', '```json']
    },
    'enrich': {
        'metadata': {
            'source': 'Athon Platform',
            'processed_by': 'CteActionRunner'
        }
    }
}

# Initialize the Data Transformer with the provided configuration
data_transformer = DataTransformer.create(TRANSFORMER_CONFIG)

# Example list of extracted elements to be transformed
extracted_elements = [
    {"text": "Confidential Report on AI Development", "metadata": {"type": "Header"}},
    {"text": "AI is transforming industries worldwide...", "metadata": {"type": "Paragraph"}}
]

# Define the actions to be performed
actions = ['RemoveSectionsByHeader', 'TransformInSummary', 'EnrichMetadata']

# Process the elements
result = data_transformer.process(actions, extracted_elements)

# Handle the transformation result
if result.status == "success":
    print(f"TRANSFORMED ELEMENTS:\n{result.elements}")
else:
    print(f"ERROR:\n{result.error_message}")

Key Transformation Actions

  • Cleaning: Removes unnecessary elements such as headers or short sections.
  • Transforming: Uses an LLM to convert text into summaries, Q&A pairs, or other formats.
  • Enriching: Adds metadata to the elements to enhance their usability during retrieval.

Data Storage

Data storage is a critical component in the RAG process within the LL-Mesh platform. After data has been extracted and transformed, it must be stored efficiently to allow for quick and accurate retrieval during the generation process. The platform supports various storage strategies, including the use of specialized databases such as vector stores.

Example: Data Storage with ChromaCollection

Here’s an example of how you can use the ChromaCollectionDataStorage class to manage collections within Chroma DB.

from athon.rag import DataStorage

# Example configuration for the Data Storage
STORAGE_CONFIG = {
    'type': 'ChromaCollection',
    'path': './chroma_db',
    'collection': 'my_collection',
    'reset': True,
    'metadata': {'hnsw:space': 'cosine'},
    'embeddings_model': 'all-MiniLM-L6-v2'
}

# Initialize the Data Storage with the provided configuration
data_storage = DataStorage.create(STORAGE_CONFIG)

# Retrieve the data collection
result = data_storage.get_collection()

# Handle the retrieval result
if result.status == "success":
    print(f"COLLECTION RETRIEVED:\n{result.collection}")
else:
    print(f"ERROR:\n{result.error_message}")

Data Loader

The data loader is a crucial component in the RAG process within the LL-Mesh platform. After data has been extracted and transformed, and a storage collection has been set up, the final step involves loading the data into that collection. This ensures that the data is readily available for retrieval and use by language models.

Example: Data Loader with ChromaForSentences

Here’s an example of how you can use the ChromaForSentenceDataLoader class to insert data into a Chroma DB collection.

from athon.rag import DataLoader

# Example configuration for the Data Loader
LOADER_CONFIG = {
    'type': 'ChromaForSentences'
}

# Initialize the Data Loader with the provided configuration
data_loader = DataLoader.create(LOADER_CONFIG)

# Example collection (retrieved from a DataStorage instance)
collection = data_storage.get_collection().collection

# Example list of elements to be inserted
elements = [
    {"text": "Generative AI is transforming industries.", "metadata": {"category": "AI", "importance": "high"}},
    {"text": "This document discusses the impact of AI.", "metadata": {"category": "AI", "importance": "medium"}}
]

# Insert the elements into the collection
result = data_loader.insert(collection, elements)

# Handle the insertion result
if result.status == "success":
    print("Data successfully inserted into the collection.")
else:
    print(f"ERROR:\n{result.error_message}")

Data Retriever

The data retriever is a key component within the RAG process, responsible for fetching relevant data from the storage based on user queries. This process involves not just searching for the most relevant chunks of data, but also expanding and refining those results to provide comprehensive and contextually relevant information.

Example: Data Retrieval with ChromaForSentences

Here’s an example of how you can use the ChromaForSentenceDataRetriever class to retrieve data from a Chroma DB collection:

from athon.rag import DataRetriever

# Example configuration for the Data Retriever
RETRIEVER_CONFIG = {
    'type': 'ChromaForSentences',
    'expansion_type': 'Section',
    'sentence_window': 3,
    'n_results': 10,
    'include': ['documents', 'metadatas']
}

# Initialize the Data Retriever with the provided configuration
data_retriever = DataRetriever.create(RETRIEVER_CONFIG)

# Example collection (retrieved from a DataStorage instance)
collection = data_storage.get_collection().collection

# Example query to search within the collection
query = "What is the impact of Generative AI on industries?"

# Retrieve the relevant data based on the query
result = data_retriever.select(collection, query)

# Handle the retrieval result
if result.status == "success":
    for element in result.elements:
        print(f"TEXT:\n{element['text']}\nMETADATA:\n{element['metadata']}\n")
else:
    print(f"ERROR:\n{result.error_message}")