Add SLM Chunking samples #217

Open. Wants to merge 1 commit into base: main
Binary file added .DS_Store
Contributor comment: Please remove Mac files.

Binary file not shown.
1 change: 1 addition & 0 deletions README.md
@@ -44,6 +44,7 @@ This project provides the following custom skills:
| [Embeddings](Vector/EmbeddingGenerator/README.md) | Generates vector embeddings with the [HuggingFace all-MiniLM-L6-v2 model](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) | Vector | ![python](https://img.shields.io/badge/language-python-orange) | ![functions](https://img.shields.io/badge/deploy-Functions-blue) | Manual |
| [HelloWorld](Template/HelloWorld/README.md) | A minimal skill that can be used as a starting point or template for your own skills. | Template | ![C#](https://img.shields.io/badge/language-C%23-brightgreen) | ![functions](https://img.shields.io/badge/deploy-Functions-blue) | ARM Template |
| [PythonFastAPI](Template/PythonFastAPI/README.md) | A production web server and api scaffold for a python power skill | Template | ![python](https://img.shields.io/badge/language-python-orange) | ![docker](https://img.shields.io/badge/deploy-Docker-blueviolet) | Terraform template |
| [SLM Chunking](./SLMPhi3ChunkingSkill/README.md) | Uses Microsoft Phi-3 to chunk documents | Utility | ![python](https://img.shields.io/badge/language-python-orange) | ![functions](https://img.shields.io/badge/deploy-Functions-blue) | Manual |



Binary file added SLMPhi3ChunkingSkill/.DS_Store
Binary file not shown.
74 changes: 74 additions & 0 deletions SLMPhi3ChunkingSkill/README.md
@@ -0,0 +1,74 @@
# **Using an SLM (Phi-3) to Chunk PDF Documents**

![slm](./imgs/SLMRAG.png)

This project focuses on implementing and exploring chunking techniques. It is designed to improve the efficiency and accuracy of data processing and retrieval in various applications.
Contributor comment: Chunking -> chunking


### **Key Features**

- Chunking: Learn how to segment and process large datasets using advanced chunking methods.
- Integration with Phi-3: Utilize SLMs to enhance data processing capabilities.
- Practical Examples: Follow detailed examples and use cases to understand the application of SLM chunking in real-world scenarios.

### **Getting Started**

To get started, you'll need:
- Basic knowledge of Python and data processing.
- Access to Azure services for implementing AI-driven solutions.

### **Highlights**

- Comprehensive Documentation: Detailed notebook guides and documentation to help you understand and implement chunking.

### **Architecture**

Retriever-reader architecture for open-domain question answering (a RAG solution).

Here's a breakdown of the process depicted:

1. Question Input: The process starts with a question, represented by a magnifying glass and a question mark.

2. Extract: This stage involves extracting relevant information from various sources, such as documents, the web, and text snippets.

3. Chunking (Phi-3): The extracted data is processed and segmented using a method called SLM chunking. This stage is represented by overlapping circles with mathematical symbols, indicating data processing.

4. Azure AI Service: The processed data is then fed into an AI service provided by Azure to generate answers.
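Taken together, the four stages above can be sketched as a simple pipeline. All function names here are illustrative stubs, not APIs from this repository; in the real skill they would call pypdf/Document Intelligence, Phi-3, and Azure AI services respectively.

```python
# Illustrative skeleton of the retriever-reader pipeline described above.

def extract(source: str) -> list[str]:
    # Stage 2: pull raw text snippets out of documents, the web, etc.
    return [source]

def chunk_with_slm(snippets: list[str]) -> list[str]:
    # Stage 3: an SLM (Phi-3) would re-segment the snippets here;
    # this stub just drops empty entries and trims whitespace.
    return [s.strip() for s in snippets if s.strip()]

def answer(question: str, chunks: list[str]) -> str:
    # Stage 4: an Azure AI service generates the answer from the chunks.
    return f"Q: {question} | context: {' '.join(chunks)}"

if __name__ == "__main__":
    chunks = chunk_with_slm(extract("  ONNX runs on CPU and GPU.  "))
    print(answer("What hardware support is required to run ONNX?", chunks))
```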


### **Steps**

1. Extract information, including text, images, and tables, from the PDF document. pypdf is used for text and images, and Azure Document Intelligence is used for tables.

2. Chunk the extracted text and tables with Phi-3/Phi-3.5 Instruct. The prompt is critical: it must preserve contextual coherence and keep as much of the original wording as possible.

***Samples***

- [Chunking text](./prototype/01.chunking_text.ipynb)
- [Chunking table](./prototype/03.chunking_table.ipynb)

Contributor comment: Please put the links directly on "chunking text", "chunking table" and "chunking image"
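A chunking prompt for step 2 could be assembled along these lines. The exact wording below is an assumption for illustration, not the prompt used in the notebooks:

```python
def build_chunking_prompt(text: str, max_chunk_words: int = 200) -> str:
    """Assemble an instruction prompt asking Phi-3 to segment text into
    coherent chunks while preserving the original wording."""
    return (
        "Split the following text into self-contained chunks of at most "
        f"{max_chunk_words} words each. Keep the original wording; do not "
        "summarize. Separate chunks with the line '---'.\n\n"
        f"Text:\n{text}"
    )
```

The returned string would be sent to the Phi-3/Phi-3.5 Instruct endpoint, and the response split on the `---` separator to recover the chunks.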

3. Interpret the extracted images with Phi-3/Phi-3.5 Vision.


***Samples***

- [Chunking image](./prototype/03.chunking_imgs.ipynb)

4. Reorganize and merge the text, images, and tables to produce the final chunks.

5. Generate embeddings with the Azure OpenAI Service text-ada-embedding model and save them to the Azure AI Search index.
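Step 5 amounts to pairing each chunk with its embedding and uploading the pairs to the search index. The field names `chunking` and `chunking_vector` match those queried by the retrieval tool in this PR; the rest of the document shape (the `id` field, the upload call) is an assumption:

```python
def build_search_documents(chunks: list[str], vectors: list[list[float]]) -> list[dict]:
    """Pair each chunk with its embedding vector, producing documents shaped
    for an Azure AI Search index with fields 'chunking' and 'chunking_vector'.
    Upload would then be a single call such as
    search_client.upload_documents(documents=docs)."""
    if len(chunks) != len(vectors):
        raise ValueError("each chunk needs exactly one embedding vector")
    return [
        {"id": str(i), "chunking": chunk, "chunking_vector": vector}
        for i, (chunk, vector) in enumerate(zip(chunks, vectors))
    ]
```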


**This is the SLM chunking flow diagram:**

Contributor comment: "this is the SLM trunking flow diagram"


![flow](./imgs/SLMFlow.png)



***Samples***

- [SLM chunking flow](./code/slm-chunking-flow/)
- [SLM chunking chat](./code/slm-chunking-chat/)

Contributor comment: Same, avoid "click here" links.


Binary file added SLMPhi3ChunkingSkill/code/.DS_Store
Binary file not shown.
37 changes: 37 additions & 0 deletions SLMPhi3ChunkingSkill/code/slm-chunking-chat/flow.dag.yaml
@@ -0,0 +1,37 @@
inputs:
  question:
    type: string
    default: What hardware support is required to run ONNX?
outputs:
  answer:
    type: string
    reference: ${aoai_gen_result.output}
nodes:
- name: question_embedding
  type: python
  source:
    type: package
    tool: promptflow.tools.embedding.embedding
  inputs:
    connection: AOAIConnection
    input: ${inputs.question}
- name: search_azure_ai_search
  type: python
  source:
    type: code
    path: search_azure_ai_search.py
  inputs:
    vector: ${question_embedding.output}
    aisearchconn: AzureAISearchConn
- name: aoai_gen_result
  type: llm
  source:
    type: code
    path: aoai_gen_result.jinja2
  inputs:
    deployment_name: GPT4OModel
    temperature: 0.6
    max_tokens: 500
    question: ${search_azure_ai_search.output}
  connection: AOAIConnection
  api: chat
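The `aoai_gen_result.jinja2` template referenced by the final node is not shown in this diff. A hypothetical stand-in, rendered with the `question` variable the flow wires in from the search node's output, might look like this (the wording is purely illustrative):

```python
from jinja2 import Template

# Hypothetical stand-in for aoai_gen_result.jinja2; the real template
# is not included in this diff. Note the flow passes the retrieved
# context into a variable named 'question'.
PROMPT = Template(
    "system:\nAnswer using only the retrieved context below.\n\n"
    "user:\nContext:\n{{ question }}"
)

print(PROMPT.render(question="ONNX runs on CPU and GPU."))
```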
@@ -0,0 +1,33 @@

from azure.core.credentials import AzureKeyCredential
from azure.search.documents.models import VectorizedQuery
from azure.search.documents import SearchClient
from promptflow.connections import CognitiveSearchConnection
from promptflow.core import tool


# The inputs section will change based on the arguments of the tool function after you save the code.
# Adding types to arguments and the return value helps the system display them properly.
@tool
def search_azure_ai_search(vector: list, aisearchconn: CognitiveSearchConnection) -> str:
    # Query the 'slmindex' index for the 3 nearest chunks to the question embedding.
    search_client = SearchClient(
        endpoint=aisearchconn.api_base,
        index_name='slmindex',
        credential=AzureKeyCredential(aisearchconn.api_key),
    )

    vector_query = VectorizedQuery(vector=vector, k_nearest_neighbors=3, fields="chunking_vector")

    results = search_client.search(
        search_text=None,
        vector_queries=[vector_query],
        select=["chunking"],
    )

    # Concatenate the retrieved chunks into a single context string.
    content = ''
    for result in results:
        content += f"{result['chunking']}"
        print(f"Chunking: {result['chunking']}")
        print(f"Score: {result['@search.score']}")

    return content
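The result-concatenation loop in the tool can be exercised in isolation, with plain dicts standing in for Azure AI Search results (the helper name is illustrative, not part of the tool file):

```python
def concat_chunkings(results: list[dict]) -> str:
    """Mirror of the loop in search_azure_ai_search: join the 'chunking'
    field of each search result into one context string, in rank order."""
    return "".join(r["chunking"] for r in results)
```

Note that the chunks are joined with no separator, exactly as in the tool above; inserting a newline or delimiter between chunks may help the downstream LLM distinguish them.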
