Add SLM Chunking samples #217
base: main
@@ -0,0 +1,74 @@

# **Using SLM (Phi-3) for Chunking PDF Documents**

![slm](./imgs/SLMRAG.png)

This project focuses on implementing and exploring Chunking techniques. It is designed to enhance the efficiency and accuracy of data processing and retrieval across a range of applications.
Review comment: Chunking -> chunking

### **Key Features**

- Chunking: Learn how to segment and process large datasets using advanced chunking methods.
- Integration with Phi-3: Utilize SLMs to enhance data processing capabilities.
- Practical Examples: Follow detailed examples and use cases to understand the application of SLM chunking in real-world scenarios.

### **Getting Started**

To get started, you'll need:

- Basic knowledge of Python and data processing.
- Access to Azure services for implementing AI-driven solutions.

### **Highlights**

- Comprehensive Documentation: Detailed notebook guides and documentation to help you understand and implement chunking.

The architecture diagram above illustrates a Retriever-Reader architecture for open-domain question answering (a RAG solution). Here's a breakdown of the process depicted:

1. Question Input: The process starts with a question, represented by a magnifying glass and a question mark.

2. Extract: This stage involves extracting relevant information from various sources, such as documents, the web, and text snippets.

3. Chunking (Phi-3): The extracted data is processed and segmented using a method called SLM chunking. This stage is represented by overlapping circles with mathematical symbols, indicating data processing.

4. Azure AI Service: The processed data is then fed into an AI service provided by Azure, specifically an Azure OpenAI chat model (the sample flow uses a GPT-4o deployment), to generate answers.

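To make stages 1 and 4 concrete, here is a minimal sketch of the query-time path. It assumes an existing Azure AI Search index named `slmindex` with `chunking` and `chunking_vector` fields (the names used by the retrieval tool later in this PR) plus Azure OpenAI deployments for embeddings and chat; the endpoints, keys, and the embedding deployment name are placeholders, not values from this repository.

```python
import os
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizedQuery
from openai import AzureOpenAI

# Placeholder configuration -- adjust to your own resources.
aoai = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_KEY"],
    api_version="2024-02-01",
)
search = SearchClient(
    endpoint=os.environ["AZURE_SEARCH_ENDPOINT"],
    index_name="slmindex",
    credential=AzureKeyCredential(os.environ["AZURE_SEARCH_KEY"]),
)

question = "What hardware support is required to run ONNX?"

# 1. Embed the question (assumes a text-embedding-ada-002 deployment).
vector = aoai.embeddings.create(model="text-embedding-ada-002", input=question).data[0].embedding

# 2. Retrieve the most relevant chunks from Azure AI Search.
hits = search.search(
    search_text=None,
    vector_queries=[VectorizedQuery(vector=vector, k_nearest_neighbors=3, fields="chunking_vector")],
    select=["chunking"],
)
context = "\n".join(hit["chunking"] for hit in hits)

# 3. Ask the chat model to answer using the retrieved context.
answer = aoai.chat.completions.create(
    model="GPT4OModel",  # deployment name taken from the sample flow
    messages=[
        {"role": "system", "content": "Answer using only the provided context."},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ],
)
print(answer.choices[0].message.content)
```
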
### **Steps**

1. Extract information, including text, images and tables, from the PDF document. PyPDF is used for text and images, and Azure Document Intelligence is used for tables.

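A minimal sketch of this extraction step, assuming the `pypdf` and `azure-ai-formrecognizer` packages and a Document Intelligence resource; the file path, endpoint, and key are placeholders, and the repository's notebooks may organize this differently:

```python
import os
from pypdf import PdfReader
from azure.core.credentials import AzureKeyCredential
from azure.ai.formrecognizer import DocumentAnalysisClient

pdf_path = "sample.pdf"  # placeholder document

# Text and embedded images via PyPDF.
reader = PdfReader(pdf_path)
pages_text = [page.extract_text() or "" for page in reader.pages]
images = [(img.name, img.data) for page in reader.pages for img in page.images]

# Tables via Azure Document Intelligence (prebuilt layout model).
di_client = DocumentAnalysisClient(
    endpoint=os.environ["DOCINTEL_ENDPOINT"],
    credential=AzureKeyCredential(os.environ["DOCINTEL_KEY"]),
)
with open(pdf_path, "rb") as f:
    layout = di_client.begin_analyze_document("prebuilt-layout", document=f).result()

tables = []
for table in layout.tables:
    grid = [["" for _ in range(table.column_count)] for _ in range(table.row_count)]
    for cell in table.cells:
        grid[cell.row_index][cell.column_index] = cell.content
    tables.append(grid)

print(f"{len(pages_text)} pages, {len(images)} images, {len(tables)} tables extracted")
```
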
2. Slice the extracted text and tables using Phi-3/Phi-3.5 Instruct. The prompt is critical here: it must preserve the coherence of the context and keep as much of the original text as possible.

***Samples***

- Chunking Text : [Click here](./prototype/01.chunking_text.ipynb)

Review comment: Please put the links directly on "chunking text", "chunking table" and "chunking image"

- Chunking Table : [Click here](./prototype/03.chunking_table.ipynb)

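As an illustration of this step, here is a sketch of a chunking call, assuming a Phi-3/Phi-3.5 Instruct endpoint deployed in Azure AI and the `azure-ai-inference` client; the endpoint, key, and prompt wording are placeholders rather than the exact prompt used in the notebooks:

```python
import os
from azure.ai.inference import ChatCompletionsClient
from azure.ai.inference.models import SystemMessage, UserMessage
from azure.core.credentials import AzureKeyCredential

client = ChatCompletionsClient(
    endpoint=os.environ["PHI3_ENDPOINT"],  # placeholder serverless endpoint
    credential=AzureKeyCredential(os.environ["PHI3_KEY"]),
)

CHUNKING_PROMPT = (
    "Split the following text into self-contained chunks. "
    "Keep each chunk coherent, keep the original wording, and do not summarize. "
    "Separate chunks with the line '-----'."
)

def chunk_text(text: str) -> list[str]:
    # Ask the SLM to segment the text while preserving the original wording.
    response = client.complete(
        messages=[
            SystemMessage(content=CHUNKING_PROMPT),
            UserMessage(content=text),
        ],
        temperature=0.0,
        max_tokens=2048,
    )
    return [c.strip() for c in response.choices[0].message.content.split("-----") if c.strip()]
```
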
3. Understand the extracted images through Phi-3/Phi-3.5 Vision.

***Samples***

- Chunking Image : [Click here](./prototype/03.chunking_imgs.ipynb)

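A sketch of how an extracted image could be described with a Phi-3.5 Vision deployment, again assuming the `azure-ai-inference` client; the endpoint, key, and image path are placeholders, and the exact content-item classes are an assumption about the SDK version in use:

```python
import os
from azure.ai.inference import ChatCompletionsClient
from azure.ai.inference.models import UserMessage, TextContentItem, ImageContentItem, ImageUrl
from azure.core.credentials import AzureKeyCredential

client = ChatCompletionsClient(
    endpoint=os.environ["PHI35_VISION_ENDPOINT"],  # placeholder endpoint
    credential=AzureKeyCredential(os.environ["PHI35_VISION_KEY"]),
)

def describe_image(image_path: str) -> str:
    # Ask the vision model for a textual description that can be chunked and indexed.
    response = client.complete(
        messages=[
            UserMessage(content=[
                TextContentItem(text="Describe the content of this image so it can be indexed for retrieval."),
                ImageContentItem(image_url=ImageUrl.load(image_file=image_path, image_format="png")),
            ]),
        ],
        max_tokens=512,
    )
    return response.choices[0].message.content
```
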
4. Reorganize and merge the text, images and tables to produce the final chunks.

5. Vectorize the chunks with the Azure OpenAI Service text-embedding-ada-002 model and save them to Azure AI Search.

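A sketch of this indexing step, assuming an Azure OpenAI `text-embedding-ada-002` deployment and an existing `slmindex` index with `chunking` and `chunking_vector` fields (matching the retrieval code later in this PR); the `id` key field, the example chunk, and the environment variables are placeholders:

```python
import os
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from openai import AzureOpenAI

aoai = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_KEY"],
    api_version="2024-02-01",
)
search = SearchClient(
    endpoint=os.environ["AZURE_SEARCH_ENDPOINT"],
    index_name="slmindex",
    credential=AzureKeyCredential(os.environ["AZURE_SEARCH_KEY"]),
)

chunks = ["Example chunk produced by the Phi-3 chunking step."]  # placeholder chunks

documents = []
for i, chunk in enumerate(chunks):
    embedding = aoai.embeddings.create(model="text-embedding-ada-002", input=chunk).data[0].embedding
    documents.append({
        "id": str(i),                  # assumed key field of the index
        "chunking": chunk,             # text field read by the retrieval tool
        "chunking_vector": embedding,  # vector field searched at query time
    })

search.upload_documents(documents=documents)
```
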
**This is the SLM chunking flow diagram**

Review comment: "this is the SLM trunking flow diagram"

![flow](./imgs/SLMFlow.png)

***Samples***

- SLM Chunking Flow : [Click here](./code/slm-chunking-flow/)

Review comment: Same, avoid "click here" links.

- SLM Chunking Chat : [Click here](./code/slm-chunking-chat/)

@@ -0,0 +1,37 @@
inputs:
  question:
    type: string
    default: What hardware support is required to run ONNX?
outputs:
  answer:
    type: string
    reference: ${aoai_gen_result.output}
nodes:
- name: question_embedding
  type: python
  source:
    type: package
    tool: promptflow.tools.embedding.embedding
  inputs:
    connection: AOAIConnection
    input: ${inputs.question}
- name: search_azure_ai_search
  type: python
  source:
    type: code
    path: search_azure_ai_search.py
  inputs:
    vector: ${question_embedding.output}
    aisearchconn: AzureAISearchConn
- name: aoai_gen_result
  type: llm
  source:
    type: code
    path: aoai_gen_result.jinja2
  inputs:
    deployment_name: GPT4OModel
    temperature: 0.6
    max_tokens: 500
    question: ${search_azure_ai_search.output}
  connection: AOAIConnection
  api: chat
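Once the connections (`AOAIConnection`, `AzureAISearchConn`) and the `GPT4OModel` deployment referenced above exist, the flow can be exercised locally. A minimal sketch using the promptflow SDK, assuming it is run from the flow directory and that the test output is returned keyed by the flow's output names:

```python
from promptflow.client import PFClient

pf = PFClient()

# Run the flow once with a single test question.
result = pf.test(
    flow=".",  # placeholder: path to the flow directory
    inputs={"question": "What hardware support is required to run ONNX?"},
)
print(result["answer"])
```
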
@@ -0,0 +1,33 @@
from azure.core.credentials import AzureKeyCredential
from azure.search.documents.models import VectorizedQuery
from azure.search.documents import SearchClient
from promptflow.connections import CognitiveSearchConnection
from promptflow.core import tool


# The inputs section will change based on the arguments of the tool function, after you save the code
# Adding type to arguments and return value will help the system show the types properly
# Please update the function name/signature per need
@tool
def search_azure_ai_search(vector: list, aisearchconn: CognitiveSearchConnection) -> str:
    # Connect to the Azure AI Search index that stores the SLM chunks.
    search_client = SearchClient(endpoint=aisearchconn.api_base, index_name='slmindex', credential=AzureKeyCredential(aisearchconn.api_key))

    # Vector query against the chunk embeddings, returning the 3 nearest neighbors.
    vector_query = VectorizedQuery(vector=vector, k_nearest_neighbors=3, fields="chunking_vector")

    results = search_client.search(
        search_text=None,
        vector_queries=[vector_query],
        select=["chunking"],
    )

    # Concatenate the retrieved chunks into a single context string.
    content = ''
    for result in results:
        content += f"{result['chunking']}"
        print(f"Chunking: {result['chunking']}")
        print(f"Score: {result['@search.score']}")

    return content
Review comment: Please remove Mac files.