Add SLM Chunking samples #217

Open
wants to merge 1 commit into base: main
kinfey commented Nov 3, 2024

Purpose

Add SLM chunking with Azure Search

Does this introduce a breaking change?

[x] No

Contributor

Please remove Mac files.


***Samples***

- Chunking Text : [Click here](./prototype/01.chunking_text.ipynb)

Contributor

Please put the links directly on the text "chunking text", "chunking table" and "chunking image" rather than on "Click here".

5. Complete quantization through the Azure OpenAI Service text-ada-embedding model and save it to AI Service.


**This is SLM Trunking flow diagram**

Contributor

"this is the SLM trunking flow diagram"


![slm](./imgs/SLMRAG.png)

This project focuses on implementing and exploring Chunking techniques. It's designed to enhance the efficiency and accuracy of data processing and retrieval in various applications.

Contributor

Chunking -> chunking


***Samples***

- SLM Chunking FLow : [Click here](./code/slm-chunking-flow/)

Contributor

Same, avoid "click here" links.

@@ -0,0 +1 @@
{"check_json.output":" ```json[ { \"chunking\": \"Concepts & resourcesAI Models & Deployments\\nWhat is an AI Model?\\nAn AI model (or machine learning model) is a program that has been trained on a set of data, to recognize certain types of patterns. Training the model defines an algorithm that the AI can use to reason over new data and make predictions.\\n🔖 | Learn more\\nWhat is a Large Language Model?\\nA large language model (LLM) is a type of AI that can process and produce natural language text, having been trained on massive amounts of data from diverse sources. A \\\"foundation model\\\" refers to a specific instance or version of an LLM. We'll cover these topics in more detail in the next lesson.\\n🔖 | Learn more\\n#10/31/24, 8:51 PM AI Models & Deployments | Learn how to interact with OpenAI models\\nhttps://microsoft.github.io/Workshop-Interact-with-OpenAI-models/ai-models/\\n1/4What are Embeddings?\\nAn embedding is a special data representation format that machine learning models and algorithms can use more easily. It provides an information-dense representation of the semantic meaning of text data as a vector of floating point numbers. The distance between embeddings in vector space correlates directly to the semantic similarity between their (original) text inputs.\\n🔖 | Learn more\\nEmbeddings help us use vector search methods for more efficient querying of text data. For example: it powers vector similarity search in databases like Azure Cosmos DB for MongoDB vCore. The recommended embedding model is currently text-embedding-ada-002.\\n🔖 | Learn more\\nWhat Model should I use?\\nThere are many considerations when choosing a model. Model pricing (by tokens, by artifacts), Model availability (by version, by region), Model performance (evaluation metrics), Model capability (features & parameters).\\nAs a general guide, we recommend the following: Start with gpt-35-turbo. This model is very economical and has good performance. 
It's commonly used for chat applications (such as OpenAI's ChatGPT) but can be used for a wide range of tasks beyond chat and conversation.\\nMove to gpt-35-turbo-16k, gpt-4 or gpt-4-32k if you need to generate more than 4,096 tokens, or need to support larger prompts. These models are more expensive and can be slower, and have limited availability, but they are the most powerful models available today.\\nConsider embeddings for tasks like search, clustering, recommendations and anomaly detection.\\n10/31/24, 8:51 PM AI Models & Deployments | Learn how to interact with OpenAI models\\nhttps://microsoft.github.io/Workshop-Interact-with-OpenAI-models/ai-models/\\n2/4Use DALL-E (Preview) for generating images from text prompts that the user provides, unlike previous models where the output is text (chat).\\nUse Whisper (Preview) for speech-to-text conversion or audio transcription. It's trained and optimized for transcribing audio files with English speech, though it can transcribe speech in other languages. The model output is in English text.\\n🔖 | Learn more\\nWhat is Azure OpenAI (AOAI)\\nOpenAI has a diverse set of language models that can \\\"generate\\\" different types of content (text, images, audio, code) from a user-provided natural language text input or \"prompt\". The Azure OpenAI Service provides access to these OpenAI models over a REST API.\\nCurrently available models include GPT-4, GPT-4 Turbo Preview, GPT-3.5, Embeddings, DALL-E (Preview) and Whisper (Preview). Azure OpenAI releases new versions regularly to keep pace with OpenAI updates on foundational models.\\n🔖 | Learn more.\\nWorkshop Model Deployments\\nOUR AZURE PLAYGROUND\\nIn this workshop we will:\\nuse the gpt-35-turbo model - for chat completions\\ndiscuss the gpt-4 model - for comparison\\nThe two main considerations to keep in mind are:\\nModel Versions - what models provide? what are the training cutoff & retirement dates?\\nQuotas and Limits - which regions are models available in? 
what are the model usage limits.\" }]```","chunking_table_with_phi3.output":[{"chunking":" The provided table outlines key characteristics of two language models, GPT-3.5 Turbo and GPT-4. Here's an analysis based on the available information:\n\n1. **Model (version):**\n - **gpt-3.5-turbo (0613):** This is the turbo version of GPT-3.5 with a specific commit (0613) indicating it has undergone some form of update or refinement.\n - **gpt-4 (0613):** Like the turbo version, this is the fourth iteration of the GPT model. It is similarly associated with the latest commit (0613) for this model.\n\n2. **Availability:**\n - **GPT-3.5 Turbo:** Available across 10 regions.\n - **GPT-4:** This version has slightly lower availability with access in 9 regions.\n\n3. **Request Limit:**\n - **GPT-3.5 Turbo:** Has a request limit of 4096 tokens. Tokens serve as a fundamental unit in these models, representing a fixed-size word-like unit; it allows better control and performance.\n - **GPT-4:** This model can handle a higher limit with 8192 tokens. The increased limit potentially improves its understanding and generation of text, per batch processing.\n\n4. **Training Data:**\n - Both models were trained up to September 2021; however, the information does not specify how the datasets were diversified or if any parts of their modeling were particularly improved compared to their predecessors. Further details on data analytics, lineage, or specific enhancement strategies would provide more insight.\n\nOverall, according to the outlined data, GPT-4 model offers improved capabilities with increased regional availability and token limit as compared to its predecessor, GPT-3.5 Turbo, implying that it may provide advanced functionalities, potentially at an increased cost. 
To derive further practical inferences, additional details about performance, cost, and user experience would be required."}],"chunking_img_with_phi3.output":[" {\"chunking\": \"The image shows a flowchart with two main steps: 'Training' and 'Evaluating'. The 'Training' step involves extracting patterns from data, which is represented by icons of a computer, a graph, and binary code. The 'Evaluating' step involves using the extracted patterns to predict results, represented by icons of a bar chart, a magnifying glass, and a computer monitor. There is an arrow indicating the flow from the 'Training' step to the 'Evaluating' step.\"}"]}

Contributor

What are those weird "🔖" characters?
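
One way to check is to look up the code point. The following is only a minimal sketch: `strip_symbols` is a hypothetical helper that is not part of this PR, and the sample string is adapted from the output above. It identifies the character and shows one possible way to drop such symbols from extracted text before chunking.

```python
import unicodedata

# The character that appears in the flow output, written as an escape.
marker = "\U0001F516"

# Look up its official Unicode name and general category.
name = unicodedata.name(marker)
category = unicodedata.category(marker)


def strip_symbols(text):
    """Hypothetical helper: drop characters in category 'So' (Symbol, other).

    That category covers this marker and most other pictographic symbols.
    """
    return "".join(ch for ch in text if unicodedata.category(ch) != "So")


cleaned = strip_symbols("What is an AI Model?\n\U0001F516 | Learn more")
print(name, category)
print(cleaned)
```

Filtering on the whole `So` category is a broad choice; if only this one marker should go, `text.replace(marker, "")` would be the narrower alternative.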


markdown_str = ""
for row in markdown_table:
    markdown_str += "| " + " | ".join(row) + " |\n"

Contributor

Please use os.linesep for line separators instead of hard-coded \n characters.
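
A possible shape for this change, as a sketch only: `rows_to_markdown` and the sample rows are hypothetical and not taken from the PR. The separator is a parameter because the Python documentation advises against using `os.linesep` as a line terminator when writing to files opened in text mode, where `"\n"` is already translated to the platform's line ending; `os.linesep` fits when the string is written in binary mode or handed to something that expects native line endings.

```python
import os


def rows_to_markdown(rows, line_sep="\n"):
    """Render rows (lists of strings) as Markdown table lines.

    Hypothetical helper: builds every line first, then joins them once,
    instead of appending to a string inside the loop.
    """
    lines = ["| " + " | ".join(row) + " |" for row in rows]
    return line_sep.join(lines) + line_sep


# Sample values for illustration only.
markdown_table = [["Model", "Regions"], ["---", "---"], ["gpt-4", "9"]]

# Default: "\n", suitable for text-mode file writes and in-memory strings.
markdown_str = rows_to_markdown(markdown_table)

# Native line endings, as requested in the review comment.
markdown_str_native = rows_to_markdown(markdown_table, line_sep=os.linesep)
print(markdown_str)
```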

]
)

# vector_search = VectorSearch(

Contributor

If it's necessary to keep those commented-out lines, please include a comment saying why. Otherwise, please remove.

2 participants