Add SLM Chunking samples #217

kinfey · 2024-11-03T12:15:50Z

Purpose

Add SLM chunking with Azure Search

Does this introduce a breaking change?

[x] No

bleroy · 2024-11-04T19:55:14Z

.DS_Store

Please remove Mac files.

bleroy · 2024-11-13T19:31:16Z

SLMPhi3ChunkingSkill/README.md

+
+    ***Samples***
+
+   - Chunking Text : [Click here](./prototype/01.chunking_text.ipynb)


Please put the links directly on "chunking text", "chunking table" and "chunking image".

bleroy · 2024-11-13T19:33:00Z

SLMPhi3ChunkingSkill/README.md

+5. Complete quantization through the Azure OpenAI Service text-ada-embedding model and save it to AI Service.
+
+
+**This is SLM Trunking flow diagram**


"this is the SLM trunking flow diagram"

bleroy · 2024-11-13T19:33:18Z

SLMPhi3ChunkingSkill/README.md

+
+![slm](./imgs/SLMRAG.png)
+
+This project focuses on implementing and exploring  Chunking techniques. It's designed to enhance the efficiency and accuracy of data processing and retrieval in various applications.


Chunking -> chunking

bleroy · 2024-11-13T19:33:36Z

SLMPhi3ChunkingSkill/README.md

+
+***Samples***
+
+   - SLM Chunking FLow : [Click here](./code/slm-chunking-flow/)


Same, avoid "click here" links.

bleroy · 2024-11-13T19:39:26Z

SLMPhi3ChunkingSkill/code/slm-chunking-flow/.promptflow/set_content.inputs.jsonl

@@ -0,0 +1 @@
+{"check_json.output":" ```json[  {    \"chunking\": \"Concepts & resourcesAI Models & Deployments\\nWhat is an AI Model?\\nAn AI model (or machine learning model) is a program that has been trained on a set of data, to recognize certain types of patterns. Training the model defines an algorithm that the AI can use to reason over new data and make predictions.\\n🔖 | Learn more\\nWhat is a Large Language Model?\\nA large language model (LLM) is a type of AI that can process and produce natural language text, having been trained on massive amounts of data from diverse sources. A \\\"foundation model\\\" refers to a specific instance or version of an LLM. We'll cover these topics in more detail in the next lesson.\\n🔖 | Learn more\\n#10/31/24, 8:51 PM AI Models & Deployments | Learn how to interact with OpenAI models\\nhttps://microsoft.github.io/Workshop-Interact-with-OpenAI-models/ai-models/\\n1/4What are Embeddings?\\nAn embedding is a special data representation format that machine learning models and algorithms can use more easily. It provides an information-dense representation of the semantic meaning of text data as a vector of floating point numbers. The distance between embeddings in vector space correlates directly to the semantic similarity between their (original) text inputs.\\n🔖 | Learn more\\nEmbeddings help us use vector search methods for more efficient querying of text data. For example: it powers vector similarity search in databases like Azure Cosmos DB for MongoDB vCore. The recommended embedding model is currently text-embedding-ada-002.\\n🔖 | Learn more\\nWhat Model should I use?\\nThere are many considerations when choosing a model. Model pricing (by tokens, by artifacts), Model availability (by version, by region), Model performance (evaluation metrics), Model capability (features & parameters).\\nAs a general guide, we recommend the following: Start with gpt-35-turbo. This model is very economical and has good performance. 
It's commonly used for chat applications (such as OpenAI's ChatGPT) but can be used for a wide range of tasks beyond chat and conversation.\\nMove to gpt-35-turbo-16k, gpt-4 or gpt-4-32k if you need to generate more than 4,096 tokens, or need to support larger prompts. These models are more expensive and can be slower, and have limited availability, but they are the most powerful models available today.\\nConsider embeddings for tasks like search, clustering, recommendations and anomaly detection.\\n10/31/24, 8:51 PM AI Models & Deployments | Learn how to interact with OpenAI models\\nhttps://microsoft.github.io/Workshop-Interact-with-OpenAI-models/ai-models/\\n2/4Use DALL-E (Preview) for generating images from text prompts that the user provides, unlike previous models where the output is text (chat).\\nUse Whisper (Preview) for speech-to-text conversion or audio transcription. It's trained and optimized for transcribing audio files with English speech, though it can transcribe speech in other languages. The model output is in English text.\\n🔖 | Learn more\\nWhat is Azure OpenAI (AOAI)\\nOpenAI has a diverse set of language models that can \\\"generate\\\" different types of content (text, images, audio, code) from a user-provided natural language text input or \"prompt\". The Azure OpenAI Service provides access to these OpenAI models over a REST API.\\nCurrently available models include GPT-4, GPT-4 Turbo Preview, GPT-3.5, Embeddings, DALL-E (Preview) and Whisper (Preview). Azure OpenAI releases new versions regularly to keep pace with OpenAI updates on foundational models.\\n🔖 | Learn more.\\nWorkshop Model Deployments\\nOUR AZURE PLAYGROUND\\nIn this workshop we will:\\nuse the gpt-35-turbo model - for chat completions\\ndiscuss the gpt-4 model - for comparison\\nThe two main considerations to keep in mind are:\\nModel Versions - what models provide? what are the training cutoff & retirement dates?\\nQuotas and Limits - which regions are models available in? 
what are the model usage limits.\"  }]```","chunking_table_with_phi3.output":[{"chunking":" The provided table outlines key characteristics of two language models, GPT-3.5 Turbo and GPT-4. Here's an analysis based on the available information:\n\n1. **Model (version):**\n   - **gpt-3.5-turbo (0613):** This is the turbo version of GPT-3.5 with a specific commit (0613) indicating it has undergone some form of update or refinement.\n   - **gpt-4 (0613):** Like the turbo version, this is the fourth iteration of the GPT model. It is similarly associated with the latest commit (0613) for this model.\n\n2. **Availability:**\n   - **GPT-3.5 Turbo:** Available across 10 regions.\n   - **GPT-4:** This version has slightly lower availability with access in 9 regions.\n\n3. **Request Limit:**\n   - **GPT-3.5 Turbo:** Has a request limit of 4096 tokens. Tokens serve as a fundamental unit in these models, representing a fixed-size word-like unit; it allows better control and performance.\n   - **GPT-4:** This model can handle a higher limit with 8192 tokens. The increased limit potentially improves its understanding and generation of text, per batch processing.\n\n4. **Training Data:**\n   - Both models were trained up to September 2021; however, the information does not specify how the datasets were diversified or if any parts of their modeling were particularly improved compared to their predecessors. Further details on data analytics, lineage, or specific enhancement strategies would provide more insight.\n\nOverall, according to the outlined data, GPT-4 model offers improved capabilities with increased regional availability and token limit as compared to its predecessor, GPT-3.5 Turbo, implying that it may provide advanced functionalities, potentially at an increased cost. 
To derive further practical inferences, additional details about performance, cost, and user experience would be required."}],"chunking_img_with_phi3.output":[" {\"chunking\": \"The image shows a flowchart with two main steps: 'Training' and 'Evaluating'. The 'Training' step involves extracting patterns from data, which is represented by icons of a computer, a graph, and binary code. The 'Evaluating' step involves using the extracted patterns to predict results, represented by icons of a bar chart, a magnifying glass, and a computer monitor. There is an arrow indicating the flow from the 'Training' step to the 'Evaluating' step.\"}"]}


What are those weird "🔖" characters?

bleroy · 2024-11-13T19:47:17Z

SLMPhi3ChunkingSkill/code/slm-chunking-flow/get_pdf_table.py

+
+    markdown_str = ""
+    for row in markdown_table:
+        markdown_str += "| " + " | ".join(row) + " |\n"


Please use os.linesep for line separators instead of hard-coded \n characters.
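
A minimal sketch of what that change could look like, assuming every cell in `markdown_table` is already a string. The helper name `rows_to_markdown`, its `sep` parameter, and the sample rows are hypothetical and not part of the PR:

```python
import os


def rows_to_markdown(rows, sep=os.linesep):
    """Render rows (lists of cell strings) as Markdown table lines joined by sep."""
    lines = ["| " + " | ".join(row) + " |" for row in rows]
    # Join once instead of repeated +=, and keep a trailing separator so the
    # result has the same shape as the original loop's output.
    return sep.join(lines) + sep if lines else ""


# Hypothetical sample data standing in for the table extracted from the PDF.
markdown_table = [["Model", "Regions"], ["gpt-4", "9"]]
markdown_str = rows_to_markdown(markdown_table)
```

One caveat worth weighing: the Python documentation advises against using `os.linesep` as a line terminator when writing to files opened in text mode, since `"\n"` is translated automatically there. If `markdown_str` ends up written that way, or is only passed to the model as prompt text, calling the helper with `sep="\n"` may be the safer choice.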

bleroy · 2024-11-13T19:50:40Z

SLMPhi3ChunkingSkill/code/slm-chunking-flow/save_to_ai_search_vectordb.py

+    ]
+  )
+
+#   vector_search = VectorSearch(


If it's necessary to keep those commented-out lines, please include a comment saying why. Otherwise, please remove.

Add SLM Chunking samples

6e4e254



