Apply automatic changes

impresso · Oct 29, 2024 · 01d0383
1 parent e5bcdc7
commit 01d0383

Showing 5 changed files with 141 additions and 101 deletions.
diff --git a/src/content/notebooks/language-identification-with-impresso-hf.mdx b/src/content/notebooks/language-identification-with-impresso-hf.mdx
@@ -3,8 +3,8 @@ title: Language Identification using Floret
 githubUrl: https://github.com/impresso/impresso-datalab-notebooks/blob/main/annotate/language-identification_ImpressoHF.ipynb
 authors:
   - impresso-team
-sha: d56a2704ce8f28179bd1c7ce592b0e55b6ac5866
-date: 2024-10-24T16:21:08Z
+sha: 7b580a8b2a279aa6244afe7ff5d6dbc54fb11ac4
+date: 2024-10-27T14:04:06Z
 googleColabUrl: https://colab.research.google.com/github/impresso/impresso-datalab-notebooks/blob/main/annotate/language-identification_ImpressoHF.ipynb
 links: []
 excerpt: This notebook demonstrates language identification using a pre-trained
@@ -17,34 +17,49 @@ excerpt: This notebook demonstrates language identification using a pre-trained
 ---
 
 {/* cell:0 cell_type:markdown */}
-<a href="https://colab.research.google.com/github/flipz357/impresso-datalab-notebooks/blob/main/3.1-language-identification/floret-language-identification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
+<a href="https://colab.research.google.com/github/flipz357/impresso-datalab-notebooks/blob/main/annotate/language-identification_ImpressoHF.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
 
 {/* cell:1 cell_type:markdown */}
+
+This notebook demonstrates how to use a pre-trained Floret language identification model downloaded from Hugging Face.
+We'll load the model, input some text, and predict the language of the text.
+
+## What is this notebook about?
+This notebook provides a hands-on demonstration of **language identification** (LID) using our Impresso LID model from Hugging Face. We will explore how to download and use this model to predict the language of Impresso-like text inputs. This notebook walks through the necessary steps to set up dependencies, load the model, and implement it for practical language identification tasks.
+
+## What will you learn in this notebook?
+By the end of this notebook, you will:
+- Understand how to install and configure the required libraries (`floret` and `huggingface_hub`).
+- Learn to load our trained Floret language identification model from Hugging Face.
+- Run the model to predict the dominant language (or the mix of languages) of a given text input.
+- Gain insight into the core functionality of language identification using machine learning models.
+
+{/* cell:2 cell_type:markdown */}
 ## 1. Install Dependencies
 
 First, we need to install `floret` and `huggingface_hub` to work with the Floret language identification model and Hugging Face.
 
 
-{/* cell:2 cell_type:code */}
+{/* cell:3 cell_type:code */}
 ```python
 !pip install floret
 !pip install huggingface_hub
 ```
 
-{/* cell:3 cell_type:markdown */}
+{/* cell:4 cell_type:markdown */}
 ## 2. Model Information
 
 In this example, we are using a language identification model hosted on the Hugging Face Hub: `impresso-project/impresso-floret-langident`.
 The model can predict the language of a given text of a reasonable length and supports the main impresso languages: German (de), French (fr), Luxembourgish (lb), Italian (it), and English (en).
 
 
-{/* cell:4 cell_type:markdown */}
+{/* cell:5 cell_type:markdown */}
 ## 3. Defining the FloretLangIdentifier Class
 
 This class downloads the Floret model from Hugging Face and loads it for prediction. We use `huggingface_hub` to download the model locally.
 
 
-{/* cell:5 cell_type:code */}
+{/* cell:6 cell_type:code */}
 ```python
 from huggingface_hub import hf_hub_download
 import floret
@@ -130,16 +145,16 @@ class FloretLangIdentifier:
             return None
 ```
 
-{/* cell:6 cell_type:markdown */}
+{/* cell:7 cell_type:markdown */}
 ## 4. Using the Model for Prediction
 
 Now that the model is loaded, you can input your own text and predict the language.
 
 
-{/* cell:7 cell_type:markdown */}
+{/* cell:8 cell_type:markdown */}
 ### 4.1 Predict the main language of a document
 
-{/* cell:8 cell_type:code */}
+{/* cell:9 cell_type:code */}
 ```python
 # Define the repository and model file
 repo_id = "impresso-project/impresso-floret-langident"
@@ -156,11 +171,11 @@ result = model.predict_language(text)
 print("Language:", result)
 ```
 
-{/* cell:9 cell_type:markdown */}
+{/* cell:10 cell_type:markdown */}
 ### 4.2 Predict the language mix of a document
 
 
-{/* cell:10 cell_type:code */}
+{/* cell:11 cell_type:code */}
 ```python
 # Multi-output for predicting mixed-language documents
 # Example text for prediction
@@ -171,29 +186,43 @@ result = model.predict_language_mix(text)
 print("Language mix:", result)
 ```
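
The body of `predict_language_mix` is not visible in this diff, so the following is only a minimal sketch of how raw classifier output could be turned into a readable language mix. It assumes fastText-style output, i.e. a sequence of labels carrying a `__label__` prefix plus a parallel sequence of probabilities. The helper name `to_language_mix`, the `min_prob` threshold, and the example scores are hypothetical and not part of the notebook.

```python
from typing import Dict, Sequence

# Assumption: labels follow the fastText convention "__label__<lang>".
LABEL_PREFIX = "__label__"


def to_language_mix(
    labels: Sequence[str], probs: Sequence[float], min_prob: float = 0.05
) -> Dict[str, float]:
    """Map raw labels and probabilities to {language_code: probability}.

    Languages scoring below `min_prob` are dropped, and the result is
    ordered from most to least probable.
    """
    mix = {}
    for label, prob in zip(labels, probs):
        # Strip the prefix if present, otherwise keep the label unchanged.
        lang = label[len(LABEL_PREFIX):] if label.startswith(LABEL_PREFIX) else label
        if prob >= min_prob:
            mix[lang] = round(float(prob), 3)
    return dict(sorted(mix.items(), key=lambda item: item[1], reverse=True))


# Made-up scores for illustration only, not real model output.
example = to_language_mix(
    ["__label__fr", "__label__lb", "__label__de"], [0.55, 0.42, 0.03]
)
print(example)
```

Dropping low-probability languages keeps the reported mix short for documents that are dominated by one or two languages; the threshold is a judgment call and would need tuning on real data.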
 
-{/* cell:11 cell_type:markdown */}
-### 4.3 Interactive mode
+{/* cell:12 cell_type:markdown */}
+### 4.3 Predict the language mix of an impresso document
+
 
-{/* cell:12 cell_type:code */}
+{/* cell:13 cell_type:code */}
+```python
+# source: https://impresso-project.ch/app/issue/onsjongen-1945-03-03-a/view?p=1&articleId=i0001&text=1
+text = " Lëtzeburger Zaldoten traine'èren an England Soldats luxembourgeois à l’entraînement en Angleterre"
+
+# Predict the language
+result = model.predict_language_mix(text)
+print("Language mix:", result)
+```
+
+{/* cell:14 cell_type:markdown */}
+### 4.4 Interactive mode
+
+{/* cell:15 cell_type:code */}
 ```python
 # Interactive text input
 text = input("Enter a sentence for language identification: ")
 result = model.predict_language_mix(text)
 print("Prediction Result:", result)
 ```
 
-{/* cell:13 cell_type:markdown */}
-## 4. Why is Language identification important? An example
+{/* cell:16 cell_type:markdown */}
+## 5. Why is Language identification important? An example
 
 Many NLP models are trained on data from certain languages. For applying any further NLP processing, we often need to know the language.
 
 Let us look at a concrete example: say that we want to count the nouns in a text. For this, we load an NLP processor from the popular spaCy library, which (among other things) tokenizes the text and tags each word with a part-of-speech tag.
 
 
-{/* cell:14 cell_type:markdown */}
-### 4.1 Build a simple Noun counter class
+{/* cell:17 cell_type:markdown */}
+### 5.1 Build a simple Noun counter class
 
-{/* cell:15 cell_type:code */}
+{/* cell:18 cell_type:code */}
 ```python
 class NounCounter:
 
@@ -224,10 +253,10 @@ class NounCounter:
         return noun_count
 ```
 
-{/* cell:16 cell_type:markdown */}
-### 4.2 Noun counter: A first naive test
+{/* cell:19 cell_type:markdown */}
+### 5.2 Noun counter: A first naive test
 
-{/* cell:17 cell_type:code */}
+{/* cell:20 cell_type:code */}
 ```python
 # Example text for prediction
 text = "Das ist ein Testdokument. Ein Mann geht mit einem Hund im Park spazieren."
@@ -245,15 +274,15 @@ counter = NounCounter(nlp)
 print("Text: \"{}\"\nNoun-count: {}".format(text, counter.count_nouns(text)))
 ```
 
-{/* cell:18 cell_type:markdown */}
-### 4.3 Noun counter: A second test
+{/* cell:21 cell_type:markdown */}
+### 5.3 Noun counter: A second test
 
-{/* cell:19 cell_type:markdown */}
+{/* cell:22 cell_type:markdown */}
 Now let us assume that we know the language of the input document: German.
 
 This lets us load a default German spaCy model.
 
-{/* cell:20 cell_type:code */}
+{/* cell:23 cell_type:code */}
 ```python
 # Need to download the German model
 spacy.cli.download("de_core_news_sm")
@@ -268,14 +297,14 @@ counter = NounCounter(nlp)
 print("Text: \"{}\"\nNoun-count: {}".format(text, counter.count_nouns(text)))
 ```
 
-{/* cell:21 cell_type:markdown */}
-### 4.4 Noun counter: Combining our knowledge
+{/* cell:24 cell_type:markdown */}
+### 5.4 Noun counter: Combining our knowledge
 
 
-{/* cell:22 cell_type:markdown */}
+{/* cell:25 cell_type:markdown */}
 We use our insights to build a language-informed spacy loader that uses our language identifier!
 
-{/* cell:23 cell_type:code */}
+{/* cell:26 cell_type:code */}
 ```python
 class LanguageAwareSpacyLoader:
 
@@ -322,10 +351,10 @@ class LanguageAwareSpacyLoader:
 
 ```
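
Most of the `LanguageAwareSpacyLoader` body is elided in this diff. As a rough sketch of the idea behind it, the lookup from a predicted language code to a spaCy pipeline name could look as follows. The mapping and the helper name `spacy_model_for` are assumptions made for illustration; the package names are the small spaCy pipelines for these languages, and Luxembourgish is left out on the assumption that no trained pipeline is available for it.

```python
from typing import Optional

# Assumption: small spaCy pipelines are sufficient for part-of-speech tagging.
SPACY_MODELS = {
    "de": "de_core_news_sm",
    "fr": "fr_core_news_sm",
    "it": "it_core_news_sm",
    "en": "en_core_web_sm",
}


def spacy_model_for(lang: str) -> Optional[str]:
    """Return the spaCy pipeline name for a language code.

    Returns None when no pipeline is configured, so the caller can decide
    whether to skip the document or fall back to another strategy.
    """
    return SPACY_MODELS.get(lang.lower())


for code in ["de", "FR", "lb"]:
    print(code, "->", spacy_model_for(code))
```

Returning `None` instead of silently picking a default model makes the unsupported case explicit, which matters here because a model for the wrong language would produce misleading noun counts.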
 
-{/* cell:24 cell_type:markdown */}
+{/* cell:27 cell_type:markdown */}
 Let's try it
 
-{/* cell:25 cell_type:code */}
+{/* cell:28 cell_type:code */}
 ```python
 # We initialize our language aware spacy loader
 loader = LanguageAwareSpacyLoader(model)
@@ -340,20 +369,20 @@ counter = NounCounter(nlp)
 print("Noun-count: {}".format(counter.count_nouns(text)))
 ```
 
-{/* cell:26 cell_type:markdown */}
+{/* cell:29 cell_type:markdown */}
 Let's start the interactive mode again. Input any text in some language, and the two-step model (lang-id + nlp) will count its nouns.
 
 
-{/* cell:27 cell_type:code */}
+{/* cell:30 cell_type:code */}
 ```python
 text = input("Enter a sentence for Noun counting: ")
 nlp = loader.load(text)
 counter = NounCounter(nlp)
 print("Noun-count: {}".format(counter.count_nouns(text)))
 ```
 
-{/* cell:28 cell_type:markdown */}
-## 5. Summary and Next Steps
+{/* cell:31 cell_type:markdown */}
+## 6. Summary and Next Steps
 
 In this notebook, we used a pre-trained Floret language identification model to predict the language of input text. You can modify the input or explore other models from Hugging Face.
 

diff --git a/src/content/notebooks/ne-processing-with-impresso-api.mdx b/src/content/notebooks/ne-processing-with-impresso-api.mdx
@@ -3,8 +3,8 @@ title: Named Entity Processing with Impresso Models through the Impresso API
 githubUrl: https://github.com/impresso/impresso-datalab-notebooks/blob/main/annotate/NE-processing_ImpressoAPI.ipynb
 authors:
   - impresso-team
-sha: 44a3c9f14c74807de3722878701d97ed71fa3e05
-date: 2024-10-25T14:18:01Z
+sha: cc20b1b70db4da2aea4042c0e8d82a52e6ffb762
+date: 2024-10-27T13:47:15Z
 googleColabUrl: https://colab.research.google.com/github/impresso/impresso-datalab-notebooks/blob/main/annotate/NE-processing_ImpressoAPI.ipynb
 links:
   - href: https://github.com/impresso/impresso-datalab-notebooks/blob/main/annotate/NE-processing_ImpressoHF.ipynb
@@ -19,43 +19,49 @@ seealso:
 
 {/* cell:0 cell_type:markdown */}
 
+<a target="_blank" href="https://colab.research.google.com/github/impresso/impresso-datalab-notebooks/blob/main/annotate/NE-processing_ImpressoAPI.ipynb">
+  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
+</a>
+
+{/* cell:1 cell_type:markdown */}
 ## What is this notebook about?
 
-This notebook is similar to the [NE-processing_ImpressoHF](https://github.com/impresso/impresso-datalab-notebooks/blob/main/annotate/NE-processing_ImpressoHF.ipynb) one, except that instead of loading the model from Hugging Face and executing them locally (or on Colab), here we use the annotation functionalities provided by the Impresso API, using the Impresso Python Library. Behind the scene the same models are used.
+This notebook is similar to the [NE-processing_ImpressoHF](https://github.com/impresso/impresso-datalab-notebooks/blob/main/annotate/NE-processing_ImpressoHF.ipynb) one, except that instead of loading the model from Hugging Face and executing them locally (or on Colab), here we use the annotation functionalities provided by the Impresso API, using the Impresso Python Library. Behind the scene the same models are used. 
 
 For more information on the models, please refer to the [NE-processing_ImpressoHF](https://github.com/impresso/impresso-datalab-notebooks/blob/main/annotate/NE-processing_ImpressoHF.ipynb) notebook (we advise starting with it).
 
 For an introduction to the Impresso Python Library, please refer to the [basics_ImpressoAPI](https://github.com/impresso/impresso-datalab-notebooks/blob/main/starter/basics_ImpressoAPI.ipynb).
 
 ## What will you learn in this notebook?
-
 By the end of this notebook, you will know how to call the NER and EL Impresso annotation services through the Impresso API, using the Impresso Python Library
 
-{/* cell:1 cell_type:code */}
-
+{/* cell:2 cell_type:code */}
 ```python
 !pip install --upgrade --force-reinstall impresso
 from impresso import version
 print(version)
 ```
 
-{/* cell:2 cell_type:code */}
-
+{/* cell:3 cell_type:code */}
 ```python
 from impresso import connect
 impresso_session = connect()
 ```
 
-{/* cell:3 cell_type:markdown */}
-
+{/* cell:4 cell_type:markdown */}
 ## Named entity recognition
 
-{/* cell:4 cell_type:code */}
-
+{/* cell:5 cell_type:code */}
 ```python
-text = """
-Hugging Face will offer the product through Amazon and Google's cloud computing services for $1 per hour and on Digital Ocean, a specialty cloud computing company. Companies will also be able to download the Hugging Face offering to run in their own data centers.
-"""
+# We define some test input
+text = """In the year 1789, King Louis XVI, ruler of France, convened the Estates-General at the Palace of Versailles, 
+        where Marie Antoinette, the Queen of France, alongside Maximilien Robespierre, a leading member of the National Assembly, 
+        debated with Jean-Jacques Rousseau, the famous philosopher, and Charles de Talleyrand, the Bishop of Autun, 
+        regarding the future of the French monarchy. At the same time, across the Atlantic in Philadelphia, 
+        George Washington, the first President of the United States, and Thomas Jefferson, the nation's Secretary of State, 
+        were drafting policies for the newly established American government following the signing of the Constitution."""
+
+print(text)
 
 result = impresso_session.tools.ner(
     text=text
@@ -64,46 +70,39 @@ result = impresso_session.tools.ner(
 result.df.tail(10)
 ```
 
-{/* cell:5 cell_type:markdown */}
-
+{/* cell:6 cell_type:markdown */}
 ## Named entity linking
 
-{/* cell:6 cell_type:code */}
-
+{/* cell:7 cell_type:code */}
 ```python
-text = """
-Hugging Face will offer the product through [START] Amazon [END] and Google's cloud computing services for $1 per hour and on Digital Ocean, a specialty cloud computing company. Companies will also be able to download the Hugging Face offering to run in their own data centers.
-"""
-result = impresso_session.tools.nel(
-    text=text
-)
-result
-```
+# We define some test input
+text_with_markers = """In the year 1789, King Louis XVI, ruler of France, convened the Estates-General at the Palace of Versailles, 
+                where [START] Marie Antoinette, the Queen of France [END], alongside Maximilien Robespierre, a leading member of the National Assembly, 
+                debated with Jean-Jacques Rousseau, the famous philosopher, and Charles de Talleyrand, the Bishop of Autun, 
+                regarding the future of the French monarchy. At the same time, across the Atlantic in Philadelphia, 
+                George Washington, the first President of the United States, and Thomas Jefferson, the nation's Secretary of State, 
+                were drafting policies for the newly established American government following the signing of the Constitution."""
 
-{/* cell:7 cell_type:code */}
+print(text_with_markers)
 
-```python
-text = """
- Hugging Face proposera le produit via les services de cloud computing d'[START] Amazon [END] et de Google pour 1 dollar par heure, ainsi que sur Digital Ocean, une entreprise spécialisée dans le cloud computing. Les entreprises pourront également télécharger l'offre de Hugging Face pour l'exécuter dans leurs propres centres de données.
- """
 result = impresso_session.tools.nel(
-     text=text
+    text=text_with_markers
 )
-result.df
+result
 ```
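
The linking input above marks the target mention by hand with `[START]` and `[END]`. A small helper can insert these markers programmatically. This is a minimal sketch: the function name `mark_entity` is hypothetical, it marks only the first occurrence of the mention, and it assumes the spacing used in the example above (one space inside each marker).

```python
def mark_entity(
    text: str, mention: str, start_tag: str = "[START]", end_tag: str = "[END]"
) -> str:
    """Wrap the first occurrence of `mention` in `text` with entity markers."""
    index = text.find(mention)
    if index == -1:
        raise ValueError(f"Mention not found in text: {mention!r}")
    end = index + len(mention)
    # Keep the text before and after the mention unchanged.
    return f"{text[:index]}{start_tag} {mention} {end_tag}{text[end:]}"


marked = mark_entity(
    "King Louis XVI met Marie Antoinette at Versailles.", "Marie Antoinette"
)
print(marked)
```

The returned string can then be passed as the `text` argument in the same way as `text_with_markers`.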
 
 {/* cell:8 cell_type:markdown */}
-
 ## Named entity processing
 
 {/* cell:9 cell_type:code */}
-
 ```python
-text = """
-Hugging Face will offer the product through Amazon and Google's cloud computing services for $1 per hour and on Digital Ocean, a specialty cloud computing company. Companies will also be able to download the Hugging Face offering to run in their own data centers.
-"""
 result = impresso_session.tools.ner_nel(
     text=text
 )
 result.df
 ```
+
+{/* cell:10 cell_type:code */}
+```python
+
+```