Apply automatic changes
danieleguido authored and github-actions[bot] committed Oct 29, 2024
1 parent e5bcdc7 commit 01d0383
Showing 5 changed files with 141 additions and 101 deletions.
103 changes: 66 additions & 37 deletions src/content/notebooks/language-identification-with-impresso-hf.mdx
@@ -3,8 +3,8 @@ title: Language Identification using Floret
githubUrl: https://github.com/impresso/impresso-datalab-notebooks/blob/main/annotate/language-identification_ImpressoHF.ipynb
authors:
- impresso-team
sha: d56a2704ce8f28179bd1c7ce592b0e55b6ac5866
date: 2024-10-24T16:21:08Z
sha: 7b580a8b2a279aa6244afe7ff5d6dbc54fb11ac4
date: 2024-10-27T14:04:06Z
googleColabUrl: https://colab.research.google.com/github/impresso/impresso-datalab-notebooks/blob/main/annotate/language-identification_ImpressoHF.ipynb
links: []
excerpt: This notebook demonstrates language identification using a pre-trained
@@ -17,34 +17,49 @@ excerpt: This notebook demonstrates language identification using a pre-trained
---

{/* cell:0 cell_type:markdown */}
<a href="https://colab.research.google.com/github/flipz357/impresso-datalab-notebooks/blob/main/3.1-language-identification/floret-language-identification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
<a href="https://colab.research.google.com/github/flipz357/impresso-datalab-notebooks/blob/main/annotate/language-identification_ImpressoHF.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

{/* cell:1 cell_type:markdown */}

This notebook demonstrates how to use a pre-trained Floret language identification model downloaded from Hugging Face.
We'll load the model, input some text, and predict the language of the text.

## What is this notebook about?
This notebook provides a hands-on demonstration of **language identification** (LID) using our Impresso LID model from Hugging Face. We will explore how to download and use this model to predict the language of Impresso-like text inputs. This notebook walks through the necessary steps to set up dependencies, load the model, and implement it for practical language identification tasks.

## What will you learn in this notebook?
By the end of this notebook, you will:
- Understand how to install and configure the required libraries (`floret` and `huggingface_hub`).
- Learn to load our trained Floret language identification model from Hugging Face.
- Run the model to predict the dominant language (or the mix of languages) of a given text input.
- Gain insight into the core functionality of language identification using machine learning models.

{/* cell:2 cell_type:markdown */}
## 1. Install Dependencies

First, we need to install `floret` and `huggingface_hub` to work with the Floret language identification model and Hugging Face.


{/* cell:2 cell_type:code */}
{/* cell:3 cell_type:code */}
```python
!pip install floret
!pip install huggingface_hub
```

{/* cell:3 cell_type:markdown */}
{/* cell:4 cell_type:markdown */}
## 2. Model Information

In this example, we are using a language identification model hosted on the Hugging Face Hub: `impresso-project/impresso-floret-langident`.
The model can predict the language of a given text of a reasonable length and supports the main impresso languages: German (de), French (fr), Luxembourgish (lb), Italian (it), and English (en).


{/* cell:4 cell_type:markdown */}
{/* cell:5 cell_type:markdown */}
## 3. Defining the FloretLangIdentifier Class

This class downloads the Floret model from Hugging Face and loads it for prediction. We use `huggingface_hub` to download the model locally.


{/* cell:5 cell_type:code */}
{/* cell:6 cell_type:code */}
```python
from huggingface_hub import hf_hub_download
import floret
@@ -130,16 +145,16 @@ class FloretLangIdentifier:
return None
```
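
The class body is elided in this diff, so the following is only a minimal sketch of one step such a class typically needs: turning raw classifier output into readable language codes. It assumes that floret, like fastText, returns a tuple of labels prefixed with `__label__` together with their probabilities. The helper name `parse_prediction` and the sample values are hypothetical and are not taken from the notebook.

```python
from typing import List, Sequence, Tuple

# fastText-style label prefix (assumed to apply to this floret model as well)
LABEL_PREFIX = "__label__"


def parse_prediction(labels: Sequence[str], probs: Sequence[float]) -> List[Tuple[str, float]]:
    """Turn raw (labels, probabilities) into (language_code, probability) pairs."""
    pairs = []
    for label, prob in zip(labels, probs):
        # Strip the prefix if present; otherwise keep the label unchanged
        code = label[len(LABEL_PREFIX):] if label.startswith(LABEL_PREFIX) else label
        pairs.append((code, round(float(prob), 3)))
    return pairs


# Hypothetical raw output, shaped like a top-2 prediction
raw_labels = ("__label__fr", "__label__de")
raw_probs = (0.93, 0.05)
print(parse_prediction(raw_labels, raw_probs))
```

Keeping the parsing separate from the model call makes it easy to test without downloading the model.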

{/* cell:6 cell_type:markdown */}
{/* cell:7 cell_type:markdown */}
## 4. Using the Model for Prediction

Now that the model is loaded, you can input your own text and predict the language.


{/* cell:7 cell_type:markdown */}
{/* cell:8 cell_type:markdown */}
### 4.1 Predict the main language of a document

{/* cell:8 cell_type:code */}
{/* cell:9 cell_type:code */}
```python
# Define the repository and model file
repo_id = "impresso-project/impresso-floret-langident"
@@ -156,11 +171,11 @@ result = model.predict_language(text)
print("Language:", result)
```

{/* cell:9 cell_type:markdown */}
{/* cell:10 cell_type:markdown */}
### 4.2 Predict the language mix of a document


{/* cell:10 cell_type:code */}
{/* cell:11 cell_type:code */}
```python
# Multi-output for predicting mixed-language documents
# Example text for prediction
@@ -171,29 +186,43 @@ result = model.predict_language_mix(text)
print("Language mix:", result)
```

{/* cell:11 cell_type:markdown */}
### 4.3 Interactive mode
{/* cell:12 cell_type:markdown */}
### 4.3 Predict the language mix of an impresso document


{/* cell:12 cell_type:code */}
{/* cell:13 cell_type:code */}
```python
# source: https://impresso-project.ch/app/issue/onsjongen-1945-03-03-a/view?p=1&articleId=i0001&text=1
text = " Lëtzeburger Zaldoten traine'èren an England Soldats luxembourgeois à l’entraînement en Angleterre"

# Predict the language
result = model.predict_language_mix(text)
print("Language mix:", result)
```
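
A language mix is easier to use once low-probability languages are filtered out. The sketch below assumes the mix is available as a list of `(language_code, probability)` pairs; the diff does not show the actual return format of `predict_language_mix`, so the helper `significant_languages`, the threshold of 0.2, and the sample values are all hypothetical.

```python
from typing import Iterable, List, Tuple


def significant_languages(mix: Iterable[Tuple[str, float]], threshold: float = 0.2) -> List[str]:
    """Return language codes whose probability is at least `threshold`, most probable first."""
    kept = [(lang, prob) for lang, prob in mix if prob >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [lang for lang, _ in kept]


# Hypothetical mix for a headline that is part Luxembourgish, part French
example_mix = [("fr", 0.41), ("lb", 0.52), ("de", 0.07)]
print(significant_languages(example_mix))
```

The threshold is a judgment call: a lower value keeps more candidate languages for short or noisy texts, while a higher value returns only the dominant ones.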

{/* cell:14 cell_type:markdown */}
### 4.4 Interactive mode

{/* cell:15 cell_type:code */}
```python
# Interactive text input
text = input("Enter a sentence for language identification: ")
result = model.predict_language_mix(text)
print("Prediction Result:", result)
```

{/* cell:13 cell_type:markdown */}
## 4. Why is language identification important? An example
{/* cell:16 cell_type:markdown */}
## 5. Why is language identification important? An example

Many NLP models are trained on data from specific languages, so before applying any further NLP processing we often need to know the language of the text.

Let us look at a concrete example. Say that we want to count the nouns in a text. For this we load an NLP pipeline from the popular spaCy library, which (among other things) splits the text into tokens and tags each word with a part-of-speech tag.


{/* cell:14 cell_type:markdown */}
### 4.1 Build a simple Noun counter class
{/* cell:17 cell_type:markdown */}
### 5.1 Build a simple Noun counter class

{/* cell:15 cell_type:code */}
{/* cell:18 cell_type:code */}
```python
class NounCounter:

@@ -224,10 +253,10 @@ class NounCounter:
return noun_count
```

{/* cell:16 cell_type:markdown */}
### 4.2 Noun counter: A first naive test
{/* cell:19 cell_type:markdown */}
### 5.2 Noun counter: A first naive test

{/* cell:17 cell_type:code */}
{/* cell:20 cell_type:code */}
```python
# Example text for prediction
text = "Das ist ein Testdokument. Ein Mann geht mit einem Hund im Park spazieren."
@@ -245,15 +274,15 @@ counter = NounCounter(nlp)
print("Text: \"{}\"\nNoun-count: {}".format(text, counter.count_nouns(text)))
```

{/* cell:18 cell_type:markdown */}
### 4.3 Noun counter: A second test
{/* cell:21 cell_type:markdown */}
### 5.3 Noun counter: A second test

{/* cell:19 cell_type:markdown */}
{/* cell:22 cell_type:markdown */}
Now let us assume that we know the language of the input document: German.

This lets us load a default German spaCy model.

{/* cell:20 cell_type:code */}
{/* cell:23 cell_type:code */}
```python
# Need to download the German model
spacy.cli.download("de_core_news_sm")
@@ -268,14 +297,14 @@ counter = NounCounter(nlp)
print("Text: \"{}\"\nNoun-count: {}".format(text, counter.count_nouns(text)))
```

{/* cell:21 cell_type:markdown */}
### 4.4 Noun counter: Combining our knowledge
{/* cell:24 cell_type:markdown */}
### 5.4 Noun counter: Combining our knowledge


{/* cell:22 cell_type:markdown */}
{/* cell:25 cell_type:markdown */}
We use our insights to build a language-informed spaCy loader that uses our language identifier!

{/* cell:23 cell_type:code */}
{/* cell:26 cell_type:code */}
```python
class LanguageAwareSpacyLoader:

@@ -322,10 +351,10 @@ class LanguageAwareSpacyLoader:

```
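
The loader body is elided in this diff. At its core, a loader of this kind needs a mapping from a predicted language code to a spaCy pipeline name. The sketch below shows one possible mapping using spaCy's small pipelines; the names `SPACY_MODELS` and `spacy_model_for` are hypothetical, and returning `None` for Luxembourgish reflects the assumption that no trained spaCy pipeline is published for it.

```python
from typing import Optional

# Small trained spaCy pipelines for four of the languages the LID model covers
SPACY_MODELS = {
    "de": "de_core_news_sm",
    "fr": "fr_core_news_sm",
    "it": "it_core_news_sm",
    "en": "en_core_web_sm",
}


def spacy_model_for(lang: Optional[str]) -> Optional[str]:
    """Return the spaCy pipeline name for a language code, or None if unsupported."""
    if not lang:
        return None
    return SPACY_MODELS.get(lang)


print(spacy_model_for("de"))
print(spacy_model_for("lb"))
```

Returning `None` rather than raising lets the caller decide how to handle unsupported languages, for example by skipping the document.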

{/* cell:24 cell_type:markdown */}
{/* cell:27 cell_type:markdown */}
Let's try it

{/* cell:25 cell_type:code */}
{/* cell:28 cell_type:code */}
```python
# We initialize our language aware spacy loader
loader = LanguageAwareSpacyLoader(model)
@@ -340,20 +369,20 @@ counter = NounCounter(nlp)
print("Noun-count: {}".format(counter.count_nouns(text)))
```

{/* cell:26 cell_type:markdown */}
{/* cell:29 cell_type:markdown */}
Let's start the interactive mode again. Input any text in some language, and the two-step model (lang-id + nlp) will count its nouns.


{/* cell:27 cell_type:code */}
{/* cell:30 cell_type:code */}
```python
text = input("Enter a sentence for Noun counting: ")
nlp = loader.load(text)
counter = NounCounter(nlp)
print("Noun-count: {}".format(counter.count_nouns(text)))
```

{/* cell:28 cell_type:markdown */}
## 5. Summary and Next Steps
{/* cell:31 cell_type:markdown */}
## 6. Summary and Next Steps

In this notebook, we used a pre-trained Floret language identification model to predict the language of input text. You can modify the input or explore other models from Hugging Face.

77 changes: 38 additions & 39 deletions src/content/notebooks/ne-processing-with-impresso-api.mdx
@@ -3,8 +3,8 @@ title: Named Entity Processing with Impresso Models through the Impresso API
githubUrl: https://github.com/impresso/impresso-datalab-notebooks/blob/main/annotate/NE-processing_ImpressoAPI.ipynb
authors:
- impresso-team
sha: 44a3c9f14c74807de3722878701d97ed71fa3e05
date: 2024-10-25T14:18:01Z
sha: cc20b1b70db4da2aea4042c0e8d82a52e6ffb762
date: 2024-10-27T13:47:15Z
googleColabUrl: https://colab.research.google.com/github/impresso/impresso-datalab-notebooks/blob/main/annotate/NE-processing_ImpressoAPI.ipynb
links:
- href: https://github.com/impresso/impresso-datalab-notebooks/blob/main/annotate/NE-processing_ImpressoHF.ipynb
@@ -19,43 +19,49 @@ seealso:

{/* cell:0 cell_type:markdown */}

<a target="_blank" href="https://colab.research.google.com/github/impresso/impresso-datalab-notebooks/blob/main/annotate/NE-processing_ImpressoAPI.ipynb">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

{/* cell:1 cell_type:markdown */}
## What is this notebook about?

This notebook is similar to the [NE-processing_ImpressoHF](https://github.com/impresso/impresso-datalab-notebooks/blob/main/annotate/NE-processing_ImpressoHF.ipynb) one, except that instead of loading the models from Hugging Face and executing them locally (or on Colab), here we use the annotation functionalities provided by the Impresso API, using the Impresso Python Library. Behind the scenes, the same models are used.
This notebook is similar to the [NE-processing_ImpressoHF](https://github.com/impresso/impresso-datalab-notebooks/blob/main/annotate/NE-processing_ImpressoHF.ipynb) one, except that instead of loading the models from Hugging Face and executing them locally (or on Colab), here we use the annotation functionalities provided by the Impresso API, using the Impresso Python Library. Behind the scenes, the same models are used.

For more information on the models, please refer to the [NE-processing_ImpressoHF](https://github.com/impresso/impresso-datalab-notebooks/blob/main/annotate/NE-processing_ImpressoHF.ipynb) notebook (we advise starting with it).

For an introduction to the Impresso Python Library, please refer to the [basics_ImpressoAPI](https://github.com/impresso/impresso-datalab-notebooks/blob/main/starter/basics_ImpressoAPI.ipynb).

## What will you learn in this notebook?

By the end of this notebook, you will know how to call the NER and EL Impresso annotation services through the Impresso API, using the Impresso Python Library.

{/* cell:1 cell_type:code */}

{/* cell:2 cell_type:code */}
```python
!pip install --upgrade --force-reinstall impresso
from impresso import version
print(version)
```

{/* cell:2 cell_type:code */}

{/* cell:3 cell_type:code */}
```python
from impresso import connect
impresso_session = connect()
```

{/* cell:3 cell_type:markdown */}

{/* cell:4 cell_type:markdown */}
## Named entity recognition

{/* cell:4 cell_type:code */}

{/* cell:5 cell_type:code */}
```python
text = """
Hugging Face will offer the product through Amazon and Google's cloud computing services for $1 per hour and on Digital Ocean, a specialty cloud computing company. Companies will also be able to download the Hugging Face offering to run in their own data centers.
"""
# We define some test input
text = """In the year 1789, King Louis XVI, ruler of France, convened the Estates-General at the Palace of Versailles,
where Marie Antoinette, the Queen of France, alongside Maximilien Robespierre, a leading member of the National Assembly,
debated with Jean-Jacques Rousseau, the famous philosopher, and Charles de Talleyrand, the Bishop of Autun,
regarding the future of the French monarchy. At the same time, across the Atlantic in Philadelphia,
George Washington, the first President of the United States, and Thomas Jefferson, the nation's Secretary of State,
were drafting policies for the newly established American government following the signing of the Constitution."""

print(text)

result = impresso_session.tools.ner(
text=text
@@ -64,46 +70,39 @@ result = impresso_session.tools.ner(
result.df.tail(10)
```

{/* cell:5 cell_type:markdown */}

{/* cell:6 cell_type:markdown */}
## Named entity linking

{/* cell:6 cell_type:code */}

{/* cell:7 cell_type:code */}
```python
text = """
Hugging Face will offer the product through [START] Amazon [END] and Google's cloud computing services for $1 per hour and on Digital Ocean, a specialty cloud computing company. Companies will also be able to download the Hugging Face offering to run in their own data centers.
"""
result = impresso_session.tools.nel(
text=text
)
result
```
# We define some test input
text_with_markers = """In the year 1789, King Louis XVI, ruler of France, convened the Estates-General at the Palace of Versailles,
where [START] Marie Antoinette, the Queen of France [END], alongside Maximilien Robespierre, a leading member of the National Assembly,
debated with Jean-Jacques Rousseau, the famous philosopher, and Charles de Talleyrand, the Bishop of Autun,
regarding the future of the French monarchy. At the same time, across the Atlantic in Philadelphia,
George Washington, the first President of the United States, and Thomas Jefferson, the nation's Secretary of State,
were drafting policies for the newly established American government following the signing of the Constitution."""

{/* cell:7 cell_type:code */}
print(text_with_markers)

```python
text = """
Hugging Face proposera le produit via les services de cloud computing d'[START] Amazon [END] et de Google pour 1 dollar par heure, ainsi que sur Digital Ocean, une entreprise spécialisée dans le cloud computing. Les entreprises pourront également télécharger l'offre de Hugging Face pour l'exécuter dans leurs propres centres de données.
"""
result = impresso_session.tools.nel(
text=text
text=text_with_markers
)
result.df
result
```
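
The linking examples above wrap the target mention in `[START]` and `[END]` markers by hand. A minimal sketch of doing this programmatically follows; the marker format is copied from those examples, while the helper `mark_mention` is hypothetical and is not part of the Impresso Python Library.

```python
def mark_mention(text: str, mention: str) -> str:
    """Wrap the first occurrence of `mention` in [START] ... [END] markers."""
    start = text.find(mention)
    if start == -1:
        raise ValueError(f"mention not found: {mention!r}")
    end = start + len(mention)
    return f"{text[:start]}[START] {mention} [END]{text[end:]}"


sentence = "where Marie Antoinette, the Queen of France, debated with Rousseau"
print(mark_mention(sentence, "Marie Antoinette"))
```

Only the first occurrence is marked, which matches the single-mention examples above; a text that mentions the same entity several times would need an explicit character offset instead.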

{/* cell:8 cell_type:markdown */}

## Named entity processing

{/* cell:9 cell_type:code */}

```python
text = """
Hugging Face will offer the product through Amazon and Google's cloud computing services for $1 per hour and on Digital Ocean, a specialty cloud computing company. Companies will also be able to download the Hugging Face offering to run in their own data centers.
"""
result = impresso_session.tools.ner_nel(
text=text
)
result.df
```

{/* cell:10 cell_type:code */}
```python

```