This API is part of the document pseudonymization effort led by Etalab's Lab IA. Other Lab IA projects can be found at the Lab IA.
The purpose of this repo is to provide an API endpoint for pseudonymizing documents. The API should make it easy for developers to automate document pseudonymization with their own models.
The larger goal of the pseudonymization project is to help the French Council of State open its court decisions to the general public, as required by law. More info about pseudonymization and this project can be found in our French pseudonymization guide. Our API uses a Named Entity Recognition (NER) model to find and replace first names, last names, and addresses in court decisions (specifically those of the Council of State).
You need to train your own NER model with the Flair library. Unfortunately, we cannot currently share our model or the data it was trained on, as they contain non-public information.
- Natural Language Processing: Information Extraction: Named Entity Recognition
- Natural Language Processing: Language Modelling / Feature Learning: Word embeddings
- Machine Learning: Deep Learning: Recurrent Networks: BiLSTM+CRF
- Python
- Flair, sacremoses
- Flask, gunicorn, nginx
- SQLite
- Pandas
- Docker
The API has two endpoints:
Analyzes a given string. The output is decided by the string passed to the output_type field. It may be one of {pseudonymized, tagged, conll}.
- pseudonymized: Returns a string with the identified entities replaced by a pseudonym.
- tagged: Returns a string with the identified entities followed by their assigned tag.
- conll: Returns a string following the CoNLL format plus two columns containing the start and end position of the tokens in the original text.
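To make the tagged output easier to consume, the sketch below splits it back into tokens and entity spans. It assumes that each tag is a whitespace-separated item of the form `<B-LOC>` placed right after the token it labels, which is what the content example of this endpoint suggests. The helper names `parse_tagged` and `group_entities` are hypothetical and not part of the API.

```python
import re

# Assumption: a tag is a whitespace-separated item such as "<B-LOC>" that
# directly follows the token it labels.
TAG_PATTERN = re.compile(r"^<([^<>\s]+)>$")


def parse_tagged(text):
    """Split tagged output into (token, tag) pairs; tag is None if untagged."""
    pairs = []
    for item in text.split():
        match = TAG_PATTERN.match(item)
        if match and pairs and pairs[-1][1] is None:
            # Attach the tag to the token that precedes it.
            token, _ = pairs[-1]
            pairs[-1] = (token, match.group(1))
        else:
            pairs.append((item, None))
    return pairs


def group_entities(pairs):
    """Merge consecutive B-/I- tagged tokens into (label, text) spans."""
    entities = []
    open_label = None  # label of the entity currently being extended
    for token, tag in pairs:
        if tag is None:
            open_label = None
            continue
        prefix, _, label = tag.partition("-")
        if not label:
            # Tag without a B-/I- prefix: treat the whole tag as the label.
            prefix, label = "B", prefix
        if prefix == "I" and open_label == label:
            kind, text = entities[-1]
            entities[-1] = (kind, text + " " + token)
        else:
            entities.append((label, token))
        open_label = label
    return entities
```

Calling `group_entities(parse_tagged(response_text))` on a tagged response yields one `(label, text)` pair per detected entity, for example a single `LOC` span for a full address.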
URL : /
Method : POST
Data example : All fields must be sent.
{
"text": "M. Pierre Sailly demeurant au 14 rue de la Felicité, 75007 Vienne.",
"output_type": "conll"
}
Condition : If everything is OK and the model inference was performed correctly
Code : 200 OK
Content example
{
"success": true,
"text": "M. Pierre <B-PER_PRENOM> Sailly <B-PER_NOM> demeurant au 14 <B-LOC> rue <I-LOC> de <I-LOC> la <I-LOC> Felicité <I-LOC> , <I-LOC> 75007 <I-LOC> Vienne <I-LOC> .\n\n"
}
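A request to this endpoint could be sent from Python roughly as follows. This is a minimal sketch using only the standard library. It assumes the API is served at `http://localhost/` (as in the Docker setup of this README) and that the two fields are accepted as a JSON body; if the server expects form-encoded fields instead, the payload would need to be encoded with `urllib.parse.urlencode`. The function names `build_request` and `analyze` are hypothetical.

```python
import json
import urllib.request

# Assumptions: base URL and JSON request body (see the note above).
BASE_URL = "http://localhost/"
OUTPUT_TYPES = ("pseudonymized", "tagged", "conll")


def build_request(text, output_type, base_url=BASE_URL):
    """Build (but do not send) the POST request for the analysis endpoint."""
    if output_type not in OUTPUT_TYPES:
        raise ValueError(f"output_type must be one of {OUTPUT_TYPES}")
    body = json.dumps({"text": text, "output_type": output_type}).encode("utf-8")
    return urllib.request.Request(
        base_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def analyze(text, output_type, base_url=BASE_URL, timeout=60):
    """Send the request and return the processed text from the response."""
    request = build_request(text, output_type, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.load(response)
    if not payload.get("success"):
        raise RuntimeError("the API reported a failure")
    return payload["text"]
```

With the containers running, `analyze("M. Pierre Sailly demeurant au 14 rue de la Felicité, 75007 Vienne.", "tagged")` should return the `text` field of the response.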
Returns a map of API usage statistics.
URL : /api_stats
Method : GET
Condition : If everything is OK
Code : 200 OK
Content example
{
"stats_info": {
"B-LOC": 3,
"B-PER_NOM": 3,
"B-PER_PRENOM": 3,
"I-LOC": 18,
"LOC": 7,
"PER_NOM": 9,
"PER_PRENOM": 7,
"avg_time_per_doc": 4209.854100431714,
"avg_time_per_sentence": 358.85739461853643,
"nb_analyzed_documents": 14,
"nb_analyzed_sentences": 50,
"output_type_conll": 4,
"output_type_pseudonymized": 3,
"output_type_tagged": 4
},
"success": true
}
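The statistics map can be post-processed on the client side. The sketch below regroups the `output_type_*` counters and derives the average number of sentences per document from a response shaped like the content example. The README does not state the exact meaning of each counter or the unit of the timing fields, so the code only rearranges the values it is given; the function name `summarize_stats` is hypothetical.

```python
def summarize_stats(payload):
    """Derive a few figures from an /api_stats response body (a dict)."""
    if not payload.get("success"):
        raise ValueError("stats response does not report success")
    stats = payload["stats_info"]

    # Collect the per-output-type request counters under their short names.
    prefix = "output_type_"
    requests_by_output = {
        key[len(prefix):]: count
        for key, count in stats.items()
        if key.startswith(prefix)
    }

    documents = stats.get("nb_analyzed_documents", 0)
    sentences = stats.get("nb_analyzed_sentences", 0)
    return {
        "requests_by_output": requests_by_output,
        # Guard against division by zero on a freshly started API.
        "sentences_per_document": sentences / documents if documents else 0.0,
    }
```

The input is the decoded JSON body of a `GET /api_stats` call, for instance obtained with `json.load(urllib.request.urlopen(...))`.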
The easiest way to test this application is by using Docker and Docker Compose.
- Clone this repo (for help see this tutorial).
- Set the environment variable PSEUDO_MODEL_PATH in the .env file.
- Launch the wrapper bash file run_docker.sh. This file will clean and rebuild the required Docker containers by calling docker-compose.yml.
- Access the API at localhost/ and localhost/api_stats.
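Before launching the containers, it can help to confirm that the environment variable is actually defined. The sketch below reads a `KEY=VALUE` style env file and looks up `PSEUDO_MODEL_PATH`. The helper names are hypothetical, and whether the value should point to a path on the host or inside the container is not stated here, so the existence check is only an assumption-based hint.

```python
from pathlib import Path


def read_env_value(env_path, key):
    """Return the value of `key` from a KEY=VALUE env file, or None."""
    for raw_line in Path(env_path).read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, _, value = line.partition("=")
        if name.strip() == key:
            # Drop surrounding whitespace and optional quotes.
            return value.strip().strip("'\"")
    return None


def check_model_path(env_path=".env"):
    """Return (value, exists_on_host) for PSEUDO_MODEL_PATH."""
    value = read_env_value(env_path, "PSEUDO_MODEL_PATH")
    if not value:
        raise ValueError(f"PSEUDO_MODEL_PATH is not set in {env_path}")
    # Assumption: the value refers to a path visible from the host machine.
    return value, Path(value).exists()
```

Running `check_model_path()` from the repo root before `run_docker.sh` gives an early error if the variable is missing.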
- Feel free to contact @psorianom or other Lab IA team members with any questions or if you are interested in contributing!