Pseudo API

This API is part of the document pseudonymization effort led by Etalab's Lab IA. Other Lab IA projects can be found at the Lab IA.

Project Status: [Active]

Intro/Objectives

The purpose of this repo is to provide an API endpoint to pseudonymize documents. The API should make it easy for developers to automate document pseudonymization with their own models.

The larger goal of the pseudonymization project is to help the French Council of State open its court decisions to the general public, as required by law. More information about pseudonymization and this project can be found in our French pseudonymization guide here. Our API uses a Named Entity Recognition (NER) model to find and replace first names, last names, and addresses in court decisions (specifically those of the Council of State).

To use the API, you need to train your own NER model with the Flair library. Unfortunately, we cannot currently share our model or the data it was trained on, as they contain non-public information.

Methods Used

Natural Language Processing: Information Extraction: Named Entity Recognition
Natural Language Processing: Language Modelling / Feature Learning: Word embeddings
Machine Learning: Deep Learning: Recurrent Networks: BiLSTM+CRF

Technologies

Python
Flair, sacremoses
Flask, gunicorn, nginx
SQLite
Pandas
Docker

API Description

The API has two endpoints:

1. Pseudonymization

Analyzes a given string. The output format is determined by the value passed in the output_type field, which is one of {pseudonymized, tagged, conll}.

pseudonymized: Returns a string with the identified entities replaced by a pseudonym,
tagged: Returns a string with the identified entities followed by their assigned tag,
conll: Returns a string following the CoNLL format plus two columns containing the start and end position of the tokens in the original text.

URL : /

Method : POST

Data example : All fields must be sent.

{
    "text": "M. Pierre Sailly demeurant au 14 rue de la Felicité, 75007 Vienne.",
    "output_type": "tagged"
}
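
The request above can be built and sent from Python with the standard library alone. This is a minimal sketch, not the project's own client: the base URL `http://localhost` is a placeholder, the function names are hypothetical, and it assumes the service accepts the body as JSON, which this README does not state explicitly.

```python
import json
import urllib.request

# Output types listed in the API description.
OUTPUT_TYPES = {"pseudonymized", "tagged", "conll"}


def build_request(text, output_type, base_url="http://localhost"):
    """Build a POST request for the pseudonymization endpoint (URL "/").

    Assumptions: base_url is a placeholder and the body is sent as JSON.
    """
    if output_type not in OUTPUT_TYPES:
        raise ValueError(f"output_type must be one of {sorted(OUTPUT_TYPES)}")
    # Both fields are always included, since all fields must be sent.
    payload = {"text": text, "output_type": output_type}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def pseudonymize(text, output_type="pseudonymized", base_url="http://localhost"):
    """Send the request and return the decoded JSON response as a dict."""
    request = build_request(text, output_type, base_url)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

Keeping request construction separate from sending makes the payload easy to inspect or test without a running server. If the service expects form-encoded data rather than JSON, only `build_request` needs to change.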

Success Response

Condition : If everything is OK and the model inference was performed correctly

Code : 200 OK

Content example

{
    "success": true,
    "text": "M. Pierre <B-PER_PRENOM> Sailly <B-PER_NOM> demeurant au 14 <B-LOC> rue <I-LOC> de <I-LOC> la <I-LOC> Felicité <I-LOC> , <I-LOC> 75007 <I-LOC> Vienne <I-LOC> .\n\n"
}
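
The content example above shows each identified token followed by its tag in angle brackets. As a rough sketch of how a client might consume that string, the helpers below split it into (token, tag) pairs and merge B-/I- runs into entities. The function names are hypothetical, and the parsing rule (whitespace-separated pieces, tags of the form `<B-...>` or `<I-...>`) is inferred from this single example rather than from a formal specification.

```python
import re

# A tag looks like <B-PER_NOM> or <I-LOC> in the example response.
TAG_PATTERN = re.compile(r"^<([BI]-[A-Z_]+)>$")


def parse_tagged(text):
    """Return (token, tag) pairs; untagged tokens get the tag "O"."""
    pairs = []
    for piece in text.split():
        match = TAG_PATTERN.match(piece)
        if match and pairs:
            # A tag applies to the token immediately before it.
            token, _ = pairs[-1]
            pairs[-1] = (token, match.group(1))
        else:
            pairs.append((piece, "O"))
    return pairs


def group_entities(pairs):
    """Merge a B- token and the I- tokens that follow it into one entity."""
    entities = []
    current = None  # (label, tokens) of the entity being built
    for token, tag in pairs:
        if tag.startswith("B-"):
            current = (tag[2:], [token])
            entities.append(current)
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            current[1].append(token)
        else:
            current = None
    return [(label, " ".join(tokens)) for label, tokens in entities]
```

Note that a document containing literal text such as `<B-LOC>` would be misread by this approach; the conll output type, which carries token positions, is likely a safer basis for exact alignment with the original text.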

2. API Stats

Returns a map of API usage statistics.

URL : /api_stats

Method : GET

Success Response

Condition : If everything is OK

Code : 200 OK

Content example

{
    "stats_info": {
        "B-LOC": 3,
        "B-PER_NOM": 3,
        "B-PER_PRENOM": 3,
        "I-LOC": 18,
        "LOC": 7,
        "PER_NOM": 9,
        "PER_PRENOM": 7,
        "avg_time_per_doc": 4209.854100431714,
        "avg_time_per_sentence": 358.85739461853643,
        "nb_analyzed_documents": 14,
        "nb_analyzed_sentences": 50,
        "output_type_conll": 4,
        "output_type_pseudonymized": 3,
        "output_type_tagged": 4
    },
    "success": true
}
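
A response like the one above can be condensed into a few derived figures. The sketch below is illustrative only: `summarize_stats` is a hypothetical helper, and it assumes the `output_type_*` keys are per-output-type counters, as their names suggest. It takes the already-decoded JSON response as a dict, so it works with any HTTP client.

```python
def summarize_stats(response):
    """Derive a small summary from a decoded /api_stats response."""
    stats = response["stats_info"]
    prefix = "output_type_"
    # Collect the counters whose key starts with "output_type_".
    count_by_output_type = {
        key[len(prefix):]: value
        for key, value in stats.items()
        if key.startswith(prefix)
    }
    documents = stats["nb_analyzed_documents"]
    sentences = stats["nb_analyzed_sentences"]
    return {
        "count_by_output_type": count_by_output_type,
        # Guard against division by zero on a freshly started service.
        "sentences_per_document": sentences / documents if documents else 0.0,
    }
```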

Getting Started

The easiest way to test this application is by using Docker and Docker Compose.

Clone this repo (for help see this tutorial).
Set the environment variable PSEUDO_MODEL_PATH in the .env file.
Run the wrapper bash script run_docker.sh. It cleans and rebuilds the required Docker containers defined in docker-compose.yml.
Access the API at localhost/ and localhost/api_stats.
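
Before launching the containers, it can help to confirm that PSEUDO_MODEL_PATH is actually set in the .env file. The helper below is a small sketch under the assumption that .env uses plain `KEY=VALUE` lines; `read_env_value` is a hypothetical name, and the model path used in the usage note is a placeholder.

```python
def read_env_value(env_text, key):
    """Return the value of `key` in .env-style text, or None if unset/empty."""
    for line in env_text.splitlines():
        line = line.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, _, value = line.partition("=")
        if name.strip() == key:
            # Drop optional surrounding quotes; treat an empty value as unset.
            return value.strip().strip('"').strip("'") or None
    return None
```

For example, `read_env_value(open(".env").read(), "PSEUDO_MODEL_PATH")` returns the configured path, or `None` if the variable is missing or empty.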

Project Deliverables

Contact

Feel free to contact @psorianom or other Lab IA team members with any questions or if you are interested in contributing!
