
Fine-tuning CamemBERT on a subset of the FQuAD dataset for Question Answering

This repository contains code for a fine-tuning experiment with CamemBERT, a French language model based on the RoBERTa architecture, on a portion of FQuAD (the French Question Answering Dataset) for extractive Question Answering.

Dataset

The FQuAD dataset is a collection of questions and answers in French. It contains over 25,000 question-answer pairs and covers a wide range of topics, including history, science, and literature.

The CamemBERT model was fine-tuned on this dataset to create a French question-answering system. Due to the high computation requirements of the fine-tuning process, a CUDA-enabled GPU was used to train the model for 10 epochs on a subset of the dataset containing ~4,500 question-answer pairs.

Requirements

  • Python 3.10
  • CUDA-enabled GPU
  • CUDA 11.7 (available from NVIDIA's CUDA Toolkit archive)
  • Torch 2.0.0

Install Torch + CUDA 11.7 after creating/activating a virtual environment:

  pip install torch==2.0.0+cu117 -f https://download.pytorch.org/whl/cu117/torch_stable.html 

Note: roughly 2.3 GB will be downloaded.
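
To verify that the CUDA build of Torch was picked up (a quick sanity check, not part of the repository), you can run:

  python -c "import torch; print(torch.cuda.is_available(), torch.version.cuda)"

This should print True and 11.7 if the cu117 build is installed correctly and the GPU is visible.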

Install the required Python packages using pip:

  pip install -r requirements.txt

Prepare datasets

Download the FQuAD dataset from the official FQuAD website, then move the JSON files to data/raw/.
To prepare a subset of the data and split it into train and validation sets, run the make_dataset.py script.

  python src/make_dataset.py -i train.json
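
For reference, this step amounts to flattening the SQuAD-style FQuAD JSON into question-answer records, sampling a subset, and splitting it. The sketch below is only an illustration of that idea: the field names follow the FQuAD/SQuAD format, but the subset size and split ratio are assumptions, and the actual logic lives in src/make_dataset.py.

  import json
  import random

  # Illustration only; the real implementation is src/make_dataset.py.
  with open("data/raw/train.json", encoding="utf-8") as f:
      articles = json.load(f)["data"]

  # Flatten the SQuAD-style nesting (article -> paragraph -> question) into flat records.
  records = []
  for article in articles:
      for paragraph in article["paragraphs"]:
          for qa in paragraph["qas"]:
              records.append({
                  "question": qa["question"],
                  "context": paragraph["context"],
                  "answers": qa["answers"],
              })

  # Keep ~4,500 pairs and split them into train/validation sets (assumed 90/10 split).
  random.seed(42)
  random.shuffle(records)
  subset = records[:4500]
  cut = int(0.9 * len(subset))
  train_set, val_set = subset[:cut], subset[cut:]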

Fine-tuning

To fine-tune CamemBERT on a portion of the FQuAD dataset, run the train.py script:

  python src/train.py

The train.py script loads the dataset, prepares it for training, and fine-tunes the CamemBERT model. The trained model is saved to the models/ directory.
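
In outline, the training step follows the standard Hugging Face recipe for extractive question answering. The sketch below is a simplified illustration: the batch size and learning rate are assumptions, and train_dataset/val_dataset stand in for the tokenized splits produced in the previous step; the actual configuration is in src/train.py.

  from transformers import (AutoModelForQuestionAnswering, AutoTokenizer,
                            Trainer, TrainingArguments)

  # Simplified illustration; src/train.py holds the real hyperparameters.
  tokenizer = AutoTokenizer.from_pretrained("camembert-base")
  model = AutoModelForQuestionAnswering.from_pretrained("camembert-base")

  args = TrainingArguments(
      output_dir="models/",
      num_train_epochs=10,             # 10 epochs, as described above
      per_device_train_batch_size=8,   # assumed batch size
      learning_rate=3e-5,              # assumed learning rate
      evaluation_strategy="epoch",     # evaluate on the validation split each epoch
  )

  trainer = Trainer(
      model=model,
      args=args,
      train_dataset=train_dataset,     # tokenized train split (placeholder)
      eval_dataset=val_dataset,        # tokenized validation split (placeholder)
      tokenizer=tokenizer,
  )
  trainer.train()
  trainer.save_model("models/")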

Evaluate

Evaluation output is currently a bit scattered: final results are saved to results.json at the root of the repository, and intermediate evaluations produced during training are written to models/. This layout will be cleaned up in the future.
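
For context, the metrics involved are the standard SQuAD-style exact match and F1, which can be computed with the Hugging Face evaluate library. The snippet below is a generic illustration of that metric, not the repository's own evaluation code; the ids and answer_start values are placeholders.

  import evaluate

  # Generic SQuAD metric illustration; results.json is produced by the repo's own scripts.
  squad_metric = evaluate.load("squad")

  predictions = [{"id": "q1", "prediction_text": "Le Doric"}]
  references = [{"id": "q1", "answers": {"text": ["Le Doric"], "answer_start": [0]}}]

  print(squad_metric.compute(predictions=predictions, references=references))
  # {'exact_match': 100.0, 'f1': 100.0}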

Inference

To use the trained model for inference on new question-answer pairs, run the predict.py script:

  python src/predict.py --question "A question" --context "A context"

The predict.py script loads the saved model from the models/ directory and uses it to answer the provided question in the given context.

Example:

   python src/predict.py --question "Comment s'appelle le troisième navire commandé ?" --context "Avant la Première Guerre mondiale, l'International Mercantile Marine Co. commande aux chantiers Harland & Wolff la construction de plusieurs navires destinés à ses compagnies. Les deux premiers, le Regina et le Pittsburgh ébauchés en 1913, sont achevés après la guerre et mis en service au début des années 1920. Le Doric est le troisième navire construit sur ce modèle et un quatrième, légèrement plus grand, le Laurentic, suivra en 1927. La quille du Doric est posée bien après la guerre, en 1921, et sa construction est rapide, puisqu'il est lancé dès le 8 août 1922 et livré le 29 mai 1923."
Output:

   Answer: Le Doric
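
Under the hood, this amounts to loading the fine-tuned checkpoint into a Hugging Face question-answering pipeline. The sketch below shows the general idea (it is not necessarily the exact code of predict.py):

  from transformers import pipeline

  # Load the fine-tuned checkpoint saved by train.py and run extractive QA with it.
  qa = pipeline("question-answering", model="models/", tokenizer="models/")

  result = qa(
      question="Comment s'appelle le troisième navire commandé ?",
      context="Le Doric est le troisième navire construit sur ce modèle.",
  )
  print(f"Answer: {result['answer']}")  # Answer: Le Doric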

Demo

The fine-tuned CamemBERT model was deployed with FastAPI. You can try the demo by running app.py:

  python demo/app.py

Once the server is running, open a web browser and go to http://127.0.0.1:8000.
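
The actual demo code is in demo/app.py; a stripped-down FastAPI wrapper for the model looks roughly like the sketch below (the endpoint name and request schema are assumptions):

  import uvicorn
  from fastapi import FastAPI
  from pydantic import BaseModel
  from transformers import pipeline

  # Stripped-down sketch of a QA web service; demo/app.py is the actual implementation.
  app = FastAPI()
  qa = pipeline("question-answering", model="models/", tokenizer="models/")

  class QARequest(BaseModel):
      question: str
      context: str

  @app.post("/predict")  # assumed endpoint name
  def predict(request: QARequest):
      result = qa(question=request.question, context=request.context)
      return {"answer": result["answer"], "score": result["score"]}

  if __name__ == "__main__":
      uvicorn.run(app, host="127.0.0.1", port=8000)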

References

FQuAD: French Question Answering Dataset, d'Hoffschmidt et al., 2020, arXiv:2002.06071