QATask

NOTE:

Assume the ONLY working directory is /absolute/path/to/QATask/.
Run python setup.py install ; pip install -r requirements.txt before doing anything else.
Run python setup.py install again after making any changes in the drqa folder.

To do:

Possible retrievers:

How to create database sqlite

First, create a folder named qatask/database/wikipedia_db with an __init__.py inside it.
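The folder setup above can also be scripted. The following is a minimal sketch using only the standard library; the helper name ensure_package_dir is hypothetical and not part of this repository.

```python
from pathlib import Path


def ensure_package_dir(path: str) -> Path:
    """Create `path` (with parents) and an empty __init__.py inside it.

    Hypothetical helper, not part of the QATask code base.
    """
    pkg = Path(path)
    pkg.mkdir(parents=True, exist_ok=True)      # no error if it already exists
    (pkg / "__init__.py").touch(exist_ok=True)  # makes the folder importable
    return pkg


# Usage, from the QATask working directory:
# ensure_package_dir("qatask/database/wikipedia_db")
```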

Download and save ZaloAI's datasets:

Wiki articles as datasets/wikipedia.jsonl
Train and test files as datasets/train_test_files/train_sample.json and datasets/train_test_files/test_sample.json

To clean and slice the wiki articles, run:

python3 -m tools.wiki_utils.wiki_slicing --data-path datasets/wikipedia.jsonl --output-path datasets/wikicorpus/wiki.jsonl
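To illustrate what "slicing" an article into passages can look like, here is a minimal sketch of overlapping word windows. It is an assumption about the general technique, not the actual logic of tools.wiki_utils.wiki_slicing; the function name and the size and stride defaults are hypothetical.

```python
def slice_words(text, size=100, stride=50):
    """Split `text` into overlapping passages of at most `size` words.

    Hypothetical sketch: consecutive passages start `stride` words apart,
    so with stride < size they overlap by (size - stride) words.
    """
    words = text.split()
    if not words:
        return []
    passages = []
    for start in range(0, len(words), stride):
        passages.append(" ".join(words[start:start + size]))
        if start + size >= len(words):  # this window already reaches the end
            break
    return passages
```

Overlapping windows reduce the chance that an answer span is cut in half at a passage boundary, at the cost of a larger corpus.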

BM25

To generate the BM25 index, first create the checkpoint/indexes/BM25 folder, then run the commands below.

Build retriever indexes

python3 -m tools.pysirini.convert_format_sirini --data-path datasets/wikicorpus/wiki.jsonl --output-path datasets/wikiarticle_retrieve/wiki_sirini.json

python3 -m tools.pysirini.generate_sparse --cfg configs/retriever/BM25.yaml
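For readers unfamiliar with BM25, the ranking function behind a sparse index can be sketched in a few lines of pure Python. This is a textbook Okapi BM25 variant for illustration only; it is not the indexing code used by the commands above, and the k1 and b defaults are illustrative values.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=0.9, b=0.4):
    """Score each tokenized document against a tokenized query.

    Illustrative sketch; assumes `docs` is a non-empty list of token lists.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates with k1 and is length-normalized by b.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

Documents that share no term with the query score 0, and rarer query terms contribute more through the idf factor.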

To use the BM25 postprocessor, which retrieves a wikipage as the answer given a short candidate (produced by BERT), run the following steps.

Build database for postprocess

python -m qatask.database.sqlite --data_path_fn datasets/wikipedia.jsonl --save_path qatask/database/wikipedia_db/wikisqlite_post.db
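As a rough picture of what such a database build involves, the sketch below loads a JSON Lines file into a SQLite table with the standard sqlite3 module. The table name documents and the record keys title and text are assumptions made for illustration; the actual schema produced by qatask.database.sqlite may differ.

```python
import json
import sqlite3


def build_db(jsonl_path, db_path, id_key="title", text_key="text"):
    """Load JSONL records into a two-column SQLite table and return the connection.

    `id_key` and `text_key` are assumed field names, not confirmed by the repo.
    """
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS documents (id TEXT PRIMARY KEY, text TEXT)"
    )
    with open(jsonl_path, encoding="utf-8") as f:
        # Skip blank lines, parse each remaining line as one JSON record.
        records = map(json.loads, filter(str.strip, f))
        rows = ((rec[id_key], rec[text_key]) for rec in records)
        conn.executemany("INSERT OR REPLACE INTO documents VALUES (?, ?)", rows)
    conn.commit()
    return conn


def fetch_text(conn, doc_id):
    """Return the stored text for `doc_id`, or None if it is absent."""
    row = conn.execute(
        "SELECT text FROM documents WHERE id = ?", (doc_id,)
    ).fetchone()
    return row[0] if row else None
```

A keyed lookup like fetch_text is what lets a postprocessor turn a retrieved page identifier back into page content without scanning the whole dump.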

Build postprocessor indexes

python3 -m tools.pysirini.convert_wikipage_sirini --data-path datasets/wikipedia.jsonl --output-path datasets/wikipage_post/page_sirini.jsonl
              
python3 -m tools.pysirini.generate_sparse --cfg configs/postprocessor/BM25.yaml

Running inference

Once the BM25 index is built, run the main pipeline with the fine-tuned BERT reader to produce the output:

python3 main.py --cfg configs/main/BM25_BERT_val.yaml \
                --output-path datasets/output/BM25_BERT_val.json \
                --sample-path datasets/train_test_files/train_sample.json \
                --mode val \
                --size-infer 2000

Faiss Retriever

To use Sirini retrievers, you first need to translate the Vietnamese corpus into English and convert it to Sirini format. Run the following commands:

# Translate the Vietnamese corpus into English and convert it to Sirini format
python3 -m torch.distributed.launch -m tools.translate_eng

# Create a FAISS index for your chosen Sirini retriever from its config file
python3 -m tools.pysirini.generating_dense --cfg configs/retriever/colbertv2.yaml

A Sirini searcher now works like any other retriever (e.g. TFIDF). Run main.py with your config, here configs/main/colbertv2.yaml:

python3 main.py --cfg configs/main/colbertv2.yaml --output-path qatask/database/datasets/output/colbertv2_answer.json
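Conceptually, a flat inner-product index answers a query by scoring every stored vector against the query vector and keeping the best k. The pure-Python sketch below shows that idea only; it does not use FAISS, and ColBERT-style late interaction is more involved than this single-vector case. The function name is hypothetical.

```python
def top_k_inner_product(query_vec, doc_vecs, k=3):
    """Return the k best (doc_index, score) pairs by inner product.

    Brute-force illustration of single-vector dense retrieval.
    """
    scores = [sum(q * d for q, d in zip(query_vec, vec)) for vec in doc_vecs]
    ranked = sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in ranked[:k]]
```

A real index avoids the full scan or accelerates it, but the ranking criterion is the same.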

Fine-tune reader

Change the parameters in the configs/reader/xlm_roberta.yaml file and run the following:

python3 tools/finetune_reader/train.py --cfg configs/reader/xlm_roberta.yaml

When training completes, change the reader.model_checkpoint path in configs/main/*.yaml to the saved checkpoint.

Main pipeline

Alternatively, you can run the TFIDF retriever baseline, which does not require any of the commands above:

python3 main.py --cfg configs/main/baseline.yaml

Evaluate on the train dataset

By default, the following command calculates the exact match (EM) of the predicted answers (wikipage, number, date):

python3 tools/get_accuracy.py --pred  qatask/database/datasets/output/<output_file>.json \
                              --truth qatask/database/datasets/train_test_files/train_sample.json

Optional: the --answer-acc argument calculates the F1 and EM of the short_candidate_answer; in that case, <output_file>.json must be generated without postprocessing.
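For reference, EM and token-level F1 are commonly computed as in the sketch below. This follows the usual SQuAD-style definitions with a simplified normalization (lowercasing and whitespace collapsing); it is an assumption about the metrics in general, not a copy of tools/get_accuracy.py.

```python
from collections import Counter


def normalize(text):
    """Lowercase and collapse whitespace (simplified normalization)."""
    return " ".join(text.lower().split())


def exact_match(pred, truth):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(pred) == normalize(truth))


def f1_score(pred, truth):
    """Token-level F1 between a predicted and a reference answer."""
    pred_tokens = normalize(pred).split()
    truth_tokens = normalize(truth).split()
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(truth_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)
```

EM rewards only exact answers, while F1 gives partial credit when a prediction contains the reference answer plus extra tokens.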

Customize

To add a new module, go to qatask/* and inherit from the classes in base.py. For example:

class XLMReader(BaseReader):
    def __init__(self, cfg):
        ...

Then register your module class in builder.py, and either change the class name in baseline.yaml or create your own configuration file, your_config.yaml.
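One common way to implement this kind of registration is a name-to-class registry that the builder consults when reading the config. The sketch below is a generic illustration of that pattern; READERS, register_reader, build_reader, MyReader, the stand-in BaseReader and the config layout are all hypothetical and may not match what builder.py actually does.

```python
READERS = {}  # hypothetical registry: config name -> reader class


def register_reader(name):
    """Class decorator that records a reader class under `name`."""
    def decorator(cls):
        READERS[name] = cls
        return cls
    return decorator


class BaseReader:
    """Stand-in for the project's base class (assumed interface)."""

    def __init__(self, cfg):
        self.cfg = cfg

    def __call__(self, question, passages):
        raise NotImplementedError


@register_reader("my_reader")
class MyReader(BaseReader):
    def __call__(self, question, passages):
        # Toy behaviour: return the first passage as the "answer".
        return passages[0] if passages else ""


def build_reader(cfg):
    """Instantiate the reader named in the (assumed) cfg['reader']['name']."""
    return READERS[cfg["reader"]["name"]](cfg)
```

With a registry, switching modules only requires changing the name in the configuration file, not the pipeline code.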
