NOTE:
- Assume the ONLY working directory is /absolute/path/to/QATask/.
- Run python setup.py install ; pip install -r requirements.txt before doing anything else.
- Run python setup.py install again after making any changes in the drqa folder.
- Clean wiki articles
- Attach an ID to each wiki article
- SQL retrieval by ID
- Retriever returns IDs
- Build a reader
    - Proper slicing
    - Finetuned on the ZaloAI dataset
- Voting and combining retriever and reader scores
- Ensemble 2 readers
- Other retrieval methods for Vietnamese passages:
    - TF-IDF
    - BM25
    - DPR = BERT trained on Vietnamese question+context_passage embeddings + FAISS for searching (see the sketch after this list)
    - ANCE
    - ColBERTv2
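A minimal sketch of the DPR idea above, assuming a sentence-transformers-style encoder (the model path is hypothetical; the repo's actual dense pipeline lives under tools/pysirini/):

import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

# Hypothetical BERT encoder finetuned on Vietnamese question/passage pairs.
encoder = SentenceTransformer("path/to/vietnamese-bert")

passages = ["Hà Nội là thủ đô của Việt Nam.", "Sông Mê Kông chảy qua sáu quốc gia."]
emb = np.asarray(encoder.encode(passages, normalize_embeddings=True), dtype="float32")

# Inner product over L2-normalized vectors equals cosine similarity.
index = faiss.IndexFlatIP(emb.shape[1])
index.add(emb)

query = np.asarray(encoder.encode(["Thủ đô của Việt Nam là gì?"], normalize_embeddings=True), dtype="float32")
scores, ids = index.search(query, 2)  # top-2 passage indices and similarity scores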
First, create a folder named qatask/database/wikipedia_db with an __init__.py inside it.
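For example:

mkdir -p qatask/database/wikipedia_db
touch qatask/database/wikipedia_db/__init__.py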
Download and save ZaloAI's datasets:
- wiki articles as datasets/wikipedia.jsonl
- train and test files as datasets/train_test_files/train_sample.json and datasets/train_test_files/test_sample.json
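The resulting layout should look like this:

datasets/
├── wikipedia.jsonl
└── train_test_files/
    ├── train_sample.json
    └── test_sample.json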
To clean and slice the wiki articles, run:
python3 -m tools.wiki_utils.wiki_slicing --data-path datasets/wikipedia.jsonl --output-path datasets/wikicorpus/wiki.jsonl
Generate the BM25 index. First, create the checkpoint/indexes/BM25 folder:
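mkdir -p checkpoint/indexes/BM25

Then run these two commands to build the index: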
python3 -m tools.pysirini.convert_format_sirini --data-path datasets/wikicorpus/wiki.jsonl --output-path datasets/wikiarticle_retrieve/wiki_sirini.json
python3 -m tools.pysirini.generate_sparse --cfg configs/retriever/BM25.yaml
If you want to use the BM25 post-processor, which retrieves a wikipage as the answer given a short candidate (produced by BERT), run:
python -m qatask.database.sqlite --data_path_fn datasets/wikipedia.jsonl --save_path qatask/database/wikipedia_db/wikisqlite_post.db
python3 -m tools.pysirini.convert_wikipage_sirini --data-path datasets/wikipedia.jsonl --output-path datasets/wikipage_post/page_sirini.jsonl
python3 -m tools.pysirini.generate_sparse --cfg configs/postprocessor/BM25.yaml
After building the BM25 index, run the main pipeline to produce output with the finetuned BERT:
python3 main.py --cfg configs/main/BM25_BERT_val.yaml \
--output-path datasets/output/BM25_BERT_val.json \
--sample-path datasets/train_test_files/train_sample.json \
--mode val \
--size-infer 2000
If you want to use Sirini retrievers, you need to translate the Vietnamese corpus into English and convert it to Sirini format:
# translate Vietnamese corpus into English and convert to Sirini format
python3 -m torch.distributed.launch -m tools.translate_eng
# Create a FAISS index for your favourite Sirini retriever via its config file
python3 -m tools.pysirini.generating_dense --cfg configs/retriever/colbertv2.yaml
Now you have a Sirini searcher that works like a normal retriever (e.g. TF-IDF). Just run main with your config configs/main/colbertv2.yaml:
python3 main.py --cfg configs/main/colbertv2.yaml --output-path qatask/database/datasets/output/colbertv2_answer.json
To finetune the reader, change the parameters in the configs/reader/xlm_roberta.yaml file and run:
python3 tools/finetune_reader/train.py --cfg configs/reader/xlm_roberta.yaml
When training has completed, change the reader.model_checkpoint path in configs/main/*.yaml to the saved checkpoint.
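For example, in configs/main/BM25_BERT_val.yaml (the checkpoint path below is hypothetical):

reader:
  model_checkpoint: checkpoint/reader/xlm_roberta_finetuned  # path to your saved checkpoint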
Alternatively, you can run the TF-IDF retriever baseline, which does not require any of the commands above:
python3 main.py --cfg configs/main/baseline.yaml
By default, running this will calculate the EM (exact match) of the predicted answers (wikipage, number, date):
python3 tools/get_accuracy.py --pred qatask/database/datasets/output/<output_file>.json \
--truth qatask/database/datasets/train_test_files/train_sample.json
Optional: the --answer-acc argument calculates the F1 and EM of the short_candidate_answer; in that case, <output_file>.json must be generated without postprocessing.
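For reference, a minimal sketch of the standard SQuAD-style token-level F1 (the actual implementation in tools/get_accuracy.py may differ):

def f1_score(prediction: str, truth: str) -> float:
    # Token-level F1: harmonic mean of precision and recall over shared tokens.
    pred_tokens, truth_tokens = prediction.split(), truth.split()
    common = sum(min(pred_tokens.count(t), truth_tokens.count(t)) for t in set(truth_tokens))
    if common == 0:
        return 0.0
    precision = common / len(pred_tokens)
    recall = common / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)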
If you want to add new modules, please visit qatask/* and inherit the base classes in base.py. For example:
class XLMReader(BaseReader):
    def __init__(self, cfg):
        ...
Then register your module class in builder.py and change the class name in baseline.yaml, or make your own configuration file your_config.yaml.
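A hypothetical sketch of that registration (the actual builder.py layout may differ):

# qatask/reader/builder.py (hypothetical layout)
from .xlm_reader import XLMReader  # your new module

READERS = {"xlm_reader": XLMReader}

def build_reader(cfg):
    # Instantiate the reader class named in your config file.
    return READERS[cfg.reader.name](cfg)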