This is the official repository of our SIGIR 2022 paper, "Pre-train a Discriminative Text Encoder for Dense Retrieval via Contrastive Span Prediction".
The foundation of effective search is high-quality text representation learning. Modern dense retrieval models usually employ pre-trained models such as BERT as the text encoder. However, there is a gap between the pre-training objectives of BERT-like models and the requirements of dense retrieval, as shown in Figure 1.
Existing works mainly use two types of methods to learn high-quality text sequence representations for dense retrieval, i.e., contrastive learning and autoencoder-based language models. We list the pros and cons of these two methods in Figure 2. In this paper, we therefore propose a novel COntrastive Span predicTion tAsk (COSTA) that leverages the merits of both contrastive learning and autoencoders. The key idea is to force the encoder to generate a text representation that is close to the representations of its own random spans while far away from those of other texts, using a groupwise contrastive loss (a minimal sketch of such a loss follows the list below). Our method uses only the encoder and learns document-level text sequence representations by "reconstructing" multiple spans of the document itself. We do not actually generate the original texts; we only force the text sequence representation to be close to the representations of its own spans at different granularities. In this way, we can
- Learn discriminative text sequence representations effectively, without designing complex data augmentation techniques for contrastive learning.
- Learn expressive text sequence representations efficiently, while thoroughly avoiding the bypass effect of autoencoder-based models.
- Resemble the relevance relationship between queries and documents, since spans of different granularities can be treated as pseudo queries.
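To make the training signal concrete, below is a minimal PyTorch sketch of a groupwise contrastive loss between a document representation and its sampled span representations. This is an illustrative reconstruction under our own assumptions (tensor shapes, temperature, in-batch negatives), not the exact implementation from the paper.

```python
import torch
import torch.nn.functional as F

def groupwise_contrastive_loss(doc_reps, span_reps, temperature=0.05):
    """Pull each document representation toward its own K span
    representations and push it away from spans of other documents.

    doc_reps:  [B, H]    one representation per document (e.g., [CLS])
    span_reps: [B, K, H] K random spans sampled from each document
    """
    B, K, H = span_reps.shape
    doc_reps = F.normalize(doc_reps, dim=-1)                      # [B, H]
    span_reps = F.normalize(span_reps, dim=-1).reshape(B * K, H)  # [B*K, H]

    # Cosine similarity of every document to every span in the batch.
    logits = doc_reps @ span_reps.T / temperature                 # [B, B*K]

    # Spans j*K ... j*K + K-1 belong to document j.
    targets = torch.arange(B, device=logits.device)
    loss = 0.0
    for k in range(K):
        # Treat each of a document's K spans as the positive in turn.
        loss = loss + F.cross_entropy(logits, targets * K + k)
    return loss / K
```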
We have uploaded the COSTA pre-trained models to the Huggingface Hub, so you can easily use COSTA with the Huggingface/Transformers library.
Model identifier on the Huggingface Hub:
- `xyma/COSTA-wiki`: the official COSTA model pre-trained on Wikipedia
For example:

```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("xyma/COSTA-wiki")
model = AutoModel.from_pretrained("xyma/COSTA-wiki")
```
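A quick encoding sketch (the [CLS] pooling below is our assumption for illustration, not necessarily the pooling used in the paper):

```python
import torch

# Encode a passage into a dense vector. COSTA exposes the standard BERT
# interface through Transformers; we take the [CLS] token as the sequence
# representation here.
inputs = tokenizer("what is contrastive span prediction?", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
embedding = outputs.last_hidden_state[:, 0]  # shape: [1, hidden_size]
```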
Download the Wikipedia dump from the website and extract the text with `WikiExtractor.py`, then apply any necessary cleanup and filter out short texts, as in the sketch below.
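As an illustration, a minimal cleanup pass over the WikiExtractor output could look like the following; the directory layout (`./extracted`), the `<doc>` tag regex, and the length threshold are assumptions, not the official preprocessing.

```python
import pathlib
import re

MIN_WORDS = 50  # illustrative threshold for filtering short texts

with open("wiki_clean.txt", "w", encoding="utf-8") as out:
    # WikiExtractor writes files named wiki_00, wiki_01, ... under
    # subdirectories of its output folder (assumed here to be ./extracted).
    for path in sorted(pathlib.Path("extracted").rglob("wiki_*")):
        for line in path.open(encoding="utf-8"):
            text = re.sub(r"</?doc[^>]*>", "", line).strip()  # drop <doc> tags
            if len(text.split()) >= MIN_WORDS:
                out.write(text + "\n")
```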
Download the two MS MARCO dense retrieval datasets from this website and the two TREC 2019 Deep Learning Track datasets from this website. Since these two TREC datasets use the same training and dev sets as the MS MARCO datasets, you only need to download the test files. Put these datasets in `./data/marco-pas` and `./data/marco-doc`.
Stay tuned! Come back soon!
Our fine-tuning code is based on the texttron toolkit.
See README.md for fine-tuning COSTA on passage retrieval datasets.
See README.md for fine-tuning COSTA on document retrieval datasets.
MS MARCO Passage Retrieval | MRR@10 | Recall@1000 | Files |
---|---|---|---|
COSTA (BM25 negs) | 0.342 | 0.959 | Model, Dev (MARCO format), Dev (TREC format) |
COSTA (hard negs) | 0.366 | 0.971 | Model, Dev (MARCO format), Dev (TREC format) |
TREC 2019 Passage Retrieval | NDCG@10 | Recall@1000 | Files |
---|---|---|---|
COSTA (BM25 negs) | 0.635 | 0.773 | Model, Test (TREC format) |
COSTA (hard negs) | 0.704 | 0.816 | Model, Test (TREC format) |
Run the following command to evaluate COSTA on the MS MARCO passage dataset:

```bash
./eval/eval_msmarco_passage.sh ./marco_pas/qrels.dev.tsv ./costa_hd_neg8_e2_bs8_fp16_mrr10_366_r1000_971/encoding/dev.rank.tsv.marco
```
You will get:

```
#####################
MRR @ 10: 0.36564396006731276
QueriesRanked: 6980
#####################
```
Run the following command to evaluate COSTA on the TREC 2019 passage dataset:

```bash
./eval/trec_eval -m ndcg_cut.10 -m recall.1000 -c -l 2 ./marco_pas/qrels.dl19-passage.txt ./costa_hd_neg8_e2_bs8_fp16_mrr10_366_r1000_971/encoding/trec.rank.tsv.trec
```
You will get:

```
recall_1000   all   0.8160
ndcg_cut_10   all   0.7043
```
MS MARCO Document Retrieval | MRR@100 | Recall@100 | Files |
---|---|---|---|
COSTA (1st iteration hard negs) | 0.395 | 0.894 | Model, Dev (MARCO format), Dev (TREC format) |
COSTA (2nd iteration hard negs) | 0.422 | 0.917 | Model, Dev (MARCO format), Dev (TREC format) |
TREC 2019 Document Retrieval | NDCG@10 | Recall@100 | Files |
---|---|---|---|
COSTA (1st iteration hard negs) | 0.582 | 0.278 | Model, Test (TREC format) |
COSTA (2nd iteration hard negs) | 0.626 | 0.320 | Model, Test (TREC format) |
Run the following command to evaluate COSTA on the MS MARCO document dataset:

```bash
./eval/eval_msmarco_doc.sh ./marco_doc/qrels.dev.tsv ./costa_doc_w_doc395hn200_neg8_e1_bs8_extend_doc395_mrr100_422_r100_917/encoding/dev.rank.tsv.marco
```
You will get:

```
#####################
MRR @ 100: 0.4215861855110516
QueriesRanked: 5193
#####################
```
Run the following command to evaluate COSTA on the TREC 2019 document dataset:

```bash
./eval/trec_eval -m ndcg_cut.10 -m recall.100 ./marco_doc/msmarco-trec19-qrels.txt ./costa_doc_w_doc395hn200_neg8_e1_bs8_extend_doc395_mrr100_422_r100_917/encoding/trec.rank.tsv.trec
```
You will get:

```
recall_100    all   0.3202
ndcg_cut_10   all   0.6260
```
If you find our work useful, please consider citing our paper:
```bibtex
@inproceedings{ma2022costa,
  author    = {Ma, Xinyu and Guo, Jiafeng and Zhang, Ruqing and Fan, Yixing and Cheng, Xueqi},
  title     = {Pre-Train a Discriminative Text Encoder for Dense Retrieval via Contrastive Span Prediction},
  booktitle = {Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval},
  series    = {SIGIR '22},
  year      = {2022},
  publisher = {Association for Computing Machinery},
  address   = {New York, NY, USA},
  location  = {Madrid, Spain},
  pages     = {848--858},
  numpages  = {11},
  url       = {https://doi.org/10.1145/3477495.3531772},
  doi       = {10.1145/3477495.3531772}
}
```