This is the official repository of our SIGIR 2022 paper, "Pre-train a Discriminative Text Encoder for Dense Retrieval via Contrastive Span Prediction".
The foundation of effective search is high-quality text representation learning. Modern dense retrieval models usually employ pre-trained models such as BERT as the text encoder. However, there is a gap between the pre-training objectives of BERT-like models and the requirements of dense retrieval, as shown in Figure 1.
Existing works mainly use two types of methods to learn high-quality text sequence representations for dense retrieval, i.e., contrastive learning and autoencoder-based language models. We list the pros and cons of these two methods in Figure 2. In this paper, we therefore propose a novel COntrastive Span predicTion tAsk (COSTA) that leverages the merits of both contrastive learning and autoencoders. The key idea is to force the encoder to generate a text representation that is close to the representations of its own random spans while far away from those of other texts, using a groupwise contrastive loss (a minimal sketch of such a loss follows the list below). Our method uses only the encoder and learns document-level text sequence representations by "reconstructing" multiple spans of the document itself. We do not actually generate the original texts; we only force the text sequence representation to be close to the representations of its own spans at different granularities. In this way, we can
- Learn discriminative text sequence representations effectively, without designing complex data augmentation techniques for contrastive learning.
- Learn expressive text sequence representations efficiently, while thoroughly avoiding the bypass effect of autoencoder-based models.
- Resemble the relevance relationship between queries and documents, since spans of different granularities can be treated as pseudo queries.
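To make the training signal concrete, below is a minimal PyTorch sketch of a groupwise contrastive loss between a document representation and its sampled span representations. This is an illustrative reconstruction under our own assumptions (tensor shapes, temperature, in-batch negatives), not the exact implementation from the paper.

```python
import torch
import torch.nn.functional as F

def groupwise_contrastive_loss(doc_reps, span_reps, temperature=0.05):
    """Pull each document representation toward its own K span
    representations and push it away from spans of other documents.

    doc_reps:  [B, H]    one representation per document (e.g., [CLS])
    span_reps: [B, K, H] K random spans sampled from each document
    """
    B, K, H = span_reps.shape
    doc_reps = F.normalize(doc_reps, dim=-1)                      # [B, H]
    span_reps = F.normalize(span_reps, dim=-1).reshape(B * K, H)  # [B*K, H]

    # Cosine similarity of every document to every span in the batch.
    logits = doc_reps @ span_reps.T / temperature                 # [B, B*K]

    # Spans j*K ... j*K + K-1 belong to document j.
    targets = torch.arange(B, device=logits.device)
    loss = 0.0
    for k in range(K):
        # Treat each of a document's K spans as the positive in turn.
        loss = loss + F.cross_entropy(logits, targets * K + k)
    return loss / K
```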
We have uploaded the COSTA pre-trained models to the Huggingface Hub, so you can easily use COSTA with the Huggingface/Transformers library.
Model identifier on the Huggingface Hub:
- `xyma/COSTA-wiki`: the official COSTA model pre-trained on Wikipedia
For example:

```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("xyma/COSTA-wiki")
model = AutoModel.from_pretrained("xyma/COSTA-wiki")
```
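A quick encoding sketch (the [CLS] pooling below is our assumption for illustration, not necessarily the pooling used in the paper):

```python
import torch

# Encode a passage into a dense vector. COSTA exposes the standard BERT
# interface through Transformers; we take the [CLS] token as the sequence
# representation here.
inputs = tokenizer("what is contrastive span prediction?", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
embedding = outputs.last_hidden_state[:, 0]  # shape: [1, hidden_size]
```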
Download the Wikipedia dump from the website and extract the text with `WikiExtractor.py`, then apply any necessary cleanup and filter out short texts, as in the sketch below.
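As an illustration, a minimal cleanup pass over the WikiExtractor output could look like the following; the directory layout (`./extracted`), the `<doc>` tag regex, and the length threshold are assumptions, not the official preprocessing.

```python
import pathlib
import re

MIN_WORDS = 50  # illustrative threshold for filtering short texts

with open("wiki_clean.txt", "w", encoding="utf-8") as out:
    # WikiExtractor writes files named wiki_00, wiki_01, ... under
    # subdirectories of its output folder (assumed here to be ./extracted).
    for path in sorted(pathlib.Path("extracted").rglob("wiki_*")):
        for line in path.open(encoding="utf-8"):
            text = re.sub(r"</?doc[^>]*>", "", line).strip()  # drop <doc> tags
            if len(text.split()) >= MIN_WORDS:
                out.write(text + "\n")
```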
Download the two MS MARCO dense retrieval datasets from this website and the two TREC 2019 Deep Learning Track datasets from this website. Since these two TREC datasets use the same training and dev sets as the MS MARCO datasets, you only need to download the test files. Put these datasets in `./data/marco-pas` and `./data/marco-doc`.
Stay tuned! Come back soon!
Our fine-tuning code is based on the texttron toolkit.
See README.md for fine-tuning COSTA on passage retrieval datasets.
See README.md for fine-tuning COSTA on document retrieval datasets.
MS MARCO Passage Retrieval | MRR@10 | Recall@1000 | Files |
---|---|---|---|
COSTA (BM25 negs) | 0.342 | 0.959 | Model, Dev (MARCO format), Dev (TREC format) |
COSTA (hard negs) | 0.366 | 0.971 | Model, Dev (MARCO format), Dev (TREC format) |
TREC 2019 Passage Retrieval | NDCG@10 | Recall@1000 | Files |
---|---|---|---|
COSTA (BM25 negs) | 0.635 | 0.773 | Model, Test (TREC format) |
COSTA (hard negs) | 0.704 | 0.816 | Model, Test (TREC format) |
Run the following command to evaluate COSTA on the MS MARCO passage dataset:

```bash
./eval/eval_msmarco_passage.sh ./marco_pas/qrels.dev.tsv ./costa_hd_neg8_e2_bs8_fp16_mrr10_366_r1000_971/encoding/dev.rank.tsv.marco
```
You will get:

```
#####################
MRR @ 10: 0.36564396006731276
QueriesRanked: 6980
#####################
```
Run the following command to evaluate COSTA on the TREC 2019 passage dataset:

```bash
./eval/trec_eval -m ndcg_cut.10 -m recall.1000 -c -l 2 ./marco_pas/qrels.dl19-passage.txt ./costa_hd_neg8_e2_bs8_fp16_mrr10_366_r1000_971/encoding/trec.rank.tsv.trec
```
You will get:

```
recall_1000   all   0.8160
ndcg_cut_10   all   0.7043
```
MS MARCO Document Retrieval | MRR@100 | Recall@100 | Files |
---|---|---|---|
COSTA (1st iteration hard negs) | 0.395 | 0.894 | Model, Dev (MARCO format), Dev (TREC format) |
COSTA (2nd iteration hard negs) | 0.422 | 0.917 | Model, Dev (MARCO format), Dev (TREC format) |
TREC 2019 Document Retrieval | NDCG@10 | Recall@100 | Files |
---|---|---|---|
COSTA (1st iteration hard negs) | 0.582 | 0.278 | Model, Test (TREC format) |
COSTA (2nd iteration hard negs) | 0.626 | 0.320 | Model, Test (TREC format) |
Run the following command to evaluate COSTA on the MS MARCO document dataset:

```bash
./eval/eval_msmarco_doc.sh ./marco_doc/qrels.dev.tsv ./costa_doc_w_doc395hn200_neg8_e1_bs8_extend_doc395_mrr100_422_r100_917/encoding/dev.rank.tsv.marco
```
You will get:

```
#####################
MRR @ 100: 0.4215861855110516
QueriesRanked: 5193
#####################
```
Run the following command to evaluate COSTA on the TREC 2019 document dataset:

```bash
./eval/trec_eval -m ndcg_cut.10 -m recall.100 ./marco_doc/msmarco-trec19-qrels.txt ./costa_doc_w_doc395hn200_neg8_e1_bs8_extend_doc395_mrr100_422_r100_917/encoding/trec.rank.tsv.trec
```
You will get:

```
recall_100    all   0.3202
ndcg_cut_10   all   0.6260
```
If you find our work useful, please consider citing our paper:
```bibtex
@inproceedings{ma2022costa,
  author    = {Ma, Xinyu and Guo, Jiafeng and Zhang, Ruqing and Fan, Yixing and Cheng, Xueqi},
  title     = {Pre-Train a Discriminative Text Encoder for Dense Retrieval via Contrastive Span Prediction},
  booktitle = {Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval},
  series    = {SIGIR '22},
  year      = {2022},
  publisher = {Association for Computing Machinery},
  address   = {New York, NY, USA},
  location  = {Madrid, Spain},
  pages     = {848--858},
  numpages  = {11},
  url       = {https://doi.org/10.1145/3477495.3531772},
  doi       = {10.1145/3477495.3531772}
}
```