CSMeD: Citation Screening Meta-Dataset for systematic review automation evaluation

This package serves as the basis for the paper "CSMeD: Bridging the Dataset Gap in Automated Citation Screening for Systematic Literature Reviews" by Wojciech Kusa, Oscar E. Mendoza, Matthias Samwald, Petr Knoth, Allan Hanbury (2023).

Table of Contents

CSMeD: Title and abstract screening datasets
CSMeD-FT: Full-text screening dataset
Installation
Examples
Visualisations
Experiments

1. CSMeD: Citation screening datasets for title and abstract screening

Original datasets used to create CSMeD are described in the table below:

	Introduced in	# reviews	Domain	Avg. size	Avg. ratio of included (TA)	Avg. ratio of included (FT)	Additional data	Data URL	Cochrane	Publicly available	Included in CSMeD
1	Cohen et al. (2006)	15	Drug	1,249	7.7%	—	—	Web	—	✓	✓
2	Wallace et al. (2010)	3	Clinical	3,456	7.9%	—	—	GitHub	—	✓	✓
3	Howard et al. (2015)	5	Mixed	19,271	4.6%	—	—	Supplementary	—	✓	✓
4	Miwa et al. (2015)	4	Social science	8,933	6.4%	—	—	—	—	—	—
5	Scells et al. (2017)	93	Clinical	1,159	1.2%	—	Search queries	GitHub	✓	✓	✓
6	CLEF TAR 2017	50	DTA	5,339	4.4%	—	Review protocol	GitHub	✓	✓	✓
7	CLEF TAR 2018	30	DTA	7,283	4.7%	—	Review protocol	GitHub	✓	✓	✓
8	CLEF TAR 2019	49	Mixed**	2,659	8.9%	—	Review protocol	GitHub	✓	✓	✓
9	Alharbi et al. (2019)	25	Clinical	4,402	0.4%	—	Review updates	GitHub	✓	✓	✓
10	Hannousse et al. (2022)	7	Computer Science	340	11.7%	—	Review protocol	GitHub	—	✓	✓

TA stands for the title and abstract screening phase, FT for the full-text screening phase. Avg. size is the average size of a review in terms of the number of records retrieved by the search query. Avg. ratio of included (TA) and Avg. ratio of included (FT) are the average ratios of records included in the TA and FT phases, respectively.

CSMeD datasets

Beyond offering unified access to the original datasets, CSMeD provides a meta-dataset that combines all of them. Statistics of the CSMeD datasets are presented in the table below.

Dataset name	#reviews	#docs	#included	Avg. #docs	Avg. %included	Avg. #words in document
CSMeD-basic
CSMeD-basic-train	30	128,438	7,958	4,281	9.6%	229

CSMeD-cochrane
CSMeD-cochrane-train	195	372,422	7,589	1,910	21.9%	180
CSMeD-cochrane-dev	100	229,376	4,365	2,294	20.8%	201

CSMeD-all	325	730,236	19,912	2,247	20.5%	195
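
Note that the pooled ratio of included documents (for example, 7,589 / 372,422 ≈ 2.0% for CSMeD-cochrane-train) is much lower than the reported Avg. %included (21.9%). This is consistent with the column being an average of per-review ratios rather than a ratio over all documents, although the README does not state this explicitly. The sketch below illustrates the difference between the two quantities on made-up review sizes; the function names and the toy numbers are assumptions for illustration and are not part of the CSMeD package.

```python
from statistics import mean


def pooled_ratio(reviews):
    """Included documents divided by all documents, pooled across reviews."""
    total_docs = sum(n_docs for n_docs, _ in reviews)
    total_included = sum(n_inc for _, n_inc in reviews)
    return total_included / total_docs


def macro_avg_ratio(reviews):
    """Mean of the per-review inclusion ratios (every review weighs the same)."""
    return mean(n_inc / n_docs for n_docs, n_inc in reviews)


# Hypothetical toy reviews as (n_docs, n_included) pairs, not real CSMeD data.
toy_reviews = [(10_000, 50), (100, 40), (200, 60)]

print(f"pooled ratio:    {pooled_ratio(toy_reviews):.1%}")
print(f"per-review mean: {macro_avg_ratio(toy_reviews):.1%}")
```

One large review with few included records pulls the pooled ratio down, while small reviews with high inclusion rates pull the per-review mean up, so the two numbers can differ by an order of magnitude.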

2. CSMeD-FT: Full-text screening dataset

Dataset name	#reviews	#docs.	#included	%included	Avg. #words in document	Avg. #words in review
CSMeD-FT-train	148	2,053	904	44.0%	4,535	1,493
CSMeD-FT-dev	36	644	202	31.4%	4,419	1,402
CSMeD-FT-test	29	636	278	43.7%	4,957	2,318
CSMeD-FT-test-small	16	50	22	44.0%	5,042	2,354

Column '#docs' refers to the total number of documents in the dataset, and '#included' is the number of documents included at the full-text step. CSMeD-FT-test-small is a subset of CSMeD-FT-test.

3. Installation

Requirements

Assuming you have conda installed, create an environment for loading CSMeD by running:

$ conda create -n csmed python=3.10
$ conda activate csmed
(csmed)$ pip install -r requirements.txt

Data acquisition prerequisites

To obtain the metadata for CSMeD-Cochrane datasets, you need to configure the cookie for the Cochrane Library website.

Furthermore, to obtain full-text PDFs for CSMeD-FT, you need to configure the following:

SemanticScholar API key: https://www.semanticscholar.org/product/api
CORE API key: https://core.ac.uk/services/api
GROBID: https://grobid.readthedocs.io/en/latest/Install-Grobid/

If you have all the prerequisites, run:

(csmed)$ python configure.py

Then follow the prompts to provide the API keys, cookies, an email address for the PubMed Entrez APIs, and the path to the GROBID server. You don't need to provide all of this information: the bare minimum to construct the datasets is the cookie from the Cochrane Library and the email address for PubMed Entrez.
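
If you plan to use GROBID, it can help to confirm that the server is reachable before entering its address. Below is a minimal sketch, assuming a GROBID service at http://localhost:8070 (GROBID's default port; adjust to your setup) and relying on GROBID's /api/isalive health-check endpoint. The helper names are hypothetical and are not part of this repository or of configure.py.

```python
from urllib.error import URLError
from urllib.request import urlopen


def grobid_isalive_url(base_url):
    """Build the URL of GROBID's health-check endpoint."""
    return base_url.rstrip("/") + "/api/isalive"


def is_grobid_alive(base_url, timeout=5.0):
    """Return True if the server answers the health check with 'true'."""
    try:
        with urlopen(grobid_isalive_url(base_url), timeout=timeout) as resp:
            return resp.status == 200 and resp.read().decode().strip() == "true"
    except (URLError, OSError, ValueError):
        # Connection refused, timeout, HTTP error, or malformed URL.
        return False


if __name__ == "__main__":
    base = "http://localhost:8070"  # assumed address, change as needed
    print(f"GROBID reachable at {base}: {is_grobid_alive(base)}")
```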

Downloading raw full-text datasets

First install additional requirements:

(csmed)$ pip install -r dev-requirements.txt

To download the datasets, run:

(csmed)$ python scripts/prepare_full_texts.py

4. Examples

Examples presenting how to use the datasets are available in the notebooks/ directory.

5. Visualisations

To run the visualisations, first install the additional requirements:

(csmed)$ pip install -r vis-requirements.txt

Then you can run the visualisations using streamlit:

(csmed)$ streamlit run visualisation/_🏠_Home.py

6. Experiments

Baseline experiments from the paper are described in the WojciechKusa/CSMeD-baselines repository.
