dataset-references

The code in this repository trains and applies a Named Entity Recognition (NER) model that detects informal references to datasets in academic literature. The labeled data are derived from the ICPSR Bibliography of Data-Related Literature and the Semantic Scholar Open Research Corpus. This analysis supports the paper "A Natural Language Processing Pipeline for Detecting Informal Data References in Academic Literature."

code/ner-demo.ipynb

Demonstration notebook of NER model applied to a paper

code/spacy-ner.ipynb

Training workflow for spaCy NER model using labeled data
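
As a rough illustration of the kind of labeled data such a workflow consumes, the sketch below turns annotation records into (text, entity offsets) pairs. It assumes records shaped like typical Prodigy NER exports (a `text` field, a `spans` list with `start`, `end`, and `label`, and an `answer` field). The helper name `to_training_pairs` and the sample sentences are hypothetical, and the notebook itself may prepare its data differently.

```python
from typing import Dict, Iterable, List, Tuple

Entity = Tuple[int, int, str]  # (start offset, end offset, label)


def to_training_pairs(records: Iterable[Dict], label: str = "DATASET") -> List[Tuple[str, List[Entity]]]:
    """Keep accepted records and collect the character offsets of `label` spans."""
    pairs = []
    for rec in records:
        # Skip anything the annotator rejected or ignored.
        if rec.get("answer") != "accept":
            continue
        ents = [
            (span["start"], span["end"], span["label"])
            for span in rec.get("spans", [])
            if span["label"] == label
        ]
        pairs.append((rec["text"], ents))
    return pairs


# Hypothetical sample records; offsets are computed rather than typed by hand.
_text = "We used the Health and Retirement Study."
_mention = "Health and Retirement Study"
_start = _text.index(_mention)
sample = [
    {
        "text": _text,
        "spans": [{"start": _start, "end": _start + len(_mention), "label": "DATASET"}],
        "answer": "accept",
    },
    {"text": "Results are shown below.", "spans": [], "answer": "reject"},
]

if __name__ == "__main__":
    for text, ents in to_training_pairs(sample):
        for start, end, ent_label in ents:
            print(ent_label, repr(text[start:end]))
```

Character offsets are used here because spaCy represents entity annotations as character spans with a label. Rejected records are dropped so that only confirmed annotations are kept.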

config.cfg

NER model training parameters

data/

Each dataset contains sentences from academic articles and is named for the source from which it was derived. Training data were labeled, merged, and exported from Prodigy as of May 10, 2022, for use in spaCy with the following recipes:

prodigy db-in dataset_name /path/to/_data.jsonl
prodigy ner.manual dataset_name --label DATASET
prodigy data-to-spacy train --ner bibliography,paperpile,s2orc
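
The `db-in` step above reads a JSONL file, that is, one JSON object per line. A minimal sketch of preparing such a file from raw sentences follows, assuming each record needs only a `text` key. The function names `to_jsonl_lines` and `write_jsonl` and the sample sentences are hypothetical and are not part of this repository.

```python
import json
from pathlib import Path
from typing import Iterable, List


def to_jsonl_lines(sentences: Iterable[str]) -> List[str]:
    """Serialize non-empty sentences as one JSON object per line."""
    lines = []
    for sentence in sentences:
        sentence = sentence.strip()
        if not sentence:
            continue  # skip blank lines so no empty tasks are created
        lines.append(json.dumps({"text": sentence}, ensure_ascii=False))
    return lines


def write_jsonl(sentences: Iterable[str], path: str) -> int:
    """Write the serialized sentences to `path` and return the record count."""
    lines = to_jsonl_lines(sentences)
    Path(path).write_text("".join(line + "\n" for line in lines), encoding="utf-8")
    return len(lines)


# Hypothetical input sentences for illustration only.
example_sentences = [
    "Data come from the National Longitudinal Survey of Youth.",
    "   ",
    "We replicate the analysis using census microdata.",
]
```

Calling `write_jsonl(example_sentences, "sentences.jsonl")` would write two records, since the blank entry is skipped. The output path here is a placeholder to adapt to the actual file passed to `prodigy db-in`.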
