Implementation of the paper "Improved Clinical Abbreviation Expansion via Non-Sense-Based Approaches", presented at the ML4H (Machine Learning for Health) workshop at NeurIPS 2020.
This repository contains the non-sense-based (without gloss) and sense-based (with gloss) approaches to clinical abbreviation expansion built on BERT (the code for the permutation language model variant will be released in a separate repository). The code is based on BlueBERT (previously named NCBI-BERT), a biomedical version of BERT.
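The core idea of the non-sense-based approach is to treat expansion as masked language modeling: the abbreviation is replaced with [MASK] tokens and the model scores each candidate expansion in context. Below is a minimal sketch of this idea using the HuggingFace `transformers` library and a generic BERT checkpoint as a stand-in for BlueBERT; it is an illustration only, not this repository's TensorFlow 1.x code, and the sentence and candidate list are made up.

```python
# Illustration only: scoring candidate expansions of an abbreviation with a
# masked LM. Uses HuggingFace transformers and a generic BERT checkpoint as a
# stand-in for BlueBERT; not the repository's TensorFlow 1.x code.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

# Made-up clinical context; "{}" marks where the abbreviation "pt" appeared.
context = "The {} was discharged home in stable condition."
candidates = ["patient", "physical therapy", "prothrombin time"]

def score(expansion):
    # Replace the abbreviation with one [MASK] per wordpiece of the candidate,
    # then sum the log-probabilities the model assigns to those wordpieces.
    ids = tokenizer(expansion, add_special_tokens=False)["input_ids"]
    masked = context.format(" ".join([tokenizer.mask_token] * len(ids)))
    inputs = tokenizer(masked, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    positions = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
    log_probs = torch.log_softmax(logits[0, positions], dim=-1)
    return sum(log_probs[i, tok].item() for i, tok in enumerate(ids))

print(max(candidates, key=score))  # pick the highest-scoring expansion
```

In the actual pipeline, BlueBERT is fine-tuned on the target dataset before evaluation; see the commands below.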
- TensorFlow 1.12+
- Pre-trained model of BlueBERT
- A clinical abbreviation expansion dataset (MSH, UMN, or ShARe/CLEF 2013 Task 2)
# Install the required Python packages in your environment
$ pip install -r requirement.txt
# Download the BlueBERT parameters
$ wget https://ftp.ncbi.nlm.nih.gov/pub/lu/Suppl/NCBI-BERT/NCBI_BERT_pubmed_mimic_uncased_L-12_H-768_A-12.zip
$ unzip NCBI_BERT_pubmed_mimic_uncased_L-12_H-768_A-12.zip -d bert_models
# Download and prepare dataset
$ ./scripts/download_umn.sh # UMN
$ ./scripts/download_msh.sh # MSH (manual download of the dataset required)
# For the ShARe/CLEF dataset, run the notebook scripts/preprocess_sc13t2.ipynb (manual download and installation required)
# Fine-tune and evaluate the model.
$ ./scripts/umn_masklm2.sh # Masked LM on UMN (one fold of the 10-fold CV)
$ ./scripts/msh_masklm2_new.sh # Masked LM on MSH (one fold of the 10-fold CV)
$ ./scripts/sc13t2_masklm2.sh # Masked LM on ShARe/CLEF. Then run scripts/evaluate_sc13t2_lrabr.ipynb to compute the accuracy on unseen test examples.
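For reference, here is a hypothetical sketch of what an accuracy restricted to unseen examples can look like, under the assumption that "unseen" means the gold expansion never occurred during fine-tuning; the exact criterion is defined in scripts/evaluate_sc13t2_lrabr.ipynb, and the function and data below are made up.

```python
# Hypothetical sketch of an "unseen" accuracy computation; the actual criterion
# is defined in scripts/evaluate_sc13t2_lrabr.ipynb.
def unseen_accuracy(predictions, golds, train_expansions):
    # Keep only test examples whose gold expansion never occurred in training.
    unseen = [(p, g) for p, g in zip(predictions, golds) if g not in train_expansions]
    return sum(p == g for p, g in unseen) / max(len(unseen), 1)

# Example usage with made-up data:
preds = ["intravenous", "patient", "prothrombin time"]
golds = ["intravenous", "physical therapy", "prothrombin time"]
seen_in_train = {"intravenous"}
print(unseen_accuracy(preds, golds, seen_in_train))  # 0.5
```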
We thank the authors of BERT and BlueBERT for their implementations and for the weights pre-trained on biomedical corpora.
@inproceedings{juyong2020improved,
  author    = {Juyong Kim and Linyuan Gong and Justin Khim and Jeremy C. Weiss and Pradeep Ravikumar},
  title     = {Improved Clinical Abbreviation Expansion via Non-Sense-Based Approaches},
  booktitle = {Proceedings of the Machine Learning for Health NeurIPS Workshop (ML4H 2020)},
  year      = {2020}
}