This repository contains the source code for learning medical concept embeddings (MCEs) described in the following paper:
Comparative Effectiveness of Medical Concept Embedding for Feature Engineering in Phenotyping
Junghwan Lee*, Cong Liu*, Jae Hyun Kim, Alex Butler, Ning Shang,
Chao Pang, Karthik Natarajan, Patrick Ryan, Casey Ta, Chunhua Weng
Preprint
We used GloVe, skip-gram, node2vec, LINE, and singular value decomposition (SVD) to learn MCEs. For node2vec and LINE, we used OpenNE, an open-source Python toolkit for network embedding. For SVD, we used SciPy. The source code in this repository implements GloVe and skip-gram.
The dataset should be prepared as two pickle files: the encoded windowed-EHR and the dictionary of encoded medical concepts in the EHR. Example formats of the data are provided in data/.
- Encoded windowed-EHR: A list of windows, where each window contains multiple medical concepts. For example, [["Concept A", "Concept B"], ["Concept A", "Concept C", "Concept D"], ...]. There is no need to delimit different patients or windows, since only the co-occurrence of concepts within the same window is used. Every concept must be encoded as its corresponding integer.
- Concept2id: A mapping dictionary for the encoded medical concepts in the EHR. For example, if "Concept A" is encoded as integer 0 and "Concept B" as 1, concept2id will look like {0: "Concept A", 1: "Concept B"}.
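
The two files described above could be produced along the following lines. This is a minimal sketch, not the preprocessing used in the paper: the toy windows, the helper `build_encoding`, and the output file names are all hypothetical, and the mapping is written in the ID-to-name orientation shown in the example above.

```python
import os
import pickle
import tempfile

# Toy windowed EHR with concept names (hypothetical data, not from the paper).
raw_windows = [
    ["Concept A", "Concept B"],
    ["Concept A", "Concept C", "Concept D"],
]


def build_encoding(windows):
    """Assign each distinct concept an integer ID in order of first appearance."""
    name_to_id = {}
    for window in windows:
        for concept in window:
            if concept not in name_to_id:
                name_to_id[concept] = len(name_to_id)
    return name_to_id


name_to_id = build_encoding(raw_windows)

# Encoded windowed-EHR: every concept name replaced by its integer ID.
encoded_windows = [[name_to_id[c] for c in window] for window in raw_windows]

# Mapping in the orientation shown in the example above: integer ID -> concept name.
concept2id = {idx: name for name, idx in name_to_id.items()}

# Write both objects as pickle files (directory and file names are placeholders).
out_dir = tempfile.mkdtemp()
record_path = os.path.join(out_dir, "encoded_windows.pkl")
mapping_path = os.path.join(out_dir, "concept2id.pkl")
with open(record_path, "wb") as f:
    pickle.dump(encoded_windows, f)
with open(mapping_path, "wb") as f:
    pickle.dump(concept2id, f)

print(encoded_windows)
print(concept2id)
```

The resulting paths are what would be passed to `--input_record` and `--input_concept2id`. Compare the output against the example files in data/ before training, since those files are the authoritative format.
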
- Install Python 3.5.2 and all packages in requirements.txt.
- Prepare the dataset.
- Start training GloVe:
python src/GloVe.py --input_record <"path of the EHR dataset"> --input_concept2id <"path of the concept2id"> --output <"output path"> --dim <"dimensionality of the embedding"> --batch_size <"batch size for training"> --num_epochs <"training epochs"> --learning_rate <"learning rate for optimizer">
To see the description and default value of each hyperparameter, run the help command:
python src/GloVe.py --help
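
GloVe is fit to co-occurrence statistics, and the data description states that only co-occurrence within the same window is used. The sketch below shows one way such counts could be derived from the encoded windows. It is an illustration under stated assumptions, not the logic in src/GloVe.py: the function `count_cooccurrences` is hypothetical, and counting each unordered pair once per window (ignoring repeats and distance) is an assumed convention.

```python
from collections import Counter
from itertools import combinations

# Encoded windows in the input format (toy data).
encoded_windows = [[0, 1], [0, 2, 3]]


def count_cooccurrences(windows):
    """Count how many windows contain each unordered pair of concept IDs."""
    counts = Counter()
    for window in windows:
        # Assumption: a concept repeated inside a window is counted once,
        # and pairs are stored with the smaller ID first.
        for a, b in combinations(sorted(set(window)), 2):
            counts[(a, b)] += 1
    return counts


cooc = count_cooccurrences(encoded_windows)
for pair, n in sorted(cooc.items()):
    print(pair, n)
```
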
- Install Python 3.5.2 and all packages in requirements.txt.
- Prepare the dataset.
- Start training skip-gram:
python src/skipgram.py --input_record <"path of the EHR dataset"> --input_concept2id <"path of the concept2id"> --output <"output path"> --dim <"dimensionality of the embedding"> --batch_size <"batch size for training"> --num_epochs <"training epochs"> --learning_rate <"learning rate for optimizer">
To see the description and default value of each hyperparameter, run the help command:
python src/skipgram.py --help
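
Skip-gram models are trained on (target, context) pairs. With windowed input of the kind described earlier, a natural choice is to treat every other concept in the same window as context. The sketch below illustrates that idea only. It is not taken from src/skipgram.py: the function `skipgram_pairs` is hypothetical, and pairing every position with every other position in the window is an assumption.

```python
from itertools import permutations

# Encoded windows in the input format (toy data).
encoded_windows = [[0, 1], [0, 2, 3]]


def skipgram_pairs(windows):
    """Return (target, context) pairs for all distinct positions in each window."""
    pairs = []
    for window in windows:
        # permutations pairs items by position, so each concept serves as the
        # target once for every other position in the same window.
        pairs.extend(permutations(window, 2))
    return pairs


pairs = skipgram_pairs(encoded_windows)
print(len(pairs))
print(pairs)
```

A window of n concepts yields n * (n - 1) pairs under this scheme, so very long windows may call for subsampling.
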
The data containing protected health information have been removed from all publicly available materials.