A model to predict the compatibility of scientific literature with the CAZy database. The end goal is to assist biocurators by giving each article a confidence score that assesses its compatibility with the criteria required for integration into the database.
- List of PMIDs/DOIs → PMCIDs (when available, otherwise the PMID is kept) → Full text (when available, scraped with Biblio; otherwise only the abstract is retrieved via eutils) → Preprocessing → Representation → Classification → % confidence score.
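As a minimal sketch of the abstract-fallback step only (assuming the `requests` library; the function name and example PMID are illustrative, not the project's actual fetching code):

```python
import requests

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

def fetch_abstract(pmid: str) -> str:
    """Fetch a plain-text abstract for one PMID via NCBI eutils efetch."""
    params = {"db": "pubmed", "id": pmid, "rettype": "abstract", "retmode": "text"}
    response = requests.get(EUTILS, params=params, timeout=30)
    response.raise_for_status()
    return response.text

if __name__ == "__main__":
    print(fetch_abstract("31296024"))  # illustrative PMID
```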
```
git clone https://github.com/dabane-ghassan/cazy-little-helper.git
cd cazy-little-helper
pip install -r requirements.txt
cd src
```
```
usage: python3 predict.py [-h] -i INPUT_PATH [-p ID_POS] [-b BIBLIO_ADD] [-m MODEL]

Arguments:
  -h, --help            show this help message and exit
  -i INPUT_PATH, --input_path INPUT_PATH
                        [REQUIRED] The input data file path, a .csv file with a column of article IDs.
  -p ID_POS, --id_pos ID_POS
                        [OPTIONAL] The index of the ID column in the input file, default is 0 (first column).
  -b BIBLIO_ADD, --biblio_add BIBLIO_ADD
                        [OPTIONAL] The address of the Biblio package on the PHP server, default is http://10.1.22.212/Biblio
  -m MODEL, --model MODEL
                        [OPTIONAL] The path of the model used to run the predictions, default is CAZy's little helper's pre-trained model based on Aug 2021 data, '../model/cazy_helper.joblib'
```
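For example, to score the IDs listed in a hypothetical `articles.csv` (IDs in the first column) with the pre-trained model:

```
python3 predict.py -i articles.csv
```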
```
usage: python3 create.py [-h] -p OUTPUT_PATH -d DATASET [-b BIBLIO_ADD] [-s VAL_SIZE]

Arguments:
  -h, --help            show this help message and exit
  -p OUTPUT_PATH, --output_path OUTPUT_PATH
                        [REQUIRED] The save path for the new model.
  -d DATASET, --dataset DATASET
                        [REQUIRED] The training dataset, a two-column .csv file.
  -b BIBLIO_ADD, --biblio_add BIBLIO_ADD
                        [OPTIONAL] The address of the Biblio package on the PHP server, default is http://10.1.22.212/Biblio
  -s VAL_SIZE, --val_size VAL_SIZE
                        [OPTIONAL] The validation dataset size, default is 0.15
```
- So let's say you're a passionate CAZy researcher and you want to retrain the model on new data to obtain more accurate confidence scores:
```
mkdir new_model && cd new_model
wget https://raw.githubusercontent.com/dabane-ghassan/cazy-little-helper/main/training/classifier_train.csv
```
Now you can annotate the training dataset with new articles (PMCIDs only) when available, adding 1 to the label column if the new article is compatible and 0 otherwise (see the example rows below).
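For instance, appended rows might look like this (hypothetical PMCIDs; this sketch assumes the ID column comes first and the label column second):

```
PMC8164756,1
PMC8126203,0
```

After this, we can launch the creation of a new model: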
```
python3 create.py -p new_model.joblib -d classifier_train.csv
```
- And now this new model can be used to make predictions! Just specify its path with the -m parameter of the predict CLI.
- This last functionality is just the cherry on top: input any .csv file with a column of article IDs (PMIDs, PMCIDs or DOIs), preferably a single-column .csv file without a header, and it will be converted to the ID type of your choice.
```
usage: python3 find.py [-h] -i INPUT_PATH -t ID_TYPE

Arguments:
  -h, --help            show this help message and exit
  -i INPUT_PATH, --input_path INPUT_PATH
                        [REQUIRED] The input ID file path, a .csv file with a column of article IDs.
  -t ID_TYPE, --id_type ID_TYPE
                        [REQUIRED] The type of ID to find, ['PMID', 'PMCID', 'DOI'], uppercase only.
```
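For example, to convert every ID in a hypothetical `ids.csv` to PMCIDs:

```
python3 find.py -i ids.csv -t PMCID
```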
CAZy's little helper is a TF-IDF/SVM machine learning model: it uses Term Frequency–Inverse Document Frequency (TF-IDF) for text representation and a linear-kernel Support Vector Machine (SVM) for classification.
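As a minimal sketch of the same architecture in scikit-learn (the hyperparameters, the calibration wrapper used to obtain probability output, and the toy data are assumptions, not the trained model's actual settings):

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# TF-IDF turns each document into a sparse term-weight vector;
# the linear SVM then separates the two classes in that space.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    # Calibration wraps the SVM so it can emit a probability,
    # usable as a % confidence score.
    ("svm", CalibratedClassifierCV(LinearSVC(), cv=2)),
])

texts = [
    "glycoside hydrolase degrading cellulose",
    "novel carbohydrate-active enzyme family characterized",
    "galaxy cluster dynamics and dark matter",
    "social media usage among teenagers",
]
labels = [1, 1, 0, 0]  # 1 = CAZy-compatible, 0 = not

pipeline.fit(texts, labels)
print(pipeline.predict_proba(["a new polysaccharide lyase structure"]))
```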
- Performance:
| Test dataset | Precision | Recall | F1-score | Support |
|---|---|---|---|---|
| CAZyDB- | 0.97 | 0.98 | 0.98 | 1326 |
| CAZyDB+ | 0.81 | 0.74 | 0.77 | 135 |
| Accuracy | | | 0.96 | 1461 |
| Macro average | 0.89 | 0.86 | 0.88 | 1461 |
| Weighted average | 0.96 | 0.96 | 0.96 | 1461 |
- Before choosing this particular architecture, a panel of Natural Language Processing (NLP) methods for text classification was tested on the custom dataset created for the CAZy database, ranging from classical text-representation tools such as TF-IDF and word embeddings (Word2Vec), through unsupervised topic modeling with LDA (Latent Dirichlet Allocation), to state-of-the-art deep learning approaches such as BERT (Bidirectional Encoder Representations from Transformers). All of these approaches were benchmarked on the validation and test datasets and their ROC-AUC curves were compared.
*(Benchmark table: one ROC-AUC curve per tested model; figure images omitted.)*
*A soft-voting classifier combining two models, LDA/Random Forest and TF-IDF/SVM.
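As an illustration of how such a comparison can be produced (a sketch using scikit-learn's `roc_auc_score`; the `models` dict and validation split are placeholders, not the project's benchmarking code):

```python
from sklearn.metrics import roc_auc_score

def compare_roc_auc(models, X_val, y_val):
    """Return each fitted model's ROC-AUC on the validation set."""
    scores = {}
    for name, model in models.items():
        # Rank validation examples by the positive-class probability.
        y_score = model.predict_proba(X_val)[:, 1]
        scores[name] = roc_auc_score(y_val, y_score)
    return scores
```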
- Want more information about the dataset and methods? You're more than welcome to take a more detailed look here.
This project was part of a 2-month internship at the Architecture et Fonction des Macromolécules Biologiques laboratory (AFMB, Marseille, France), hosted within the Glycogenomics team.
First, I would like to thank Dr. Nicolas Terrapon for his patience, precious help and invaluable supervision, not to mention the opportunity he gave me to work on such an interesting project. In addition, I would like to deeply thank Dr. Philippe Ortet for his precious ideas, wonderful insights, and the guidance and expertise that helped me navigate various complex subjects throughout the project. Last but not least, I would like to thank the whole Glycogenomics team for their warm hospitality.
MIT Licensed © Ghassan Dabane, 2021.