A model to predict the compatibility of scientific literature with the CAZy database. The end goal is to assist biocurators by giving each article a confidence score that assesses its compatibility with the criteria required for integration into the database.
- List of PMIDs/DOIs → PMCIDs (when available, otherwise the PMID is kept) → Full text (when available, scraped with Biblio; otherwise only the abstract is retrieved via eutils) → Preprocessing → Representation → Classification → % confidence score.
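As a minimal sketch of the abstract-fallback step only (assuming the `requests` library; the function name and example PMID are illustrative, not the project's actual fetching code):

```python
import requests

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

def fetch_abstract(pmid: str) -> str:
    """Fetch a plain-text abstract for one PMID via NCBI eutils efetch."""
    params = {"db": "pubmed", "id": pmid, "rettype": "abstract", "retmode": "text"}
    response = requests.get(EUTILS, params=params, timeout=30)
    response.raise_for_status()
    return response.text

if __name__ == "__main__":
    print(fetch_abstract("31296024"))  # illustrative PMID
```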
```
git clone https://github.com/dabane-ghassan/cazy-little-helper.git
cd cazy-little-helper
pip install -r requirements.txt
cd src
```
```
usage: python3 predict.py [-h] -i INPUT_PATH [-p ID_POS] [-b BIBLIO_ADD] [-m MODEL]

Arguments:
  -h, --help            show this help message and exit
  -i INPUT_PATH, --input_path INPUT_PATH
                        [REQUIRED] The input data file path, a .csv file with a column of article IDs.
  -p ID_POS, --id_pos ID_POS
                        [OPTIONAL] The index of the ID column in the input file, default is 0 (first column).
  -b BIBLIO_ADD, --biblio_add BIBLIO_ADD
                        [OPTIONAL] The address of the Biblio package on the PHP server, default is http://10.1.22.212/Biblio
  -m MODEL, --model MODEL
                        [OPTIONAL] The path of the model used to run the predictions, default is CAZy's little helper's pre-trained model based on Aug 2021 data, '../model/cazy_helper.joblib'
```
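For example, to score the IDs listed in a hypothetical `articles.csv` (IDs in the first column) with the pre-trained model:

```
python3 predict.py -i articles.csv
```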
```
usage: python3 create.py [-h] -p OUTPUT_PATH -d DATASET [-b BIBLIO_ADD] [-s VAL_SIZE]

Arguments:
  -h, --help            show this help message and exit
  -p OUTPUT_PATH, --output_path OUTPUT_PATH
                        [REQUIRED] The save path for the new model.
  -d DATASET, --dataset DATASET
                        [REQUIRED] The training dataset, a two-column .csv file.
  -b BIBLIO_ADD, --biblio_add BIBLIO_ADD
                        [OPTIONAL] The address of the Biblio package on the PHP server, default is http://10.1.22.212/Biblio
  -s VAL_SIZE, --val_size VAL_SIZE
                        [OPTIONAL] The validation dataset size, default is 0.15
```
- So let's say you're a passionate CAZy researcher and you want to retrain the model on new data to obtain more accurate confidence scores:
```
mkdir new_model && cd new_model
wget https://raw.githubusercontent.com/dabane-ghassan/cazy-little-helper/main/training/classifier_train.csv
```
Now you can annotate the training dataset with new articles (PMCIDs only) when available, adding 1 to the label column if the new article is compatible and 0 otherwise (see the example rows below).
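For instance, appended rows might look like this (hypothetical PMCIDs; this sketch assumes the ID column comes first and the label column second):

```
PMC8164756,1
PMC8126203,0
```

After this, we can launch the creation of a new model: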
```
python3 create.py -p new_model.joblib -d classifier_train.csv
```
- And now this new model can be used to make predictions! Just specify its path with the -m parameter of the predict CLI.
- This last functionality is just the cherry on top: input any .csv file with a column of article IDs (PMIDs, PMCIDs or DOIs), preferably a single-column .csv file without a header, and it will be converted to the ID type of your choice.
```
usage: python3 find.py [-h] -i INPUT_PATH -t ID_TYPE

Arguments:
  -h, --help            show this help message and exit
  -i INPUT_PATH, --input_path INPUT_PATH
                        [REQUIRED] The input ID file path, a .csv file with a column of article IDs.
  -t ID_TYPE, --id_type ID_TYPE
                        [REQUIRED] The type of ID to find, ['PMID', 'PMCID', 'DOI'], uppercase only.
```
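For example, to convert every ID in a hypothetical `ids.csv` to PMCIDs:

```
python3 find.py -i ids.csv -t PMCID
```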
CAZy's little helper is a TF-IDF/SVM machine learning model: it uses Term Frequency–Inverse Document Frequency (TF-IDF) for text representation and a linear-kernel Support Vector Machine (SVM) for classification.
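As a minimal sketch of the same architecture in scikit-learn (the hyperparameters, the calibration wrapper used to obtain probability output, and the toy data are assumptions, not the trained model's actual settings):

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# TF-IDF turns each document into a sparse term-weight vector;
# the linear SVM then separates the two classes in that space.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    # Calibration wraps the SVM so it can emit a probability,
    # usable as a % confidence score.
    ("svm", CalibratedClassifierCV(LinearSVC(), cv=2)),
])

texts = [
    "glycoside hydrolase degrading cellulose",
    "novel carbohydrate-active enzyme family characterized",
    "galaxy cluster dynamics and dark matter",
    "social media usage among teenagers",
]
labels = [1, 1, 0, 0]  # 1 = CAZy-compatible, 0 = not

pipeline.fit(texts, labels)
print(pipeline.predict_proba(["a new polysaccharide lyase structure"]))
```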
- Performance:
| Test dataset | Precision | Recall | F1-score | Support |
|---|---|---|---|---|
| CAZyDB- | 0.97 | 0.98 | 0.98 | 1326 |
| CAZyDB+ | 0.81 | 0.74 | 0.77 | 135 |
| Accuracy | | | 0.96 | 1461 |
| Macro average | 0.89 | 0.86 | 0.88 | 1461 |
| Weighted average | 0.96 | 0.96 | 0.96 | 1461 |
- Before choosing this particular architecture, a panel of Natural Language Processing (NLP) methods for text classification was tested on the custom dataset created for the CAZy database, ranging from classical text-representation tools such as TF-IDF and word embeddings (Word2Vec), through unsupervised topic modeling with LDA (Latent Dirichlet Allocation), to state-of-the-art deep learning approaches such as BERT (Bidirectional Encoder Representations from Transformers). All of these approaches were benchmarked on the validation and test datasets and their ROC-AUC curves were compared.
*(Benchmark table: one ROC-AUC curve per tested model; figure images omitted.)*
*A soft-voting classifier combining two models, LDA/Random Forest and TF-IDF/SVM.
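As an illustration of how such a comparison can be produced (a sketch using scikit-learn's `roc_auc_score`; the `models` dict and validation split are placeholders, not the project's benchmarking code):

```python
from sklearn.metrics import roc_auc_score

def compare_roc_auc(models, X_val, y_val):
    """Return each fitted model's ROC-AUC on the validation set."""
    scores = {}
    for name, model in models.items():
        # Rank validation examples by the positive-class probability.
        y_score = model.predict_proba(X_val)[:, 1]
        scores[name] = roc_auc_score(y_val, y_score)
    return scores
```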
- Want more information about the dataset and methods? You're more than welcome to take a more detailed look here.
This project was part of a 2-month internship at the Architecture et Fonction des Macromolécules Biologiques laboratory (AFMB, Marseille, France), hosted within the Glycogenomics team.
First, I would like to thank Dr. Nicolas Terrapon for his patience, precious help and invaluable supervision, not to mention the opportunity he gave me to work on such an interesting project. In addition, I would like to deeply thank Dr. Philippe Ortet for his precious ideas, wonderful insights, and the guidance and expertise that helped me navigate various complex subjects throughout the project. Last but not least, I would like to thank the whole Glycogenomics team for their warm hospitality.
MIT Licensed © Ghassan Dabane, 2021.