Deep Learning vs Traditional ML for Genomic Datasets

Project Overview

Traditional Methods

  1. Pre-processed the data using command-line tools (awk, gzip, etc.)
  2. Performed exploratory data analysis (correlation analysis, t-SNE, truncated SVD, etc.)
  3. Implemented Logistic Regression, K-Nearest Neighbours, Random Forest, AdaBoost, and XGBoost classifiers
  4. Used a weighted loss function, stratified K-fold cross-validation, and appropriate metrics (balanced accuracy, weighted F1, and ROC-AUC) to compare model performance (see the sketch after this list)
  5. Performed hyperparameter optimization of the Random Forest classifier (the best-performing model) using GridSearchCV and RandomizedSearchCV
  6. Used Recursive Feature Elimination with cross-validation (RFECV) to further improve model performance
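
A minimal sketch of the cross-validation and model-comparison workflow described above, assuming the pre-processed variant matrix and binary resistance labels are already loaded as `X` and `y`. The placeholder arrays, parameter ranges, and model set below are illustrative, not the exact ones used in this repository:

```python
# Illustrative sketch: stratified K-fold comparison, hyperparameter search, and RFECV.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate, RandomizedSearchCV
from sklearn.feature_selection import RFECV

# Placeholder data standing in for the pre-processed variant matrix and labels.
X = np.random.rand(200, 50)
y = np.random.randint(0, 2, 200)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
metrics = ["balanced_accuracy", "f1_weighted", "roc_auc"]

# Compare a couple of class-weighted classifiers with stratified K-fold CV.
models = {
    "logreg": LogisticRegression(class_weight="balanced", max_iter=1000),
    "rf": RandomForestClassifier(class_weight="balanced", random_state=0),
}
for name, model in models.items():
    scores = cross_validate(model, X, y, cv=cv, scoring=metrics)
    print(name, {m: round(scores[f"test_{m}"].mean(), 3) for m in metrics})

# Hyperparameter search for the Random Forest (the best-performing model).
search = RandomizedSearchCV(
    RandomForestClassifier(class_weight="balanced", random_state=0),
    param_distributions={
        "n_estimators": list(range(50, 300, 10)),
        "max_depth": list(range(10, 301, 10)),
    },
    n_iter=20, cv=cv, scoring="roc_auc", random_state=0,
)
search.fit(X, y)
print("best params:", search.best_params_)

# Recursive feature elimination with cross-validation on the tuned model.
rfecv = RFECV(search.best_estimator_, cv=cv, scoring="roc_auc")
rfecv.fit(X, y)
print("selected features:", rfecv.n_features_)
```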

Deep Learning using PyTorch

  1. Implemented a deep neural network with three layers, using ELU activation, batch normalization, and dropout
  2. To verify the implementation, trained the model on a single sample; the ability to quickly overfit it serves as a simple correctness check
  3. Developed a Convolutional Neural Network (CNN) architecture and ran experiments to find a suitable number of epochs, filter size, number of filters, and learning rate
  4. Implemented the final CNN architecture with three filter sizes (100 filters each), each branch using ELU activation followed by a MaxPool1d layer; the branch outputs are concatenated and fed to a dense layer with a sigmoid output (see the sketch after this list)
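
A minimal PyTorch sketch of the final multi-branch CNN described above. The sequence length, input encoding, filter sizes, and the use of adaptive max-pooling over the full sequence are illustrative assumptions, not the exact settings used in the notebooks:

```python
# Illustrative sketch of a CNN with three filter sizes (100 filters each),
# ELU activation, max-pooling per branch, and a dense sigmoid output.
import torch
import torch.nn as nn

class MultiFilterCNN(nn.Module):
    def __init__(self, in_channels=1, filter_sizes=(3, 5, 7), n_filters=100):
        super().__init__()
        # One Conv1d branch per filter size, each with 100 filters and ELU activation.
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv1d(in_channels, n_filters, kernel_size=k),
                nn.ELU(),
                nn.AdaptiveMaxPool1d(1),  # max-pool over the sequence dimension
            )
            for k in filter_sizes
        ])
        # Concatenated branch outputs feed a dense layer with a sigmoid output.
        self.classifier = nn.Sequential(
            nn.Linear(n_filters * len(filter_sizes), 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        # x: (batch, in_channels, seq_len)
        pooled = [branch(x).squeeze(-1) for branch in self.branches]  # each (batch, n_filters)
        features = torch.cat(pooled, dim=1)                           # (batch, 3 * n_filters)
        return self.classifier(features)

model = MultiFilterCNN()
dummy = torch.randn(4, 1, 1000)  # batch of 4 illustrative sequences
print(model(dummy).shape)        # torch.Size([4, 1])
```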

Results

The deep learning models performed worse than the traditional ML methods on this tabular dataset (consistent with observations reported elsewhere: https://arxiv.org/abs/2106.03253). The Random Forest classifier with 183 estimators and a maximum depth of 223 achieved a ROC-AUC score of 0.81.
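
For reference, a hypothetical sketch of evaluating the reported best Random Forest configuration (183 estimators, maximum depth 223) with scikit-learn; the data loading, train/test split, and class weighting are assumptions for illustration:

```python
# Illustrative evaluation of the reported best Random Forest configuration.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Placeholder data standing in for the pre-processed variant matrix and labels.
X = np.random.rand(500, 100)
y = np.random.randint(0, 2, 500)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

best_rf = RandomForestClassifier(n_estimators=183, max_depth=223,
                                 class_weight="balanced", random_state=0)
best_rf.fit(X_train, y_train)
print("ROC-AUC:", roc_auc_score(y_test, best_rf.predict_proba(X_test)[:, 1]))
```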

Future work

  1. Use two convolutional layers in the CNN architecture
  2. Use LSTM models so that the model can capture wider context in the genome
  3. Include nucleotide-specific information in the dataset for finer distinctions between mutations in resistant and susceptible samples
  4. Incorporate RNA-seq information in the tabular dataset

Packages used

  1. Command-line tools
  2. sklearn
  3. pytorch
  4. numpy
  5. matplotlib
  6. plotly
  7. seaborn
  8. tqdm

Data

CRyPTIC Consortium: https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.3001721
Dataset: http://ftp.ebi.ac.uk/pub/databases/cryptic/release_june2022/

Sources of Inspiration