- Pre-processed data using command-line tools (awk, gzip, etc.)
- Performed exploratory data analysis (correlation, t-SNE, truncated SVD, etc.); a dimensionality-reduction sketch follows this list
- Implemented Logistic Regression, K-Nearest Neighbours, Random Forest, AdaBoost, and XGBoost classifiers
- Used a weighted loss function, stratified K-fold cross-validation, and imbalance-aware metrics (balanced accuracy, weighted F1, and ROC-AUC) to compare model performance (see the cross-validation sketch after this list)
- Performed hyperparameter optimization of the Random Forest classifier (best model) using GridSearchCV and RandomizedSearchCV (sketched after this list)
- Used Recursive Feature Elimination with cross-validation (RFECV) to improve model performance (sketched after this list)
- Implemented a three-layer deep neural network (ELU activation, batch normalization, and dropout)
- To verify the implementation, trained the network on a single sample; being able to overfit it serves as a quick sanity check (see the network sketch after this list)
- Developed a Convolutional Neural Network (CNN) architecture and ran experiments to find the optimal number of epochs, filter sizes, number of filters, and learning rate
- Implemented the final CNN architecture with three filter sizes (100 filters each), each branch using ELU activation followed by a MaxPool1d layer; the pooled outputs are concatenated and fed to a dense layer with a sigmoid output (see the CNN sketch after this list)
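
The EDA sketch below is a minimal illustration of the dimensionality-reduction step with scikit-learn; `X` and `y` are placeholders for the preprocessed feature matrix and resistance labels, and the component counts are assumptions rather than the values used in the project.

```python
# Minimal EDA sketch: truncated SVD followed by t-SNE for a 2-D visualization.
# X (features) and y (labels) are placeholders for the preprocessed dataset.
import matplotlib.pyplot as plt
from sklearn.decomposition import TruncatedSVD
from sklearn.manifold import TSNE

X_svd = TruncatedSVD(n_components=50, random_state=42).fit_transform(X)   # compress the tabular features
X_tsne = TSNE(n_components=2, random_state=42).fit_transform(X_svd)       # non-linear 2-D embedding

plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y, s=5, cmap="coolwarm")
plt.xlabel("t-SNE 1")
plt.ylabel("t-SNE 2")
plt.show()
```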
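
A minimal sketch of the model-comparison loop, assuming `X`/`y` placeholders; `class_weight="balanced"` stands in for the weighted loss mentioned above, and the exact estimator settings are assumptions.

```python
# Compare classifiers with stratified K-fold CV and imbalance-aware metrics.
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.neighbors import KNeighborsClassifier
from xgboost import XGBClassifier

models = {
    "logistic_regression": LogisticRegression(class_weight="balanced", max_iter=1000),
    "knn": KNeighborsClassifier(),
    "random_forest": RandomForestClassifier(class_weight="balanced", random_state=42),
    "adaboost": AdaBoostClassifier(random_state=42),
    "xgboost": XGBClassifier(eval_metric="logloss"),
}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scoring = ["balanced_accuracy", "f1_weighted", "roc_auc"]

for name, model in models.items():
    scores = cross_validate(model, X, y, cv=cv, scoring=scoring, n_jobs=-1)
    print(name, {metric: round(scores[f"test_{metric}"].mean(), 3) for metric in scoring})
```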
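
A sketch of the hyperparameter search on the Random Forest; the parameter ranges below are illustrative assumptions, not the grids used in the project.

```python
# Coarse randomized search over wide ranges, optionally refined with a grid search.
from scipy.stats import randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, StratifiedKFold

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
rf = RandomForestClassifier(class_weight="balanced", random_state=42)

random_search = RandomizedSearchCV(
    rf,
    param_distributions={"n_estimators": randint(50, 300), "max_depth": randint(5, 300)},
    n_iter=50, scoring="roc_auc", cv=cv, n_jobs=-1, random_state=42,
)
random_search.fit(X, y)

grid_search = GridSearchCV(
    rf,
    param_grid={"n_estimators": [150, 180, 210], "max_depth": [150, 200, 250]},
    scoring="roc_auc", cv=cv, n_jobs=-1,
)
grid_search.fit(X, y)
print(grid_search.best_params_, grid_search.best_score_)
```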
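
A minimal RFECV sketch wrapped around the Random Forest; the `step` size and scoring choice are assumptions.

```python
# Recursive feature elimination with cross-validation (RFECV).
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold

selector = RFECV(
    estimator=RandomForestClassifier(class_weight="balanced", random_state=42),
    step=10,                                    # features removed per iteration (assumed)
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    scoring="roc_auc",
    n_jobs=-1,
)
selector.fit(X, y)
X_selected = selector.transform(X)              # keep only the features RFECV retained
print("features retained:", selector.n_features_)
```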
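
A PyTorch sketch of the feed-forward network and the overfit-a-single-sample sanity check; layer widths, dropout rate, feature count, and learning rate are placeholders (and because BatchNorm1d needs more than one example per batch in training mode, a two-sample batch stands in for the single sample here).

```python
# Three-layer feed-forward network (ELU + BatchNorm + Dropout) and a tiny overfitting check.
import torch
from torch import nn

def make_mlp(n_features, hidden=256, p_drop=0.3):           # sizes are placeholders
    return nn.Sequential(
        nn.Linear(n_features, hidden), nn.BatchNorm1d(hidden), nn.ELU(), nn.Dropout(p_drop),
        nn.Linear(hidden, hidden), nn.BatchNorm1d(hidden), nn.ELU(), nn.Dropout(p_drop),
        nn.Linear(hidden, 1),                                # logit for the binary resistance label
    )

model = make_mlp(n_features=1000)                            # placeholder feature count
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Sanity check: a correct implementation should drive the loss towards zero on a tiny batch.
# (BatchNorm1d requires >1 example per batch in training mode, hence two samples.)
x_tiny = torch.randn(2, 1000)
y_tiny = torch.tensor([[1.0], [0.0]])
for _ in range(200):
    optimizer.zero_grad()
    loss = criterion(model(x_tiny), y_tiny)
    loss.backward()
    optimizer.step()
print("loss after overfitting a tiny batch:", loss.item())
```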
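
A PyTorch sketch of the final multi-branch CNN; the kernel sizes, input channels (one-hot nucleotides assumed), and sequence length are placeholders, while the 100 filters per branch, ELU, max-pooling, concatenation, and sigmoid head follow the description above.

```python
# Multi-branch 1-D CNN: three kernel sizes (100 filters each) -> ELU -> MaxPool1d,
# concatenated and passed to a dense layer with sigmoid output.
import torch
from torch import nn

class MultiFilterCNN(nn.Module):
    def __init__(self, in_channels=4, seq_len=1000, kernel_sizes=(3, 5, 7), n_filters=100):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Sequential(
                nn.Conv1d(in_channels, n_filters, kernel_size=k),
                nn.ELU(),
                nn.MaxPool1d(kernel_size=seq_len - k + 1),    # pool over the full conv output
            )
            for k in kernel_sizes
        )
        self.head = nn.Sequential(
            nn.Linear(n_filters * len(kernel_sizes), 1),
            nn.Sigmoid(),
        )

    def forward(self, x):                                     # x: (batch, in_channels, seq_len)
        pooled = [branch(x).squeeze(-1) for branch in self.branches]   # each (batch, n_filters)
        return self.head(torch.cat(pooled, dim=1))                     # (batch, 1) probabilities

model = MultiFilterCNN()
probs = model(torch.randn(8, 4, 1000))                        # dummy batch of encoded sequences
```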
The deep learning models gave poorer performance than the traditional ML models on this tabular dataset (as reported elsewhere: https://arxiv.org/abs/2106.03253). A Random Forest classifier with 183 estimators and a maximum depth of 223 achieved a ROC-AUC score of 0.81.
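
For reference, a minimal sketch of instantiating the reported best configuration; the train/test split variables, random seed, and class weighting are placeholders rather than values taken from the project.

```python
# Reported best model: Random Forest with 183 trees and max depth 223 (ROC-AUC ~0.81).
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

best_rf = RandomForestClassifier(n_estimators=183, max_depth=223, random_state=42)
best_rf.fit(X_train, y_train)                                 # placeholder data splits
print(roc_auc_score(y_test, best_rf.predict_proba(X_test)[:, 1]))
```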
- Use two convolutional layers in the CNN architecture
- Use LSTM models, whose memory lets the model capture wider context in the genome
- Include nucleotide-specific information in the dataset for finer distinctions between mutations in resistant and susceptible samples
- Incorporate RNA-seq information into the tabular dataset
- Command-line tools
- sklearn
- pytorch
- numpy
- matplotlib
- plotly
- seaborn
- tqdm
CRyPTIC Consortium: https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.3001721
Dataset: http://ftp.ebi.ac.uk/pub/databases/cryptic/release_june2022/
- https://chriskhanhtran.github.io/posts/cnn-sentence-classification/
- https://campus.datacamp.com/courses/intermediate-deep-learning-with-pytorch/images-convolutional-neural-networks?ex=1
- https://app.datacamp.com/learn/courses/introduction-to-deep-learning-with-pytorch
- https://www.kaggle.com/learn/intro-to-deep-learning