- Pre-processed data using command-line tools (awk, gzip, etc.)
- Performed exploratory data analysis (correlation, t-SNE, truncated SVD, etc.); a dimensionality-reduction sketch follows this list
- Implemented Logistic Regression, K-Nearest Neighbours, Random Forest, AdaBoost, and XGBoost classifiers
- Used a weighted loss function, stratified K-fold cross-validation, and imbalance-aware metrics (balanced accuracy, weighted F1, and ROC-AUC) to compare model performance (see the cross-validation sketch after this list)
- Performed hyperparameter optimization of the Random Forest classifier (best model) using GridSearchCV and RandomizedSearchCV (sketched after this list)
- Used Recursive Feature Elimination with cross-validation (RFECV) to improve model performance (sketched after this list)
- Implemented a three-layer deep neural network (ELU activation, batch normalization, and dropout)
- To verify the implementation, trained the network on a single sample; being able to overfit it serves as a quick sanity check (see the network sketch after this list)
- Developed a Convolutional Neural Network (CNN) architecture and ran experiments to find the optimal number of epochs, filter sizes, number of filters, and learning rate
- Implemented the final CNN architecture with three filter sizes (100 filters each), each branch using ELU activation followed by a MaxPool1d layer; the pooled outputs are concatenated and fed to a dense layer with a sigmoid output (see the CNN sketch after this list)
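
The EDA sketch below is a minimal illustration of the dimensionality-reduction step with scikit-learn; `X` and `y` are placeholders for the preprocessed feature matrix and resistance labels, and the component counts are assumptions rather than the values used in the project.

```python
# Minimal EDA sketch: truncated SVD followed by t-SNE for a 2-D visualization.
# X (features) and y (labels) are placeholders for the preprocessed dataset.
import matplotlib.pyplot as plt
from sklearn.decomposition import TruncatedSVD
from sklearn.manifold import TSNE

X_svd = TruncatedSVD(n_components=50, random_state=42).fit_transform(X)   # compress the tabular features
X_tsne = TSNE(n_components=2, random_state=42).fit_transform(X_svd)       # non-linear 2-D embedding

plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y, s=5, cmap="coolwarm")
plt.xlabel("t-SNE 1")
plt.ylabel("t-SNE 2")
plt.show()
```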
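
A minimal sketch of the model-comparison loop, assuming `X`/`y` placeholders; `class_weight="balanced"` stands in for the weighted loss mentioned above, and the exact estimator settings are assumptions.

```python
# Compare classifiers with stratified K-fold CV and imbalance-aware metrics.
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.neighbors import KNeighborsClassifier
from xgboost import XGBClassifier

models = {
    "logistic_regression": LogisticRegression(class_weight="balanced", max_iter=1000),
    "knn": KNeighborsClassifier(),
    "random_forest": RandomForestClassifier(class_weight="balanced", random_state=42),
    "adaboost": AdaBoostClassifier(random_state=42),
    "xgboost": XGBClassifier(eval_metric="logloss"),
}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scoring = ["balanced_accuracy", "f1_weighted", "roc_auc"]

for name, model in models.items():
    scores = cross_validate(model, X, y, cv=cv, scoring=scoring, n_jobs=-1)
    print(name, {metric: round(scores[f"test_{metric}"].mean(), 3) for metric in scoring})
```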
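
A sketch of the hyperparameter search on the Random Forest; the parameter ranges below are illustrative assumptions, not the grids used in the project.

```python
# Coarse randomized search over wide ranges, optionally refined with a grid search.
from scipy.stats import randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, StratifiedKFold

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
rf = RandomForestClassifier(class_weight="balanced", random_state=42)

random_search = RandomizedSearchCV(
    rf,
    param_distributions={"n_estimators": randint(50, 300), "max_depth": randint(5, 300)},
    n_iter=50, scoring="roc_auc", cv=cv, n_jobs=-1, random_state=42,
)
random_search.fit(X, y)

grid_search = GridSearchCV(
    rf,
    param_grid={"n_estimators": [150, 180, 210], "max_depth": [150, 200, 250]},
    scoring="roc_auc", cv=cv, n_jobs=-1,
)
grid_search.fit(X, y)
print(grid_search.best_params_, grid_search.best_score_)
```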
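
A minimal RFECV sketch wrapped around the Random Forest; the `step` size and scoring choice are assumptions.

```python
# Recursive feature elimination with cross-validation (RFECV).
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold

selector = RFECV(
    estimator=RandomForestClassifier(class_weight="balanced", random_state=42),
    step=10,                                    # features removed per iteration (assumed)
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    scoring="roc_auc",
    n_jobs=-1,
)
selector.fit(X, y)
X_selected = selector.transform(X)              # keep only the features RFECV retained
print("features retained:", selector.n_features_)
```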
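
A PyTorch sketch of the feed-forward network and the overfit-a-single-sample sanity check; layer widths, dropout rate, feature count, and learning rate are placeholders (and because BatchNorm1d needs more than one example per batch in training mode, a two-sample batch stands in for the single sample here).

```python
# Three-layer feed-forward network (ELU + BatchNorm + Dropout) and a tiny overfitting check.
import torch
from torch import nn

def make_mlp(n_features, hidden=256, p_drop=0.3):           # sizes are placeholders
    return nn.Sequential(
        nn.Linear(n_features, hidden), nn.BatchNorm1d(hidden), nn.ELU(), nn.Dropout(p_drop),
        nn.Linear(hidden, hidden), nn.BatchNorm1d(hidden), nn.ELU(), nn.Dropout(p_drop),
        nn.Linear(hidden, 1),                                # logit for the binary resistance label
    )

model = make_mlp(n_features=1000)                            # placeholder feature count
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Sanity check: a correct implementation should drive the loss towards zero on a tiny batch.
# (BatchNorm1d requires >1 example per batch in training mode, hence two samples.)
x_tiny = torch.randn(2, 1000)
y_tiny = torch.tensor([[1.0], [0.0]])
for _ in range(200):
    optimizer.zero_grad()
    loss = criterion(model(x_tiny), y_tiny)
    loss.backward()
    optimizer.step()
print("loss after overfitting a tiny batch:", loss.item())
```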
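
A PyTorch sketch of the final multi-branch CNN; the kernel sizes, input channels (one-hot nucleotides assumed), and sequence length are placeholders, while the 100 filters per branch, ELU, max-pooling, concatenation, and sigmoid head follow the description above.

```python
# Multi-branch 1-D CNN: three kernel sizes (100 filters each) -> ELU -> MaxPool1d,
# concatenated and passed to a dense layer with sigmoid output.
import torch
from torch import nn

class MultiFilterCNN(nn.Module):
    def __init__(self, in_channels=4, seq_len=1000, kernel_sizes=(3, 5, 7), n_filters=100):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Sequential(
                nn.Conv1d(in_channels, n_filters, kernel_size=k),
                nn.ELU(),
                nn.MaxPool1d(kernel_size=seq_len - k + 1),    # pool over the full conv output
            )
            for k in kernel_sizes
        )
        self.head = nn.Sequential(
            nn.Linear(n_filters * len(kernel_sizes), 1),
            nn.Sigmoid(),
        )

    def forward(self, x):                                     # x: (batch, in_channels, seq_len)
        pooled = [branch(x).squeeze(-1) for branch in self.branches]   # each (batch, n_filters)
        return self.head(torch.cat(pooled, dim=1))                     # (batch, 1) probabilities

model = MultiFilterCNN()
probs = model(torch.randn(8, 4, 1000))                        # dummy batch of encoded sequences
```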
The deep learning models gave poorer performance than the traditional ML models on this tabular dataset (as reported elsewhere: https://arxiv.org/abs/2106.03253). A Random Forest classifier with 183 estimators and a maximum depth of 223 achieved a ROC-AUC score of 0.81.
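
For reference, a minimal sketch of instantiating the reported best configuration; the train/test split variables, random seed, and class weighting are placeholders rather than values taken from the project.

```python
# Reported best model: Random Forest with 183 trees and max depth 223 (ROC-AUC ~0.81).
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

best_rf = RandomForestClassifier(n_estimators=183, max_depth=223, random_state=42)
best_rf.fit(X_train, y_train)                                 # placeholder data splits
print(roc_auc_score(y_test, best_rf.predict_proba(X_test)[:, 1]))
```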
- Use two convolutional layers in the CNN architecture
- Use LSTM models, whose memory lets the model capture wider context in the genome
- Include nucleotide-specific information in the dataset for finer distinctions between mutations in resistant and susceptible samples
- Incorporate RNA-seq information into the tabular dataset
- Command-line tools
- sklearn
- pytorch
- numpy
- matplotlib
- plotly
- seaborn
- tqdm
CRyPTIC Consortium: https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.3001721
Dataset: http://ftp.ebi.ac.uk/pub/databases/cryptic/release_june2022/
- https://chriskhanhtran.github.io/posts/cnn-sentence-classification/
- https://campus.datacamp.com/courses/intermediate-deep-learning-with-pytorch/images-convolutional-neural-networks?ex=1
- https://app.datacamp.com/learn/courses/introduction-to-deep-learning-with-pytorch
- https://www.kaggle.com/learn/intro-to-deep-learning