This repository has been archived by the owner on Jul 8, 2023. It is now read-only.

Drug Discovery Feature Selections

Replication of the experiments from Rahman Pujianto's master's thesis research (Universitas Indonesia, 2017).

Dataset Preparation

Prepared (trainable) datasets are provided in dataset/dataset.tar.gz. The steps below are additional information on how to prepare the dataset from the raw sources (*.sdf or *.mol2 files).

Required tools: Open Babel (obabel) and PaDEL-Descriptor (run with Java), as used in the commands below.

Extracting positive (label = 1) training data:

  1. Convert sdf to mol2: obabel ../dataset/pubchem-compound-active-hiv1-protease.sdf -O ../dataset/pubchem-compound-active-hiv1-protease_mol2/hiv1-protease.mol2
  2. Convert mol2 to csv: java -jar PaDEL-Descriptor.jar -2d -addhydrogens -removesalt -dir ../dataset/pubchem-compound-active-hiv1-protease_mol2/ -file ../dataset/pubchem-compound-active-hiv1-protease.csv

Extracting negative (label = 0) training data:

  1. Convert sdf to mol2: obabel ../dataset/decoys_final.sdf -O ../dataset/decoys_final_mol2/decoys_final.mol2
  2. Convert mol2 to csv: java -jar PaDEL-Descriptor.jar -2d -addhydrogens -removesalt -dir ../dataset/decoys_final_mol2/ -file ../dataset/decoys.csv

Extracting test data (unlabeled):

  1. Convert mol2 to csv: java -jar PaDEL-Descriptor.jar -2d -addhydrogens -removesalt -dir ../dataset/HerbalDB_mol2/ -file ../dataset/HerbalDB.csv
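The descriptor CSVs produced above carry no class label yet, so the active and decoy files have to be labeled (1 and 0) and merged before training. The following is a minimal sketch of that step using only the standard library. It is not the repository's 01-prepare-data.py; the helper names, the tiny in-memory CSVs, and the assumption that the first column is called Name are all illustrative.

```python
import csv
import io


def label_rows(csv_text, label):
    """Parse descriptor CSV text and attach a 'label' field to every row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [dict(row, label=label) for row in reader]


def merge_labeled(active_csv, decoy_csv):
    """Actives get label 1, decoys get label 0, as in the section above."""
    return label_rows(active_csv, 1) + label_rows(decoy_csv, 0)


# Hypothetical stand-ins for pubchem-compound-active-hiv1-protease.csv
# and decoys.csv (real files contain many more descriptor columns).
ACTIVES = "Name,nAtomP,nsOH\nmol_a,3,1\nmol_b,5,2\n"
DECOYS = "Name,nAtomP,nsOH\ndecoy_a,2,0\n"

dataset = merge_labeled(ACTIVES, DECOYS)
for row in dataset:
    print(row["Name"], row["label"])
```

For the real files, the same functions could be fed the text read from each CSV on disk; values stay strings here, so they would still need converting to numbers before training.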

Dataset Description

Datasets provided in this repo:

  1. dataset/dataset.tar.gz:
    1. dataset.csv: 3,665 HIV-1 protease inhibitors from PubChem BioAssay + 3,665 HIV-1 protease decoys from DUD-E (Mysinger, Carchia, Irwin, & Shoichet, 2012)
    2. dataset_test.csv: the top 10 protease inhibitors from the Indonesian herbal database (Yanuar et al., 2014)
  2. dataset/daftar-senyawa-beserta-binding-energy.csv: docking results of 368 molecules from the Indonesian herbal database (Yanuar et al., 2014) that the machine learning model in this research predicted to be HIV-1 protease inhibitors

Raw datasets (*.sdf and *.mol2) can be downloaded at https://drive.google.com/open?id=1X_wkpvSLXXXUPbxmFd7tE5pe0t_njMe_

Experiments

Dependency:

  • Python 3.x
  • Python3-tk (on Ubuntu: sudo apt install python3-tk)
  • Virtualenv (optional, for an isolated environment)

Dependency library installation: pip install -r requirements.txt

Steps:

  1. Extract the preprocessed data from dataset/dataset.tar.gz (if you have raw csv data, use python 01-prepare-data.py instead)
  2. Feature selection with SVM-RFE: python 02-feature-selection-svm-rfe.py
  3. Feature selection with the Wrapper Method (GA + SVM): python 02-feature-selection-wm.py
  4. Evaluate the selected features on the PubChem dataset: python 03-evaluate-1.py
  5. Evaluate the selected features on the Indonesian herbal dataset: python 03-evaluate-2.py

The evaluation scripts print accuracy scores to the console, save the raw results to csv files, and display the result chart(s) on screen.

PubChem Dataset Analysis

PubChem dataset visualizations using t-SNE. Generated by running python visualize-dataset.py:

PubChem t-SNE perplexity=5

PubChem t-SNE perplexity=100

Top 10 PubChem features by importance (using Extra Trees):

  1. feature 520 (maxsOH): 0.08817
  2. feature 401 (minsOH): 0.06929
  3. feature 282 (SsOH): 0.02738
  4. feature 110 (nHsOH): 0.02464
  5. feature 163 (nsOH): 0.0211
  6. feature 35 (BCUTw-1l): 0.01432
  7. feature 467 (maxHsOH): 0.01308
  8. feature 406 (minsOm): 0.01307
  9. feature 588 (nAtomP): 0.01254
  10. feature 142 (nsssCH): 0.01231

PubChem Extra Trees feature importance

SVM-RFE also shows that a single feature of the PubChem dataset already gives > 80% accuracy. Generated by running python 02-feature-selection-svm-rfe.py:

PubChem Linear SVM + RFE Accuracy per feature set
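To illustrate how an accuracy-per-feature-count curve like this can be produced, here is a minimal sketch of recursive feature elimination with a linear SVM. It assumes scikit-learn and NumPy are installed and uses a synthetic dataset from make_classification rather than the PubChem data. The parameters (10 features, 5-fold cross-validation, the feature counts tried) are illustrative choices, not those of 02-feature-selection-svm-rfe.py.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in for the descriptor matrix (assumption: 10 features).
X, y = make_classification(
    n_samples=200, n_features=10, n_informative=3,
    n_redundant=0, random_state=0,
)
X = StandardScaler().fit_transform(X)

# RFE repeatedly fits the SVM and drops the feature with the smallest
# weight. Eliminating down to 1 feature gives a full ranking (1 = best).
rfe = RFE(
    estimator=LinearSVC(dual=False, max_iter=5000),
    n_features_to_select=1,
    step=1,
).fit(X, y)
order = np.argsort(rfe.ranking_)  # feature indices, best first

# Cross-validated accuracy using only the top-k ranked features.
# Note: ranking on the full data before cross-validating is optimistic;
# it is kept here only to keep the sketch short.
accuracy_per_k = {}
for k in (1, 3, 10):
    scores = cross_val_score(
        LinearSVC(dual=False, max_iter=5000), X[:, order[:k]], y, cv=5
    )
    accuracy_per_k[k] = scores.mean()

for k, acc in accuracy_per_k.items():
    print(f"top {k} feature(s): accuracy {acc:.3f}")
```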

Comparisons between Linear SVM (no feature selection), Linear SVM + RFE & SVM + Wrapper Method (WM) classification metrics on PubChem dataset. Generated by running python 03-evaluate-1.py:

Receiver Operating Characteristic (ROC) Curves

Classification Accuracy, Sensitivity, Precision and Specificity

SVM + RFE Accuracy: 0.9898
SVM + WM Accuracy: 0.9883
SVM Accuracy: 0.9894
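For reference, accuracy, sensitivity, precision, and specificity can all be derived from the four confusion-matrix counts. The sketch below uses plain Python with made-up labels; the function names are hypothetical and this is not the code in 03-evaluate-1.py.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels where 1 = inhibitor."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def classification_metrics(y_true, y_pred):
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)

    def ratio(numerator, denominator):
        # Report 0.0 instead of dividing by zero when a class is absent.
        return numerator / denominator if denominator else 0.0

    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "sensitivity": ratio(tp, tp + fn),  # true positive rate (recall)
        "precision": ratio(tp, tp + fp),
        "specificity": ratio(tn, tn + fp),  # true negative rate
    }


# Made-up example: 4 actives and 6 decoys.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```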

HerbalDB Dataset Analysis

Evaluation using selected features on top 10 herbal data (Yanuar et al., 2014). Generated by running python 03-evaluate-2.py:

Classification Accuracy, Sensitivity, Precision and Specificity

SVM + RFE Accuracy: 0.5000
SVM + WM Accuracy: 0.6000
SVM Accuracy: 0.5000

K-Means silhouette analysis. Generated by running python 04-evaluate_herbaldb.py:

Silhouette Analysis

For n_clusters = 2 The average silhouette_score is : 0.9359476619168386
For n_clusters = 3 The average silhouette_score is : 0.5895330105970298
For n_clusters = 4 The average silhouette_score is : 0.5863432755171203
For n_clusters = 5 The average silhouette_score is : 0.5664882163184368

The higher the silhouette_score, the better the corresponding choice of n_clusters.
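As a reminder of what the score measures: for each point, a is the mean distance to the other members of its own cluster, b is the smallest mean distance to any other cluster, and the point's silhouette is (b - a) / max(a, b). The average over all points is the silhouette score. The sketch below is a plain-Python illustration on toy one-dimensional data (requires Python 3.8+ for math.dist); it is not the repository's script, and the function name is only chosen to mirror the metric.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette over all points; needs at least two clusters."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    total = 0.0
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # a singleton cluster contributes a silhouette of 0
        # The distance from a point to itself is 0, so dividing the sum
        # over the whole cluster by (size - 1) gives the mean over others.
        a = sum(math.dist(point, other) for other in own) / (len(own) - 1)
        b = min(
            sum(math.dist(point, other) for other in members) / len(members)
            for other_label, members in clusters.items()
            if other_label != label
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


points = [(0.0,), (1.0,), (10.0,), (11.0,)]
good = silhouette_score(points, [0, 0, 1, 1])  # tight, well separated
bad = silhouette_score(points, [0, 1, 0, 1])   # clusters interleaved
print(f"good: {good:.4f}, bad: {bad:.4f}")
```

A well-separated assignment scores close to 1, while an assignment that mixes the two groups scores below 0.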

K-Means clustering accuracy on the top 10 herbal data (Yanuar et al., 2014). Generated by running python 04-clustering_herbaldb.py:

K-Means accuracy on top 10 data (Yanuar et al., 2014): 0.30
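K-Means cluster ids are arbitrary, so a clustering "accuracy" depends on how cluster ids are matched to class labels, and this README does not state which convention the script uses. The sketch below contrasts two common conventions on made-up data: comparing ids to labels directly, and taking the best one-to-one mapping of clusters to classes. Both function names are hypothetical.

```python
from itertools import permutations


def direct_accuracy(y_true, cluster_ids):
    """Treat cluster ids as if they were class labels."""
    matches = sum(1 for t, c in zip(y_true, cluster_ids) if t == c)
    return matches / len(y_true)


def best_match_accuracy(y_true, cluster_ids):
    """Accuracy under the best one-to-one mapping of clusters to classes."""
    classes = sorted(set(y_true))
    clusters = sorted(set(cluster_ids))
    if len(clusters) > len(classes):
        raise ValueError("more clusters than classes; no one-to-one mapping")
    best = 0.0
    for assigned in permutations(classes, len(clusters)):
        mapping = dict(zip(clusters, assigned))
        mapped = [mapping[c] for c in cluster_ids]
        best = max(best, direct_accuracy(y_true, mapped))
    return best


# Made-up example: the clustering separates the classes well,
# but its ids happen to be swapped relative to the labels.
y_true = [1, 1, 1, 0, 0]
cluster_ids = [0, 0, 1, 1, 1]
direct = direct_accuracy(y_true, cluster_ids)
best = best_match_accuracy(y_true, cluster_ids)
print(f"direct: {direct:.2f}, best match: {best:.2f}")
```

The brute-force search over permutations is only practical for a handful of clusters, which is enough for the small n_clusters values considered here.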

Citation

@mastersthesis{pujianto2017thesis,
	author={Rahman {Pujianto}},
	title={Drug Candidates Virtual Screening on Indonesian Herbal Plants Database using Machine Learning and Various Feature Selection Strategies},
	school={Universitas Indonesia},
	year={2017},
}