This Python package gathers the functions and classes used to study resistance mutations, and to search for potentially resistance-associated mutations in HIV-1 reverse transcriptase sequences, using machine learning methods.
You can read about this project here and here
This module is divided into 5 submodules:
This submodule contains functions that return different subsets of drug resistance mutations (DRMs), e.g. NRTI DRMs, NNRTI DRMs, accessory DRMs or SDRMs. Each function returns a list of the selected DRMs.
This submodule contains functions and classes to pre-process the encoded dataset before model training. You can remove features corresponding to known DRMs, remove sequences that have DRMs, balance target classes by sub-sampling or over-sampling, or create cross-validation folds.
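As an illustration of balancing target classes by sub-sampling, here is a minimal sketch using only the standard library. The function name `subsample_balance` and the label values are hypothetical and are not taken from this package; the package's own implementation may differ.

```python
import random
from collections import defaultdict


def subsample_balance(labels, seed=0):
    """Return sorted sample indices so that each class keeps as many
    samples as the rarest class (hypothetical helper, not the package API)."""
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Every class is reduced to the size of the smallest class.
    n_min = min(len(indices) for indices in by_class.values())

    kept = []
    for indices in by_class.values():
        kept.extend(rng.sample(indices, n_min))
    return sorted(kept)


# Toy example: 6 treated sequences and 2 naive sequences.
labels = ["treated"] * 6 + ["naive"] * 2
kept = subsample_balance(labels)
balanced = [labels[i] for i in kept]
```

After sub-sampling, `balanced` contains the same number of samples for each class; which majority-class samples are kept depends on the seed.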
This submodule contains functions and classes for the classifiers used during the study, including custom classifiers based on Fisher exact tests. It provides functions to train classifiers, get their predictions and extract their coefficients / weights.
This submodule contains functions useful for the generation and selection of the best hyper-parameter set via random search.
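The general idea of random search can be sketched as follows: draw a number of hyper-parameter sets at random from a search space, score each one, and keep the best. The names `sample_hyperparameters` and `select_best`, the search space and the toy scoring function are all assumptions made for this example; they are not the package's actual functions.

```python
import random


def sample_hyperparameters(search_space, n_sets, seed=0):
    """Draw n_sets random hyper-parameter sets from a dict mapping
    each parameter name to a list of candidate values."""
    rng = random.Random(seed)
    return [
        {name: rng.choice(values) for name, values in search_space.items()}
        for _ in range(n_sets)
    ]


def select_best(candidates, score_fn):
    """Return the candidate set with the highest score."""
    return max(candidates, key=score_fn)


# Hypothetical search space for a regularized linear classifier.
search_space = {"C": [0.01, 0.1, 1.0, 10.0], "penalty": ["l1", "l2"]}
candidates = sample_hyperparameters(search_space, n_sets=5)

# Toy scoring function standing in for a cross-validated performance metric.
best = select_best(candidates, score_fn=lambda params: -abs(params["C"] - 1.0))
```

In practice the scoring function would train a classifier with each candidate set and return a validation score.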
This submodule contains a set of custom performance metrics that we devised to take into account class imbalance and the differing importance given to false positives (more important) and false negatives (less important).
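The custom metrics themselves are not reproduced here. To illustrate the idea of penalizing false positives more than false negatives, the sketch below uses the standard F-beta score with beta < 1, which weights precision more than recall. This is a stand-in technique chosen for the example, not one of the package's metrics, and the function name `f_beta` is an assumption.

```python
def f_beta(tp, fp, fn, beta=0.5):
    """Standard F-beta score computed from confusion-matrix counts.
    With beta < 1, precision (hurt by false positives) weighs more
    than recall (hurt by false negatives)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Same number of errors, distributed differently.
few_false_positives = f_beta(tp=8, fp=2, fn=8, beta=0.5)
many_false_positives = f_beta(tp=8, fp=8, fn=2, beta=0.5)
```

With beta = 0.5, the classifier making 8 false positives scores lower than the one making 8 false negatives, even though both make 10 errors in total. The score does not use true negatives, so a large negative class does not inflate it.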
Additionally, two useful scripts are present.
This script computes p-values for Fisher exact tests comparing the prevalence of mutations with respect to a binary trait, such as RTI treatment status or the presence/absence of any DRM. It outputs a table with one row per considered mutation, containing the raw p-value as well as p-values corrected for multiple testing with the Bonferroni, Benjamini-Hochberg or Benjamini-Yekutieli methods. This script was used to generate the table utils_hiv/data/fisher_p_values.tsv
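The computation described above can be sketched with the standard library alone: a two-sided Fisher exact test on a 2x2 table (mutation present/absent versus trait present/absent), followed by the Bonferroni and Benjamini-Hochberg corrections. The function names are hypothetical and this is not the script's code; the Benjamini-Yekutieli correction is omitted for brevity.

```python
from math import factorial


def n_choose_k(n, k):
    """Binomial coefficient, zero outside the valid range."""
    if k < 0 or k > n:
        return 0
    return factorial(n) // (factorial(k) * factorial(n - k))


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the table [[a, b], [c, d]]:
    sum the probabilities of all tables with the same margins that are
    no more likely than the observed one."""
    row1, row2, col1 = a + b, c + d, a + c

    def weight(x):
        # Unnormalized hypergeometric probability of seeing x in the top-left cell.
        return n_choose_k(row1, x) * n_choose_k(row2, col1 - x)

    observed = weight(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    extreme = sum(
        weight(x) for x in range(lowest, highest + 1) if weight(x) <= observed
    )
    return extreme / n_choose_k(row1 + row2, col1)


def bonferroni(pvalues):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Toy table: the mutation is seen in 3/3 treated and 0/3 naive sequences.
p_value = fisher_exact_two_sided(3, 0, 0, 3)  # 0.1

raw = [0.01, 0.04, 0.03, 0.5]
p_bonferroni = bonferroni(raw)
p_bh = benjamini_hochberg(raw)
```

The dependencies listed at the end of this document include scipy and statsmodels, which provide `scipy.stats.fisher_exact` and `statsmodels.stats.multitest.multipletests` for the same purposes; whether the script relies on them is an assumption.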
This script creates the one-hot encoded dataset from HIVDB files and an additional metadata file.
To run this script you need the PrettyRTAA_naive.tsv and PrettyRTAA_treated.tsv files, generated by submitting the naive.fa and treated.fa FASTA alignments to the HIVDB sequence program. This program also outputs ResistanceSummary_naive.tsv and ResistanceSummary_treated.tsv, which the script needs as well.
The script also lets you specify starting and ending positions.
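For readers unfamiliar with one-hot encoding of sequences, the sketch below shows the principle on toy aligned amino-acid strings: each (position, amino acid) pair observed in the data becomes a binary feature. The function name `one_hot_encode`, the feature naming scheme and the input format are assumptions made for this example; the script's actual input is the HIVDB output files described above, and its encoding may differ.

```python
def one_hot_encode(sequences):
    """One-hot encode aligned sequences of equal length.

    sequences: dict mapping a sequence id to an amino-acid string.
    Returns (feature_names, rows), where rows maps each sequence id
    to a list of 0/1 values, one per feature."""
    # Collect every (position, amino acid) pair seen in the data (1-based positions).
    features = sorted(
        {(pos, aa) for seq in sequences.values() for pos, aa in enumerate(seq, start=1)}
    )
    feature_names = [f"{pos}_{aa}" for pos, aa in features]

    rows = {
        seq_id: [1 if seq[pos - 1] == aa else 0 for pos, aa in features]
        for seq_id, seq in sequences.items()
    }
    return feature_names, rows


# Toy alignment of two 2-residue sequences.
feature_names, rows = one_hot_encode({"seq1": "MK", "seq2": "MR"})
```

Here `feature_names` is `["1_M", "2_K", "2_R"]`, and `rows` is `{"seq1": [1, 1, 0], "seq2": [1, 0, 1]}`.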
These files are in utils_hiv/data and are used by the submodules.
NRTI.tab and NNRTI.tab are local copies of HIVDB files (1, 2).
mutation_characteristic.tab is used by the DRM_utils submodule and contains known DRMs with their type (NRTI, NNRTI, Other) and their SDRM status. It was obtained through the HIVDB program and hand-curated. The accessory/primary role of each mutation was determined from the HIVDB program comments.
This file contains the reference sequences for the main HIV-1 subtypes present in our datasets. These sequences were obtained from the Los Alamos HIV sequence database and are used to determine which features to remove when encoding sequences.
This file contains the results of Fisher exact tests for all mutations in the datasets with respect to treatment status or DRM presence/absence, with raw p-values and p-values corrected for multiple testing. These p-values are used to build our "Fisher classifiers".
This module depends on the following Python version and packages:
- python 3.7.6
- pandas 0.25.3
- scikit-learn 0.20.3
- biopython 1.74
- statsmodels 0.9.0
- category_encoders 1.3.0
- scipy 1.4.1
- numpy 1.18.1