HIV Project util functions

This Python package gathers the functions and classes used to study resistance mutations and to search for potentially resistance-associated mutations in HIV-1 reverse transcriptase sequences with machine learning methods.
You can read about this project here and here

Module description

This package is split into five submodules.

DRM utils

This submodule contains the functions that return different subsets of DRMs (drug resistance mutations), e.g. NRTI DRMs, NNRTI DRMs, accessory DRMs or SDRMs. Each function returns a list of the selected DRMs.

data utils

This submodule contains functions and classes to pre-process the encoded dataset before model training. You can remove features corresponding to known DRMs, remove sequences that have DRMs, balance target classes by sub-sampling or over-sampling, or create cross-validation folds.
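As a rough illustration of the balancing step, sub-sampling could look like the sketch below. This is not the package's API: the function name `subsample_balance`, its parameters and the toy labels are all hypothetical, and only the standard library is used.

```python
import random
from collections import defaultdict


def subsample_balance(labels, seed=0):
    """Return sorted sample indices in which every class keeps as many
    samples as the rarest class (random sub-sampling of larger classes).

    Hypothetical helper, written for illustration only.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    # Size of the smallest class: every class is cut down to this size.
    n_min = min(len(indices) for indices in by_class.values())
    kept = []
    for indices in by_class.values():
        kept.extend(rng.sample(indices, n_min))
    return sorted(kept)


# Toy target vector: 6 treated sequences and 2 naive ones.
labels = ["treated"] * 6 + ["naive"] * 2
kept = subsample_balance(labels)
balanced = [labels[i] for i in kept]
```

Fixing the seed keeps the selection reproducible, which matters when the same balanced subset has to be reused across cross-validation folds.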

learning utils

This submodule contains functions and classes for working with the classifiers used in the study, including custom classifiers based on Fisher exact tests. It provides functions to train classifiers, get their predictions, and extract their coefficients or weights.

param utils

This submodule contains functions to generate hyper-parameter sets and select the best one via random search.
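For readers unfamiliar with random search, here is a minimal sketch of the idea using only the standard library. The function names `sample_param_sets` and `select_best`, the search space and the toy scoring function are assumptions made for this example; they are not taken from the package.

```python
import random


def sample_param_sets(param_space, n_sets, seed=0):
    """Draw n_sets random hyper-parameter combinations from param_space,
    a dict mapping each parameter name to a list of candidate values."""
    rng = random.Random(seed)
    return [
        {name: rng.choice(values) for name, values in param_space.items()}
        for _ in range(n_sets)
    ]


def select_best(param_sets, score_fn):
    """Return the parameter set with the highest score."""
    return max(param_sets, key=score_fn)


# Hypothetical search space; the lambda below is a toy score standing in
# for a cross-validated evaluation of a model trained with each set.
space = {"C": [0.01, 0.1, 1.0, 10.0], "penalty": ["l1", "l2"]}
candidates = sample_param_sets(space, n_sets=5)
best = select_best(candidates, score_fn=lambda p: -abs(p["C"] - 1.0))
```

In a real run the scoring function would train and evaluate a classifier for each candidate, so the number of sampled sets directly controls the compute budget.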

metrics

This submodule contains a set of custom performance metrics that we devised to account for class imbalance and for the differing importance given to false positives (more important) and false negatives (less important).
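The custom metrics themselves are not reproduced here. To illustrate the general idea of penalizing false positives more than false negatives, the sketch below uses the standard F-beta score with beta < 1, which is a swapped-in textbook metric and not one of the package's metrics. It ignores true negatives, so a large negative class does not inflate the score.

```python
def f_beta(tp, fp, fn, beta=0.5):
    """F-beta score computed from confusion-matrix counts.

    With beta < 1, precision weighs more than recall, so a false positive
    lowers the score more than a false negative does.
    """
    denom = (1 + beta**2) * tp + beta**2 * fn + fp
    return (1 + beta**2) * tp / denom if denom else 0.0


# Same number of errors, but of different kinds.
score_with_false_positives = f_beta(tp=8, fp=2, fn=0)
score_with_false_negatives = f_beta(tp=8, fp=0, fn=2)
```

With these counts the classifier that makes two false positives scores lower than the one that makes two false negatives, which is the asymmetry described above.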

independent scripts

Two standalone scripts are also included.

compute_fisher_values.py

This script computes p-values for Fisher exact tests comparing the prevalence of mutations with respect to a binary trait such as RTI treatment status or the presence/absence of any DRM. It outputs a table with one row per mutation, containing the raw p-value as well as p-values corrected for multiple testing with the Bonferroni, Benjamini-Hochberg or Benjamini-Yekutieli methods. This script was used to generate the table utils_hiv/data/fisher_p_values.tsv.
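The computation described above can be sketched in pure Python as follows. This is an illustration written for this README, not the script's code: the function names and the example counts are made up. In practice `scipy.stats.fisher_exact` and `statsmodels.stats.multitest.multipletests` (methods `bonferroni`, `fdr_bh`, `fdr_by`) provide the same tests and corrections.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of a table whose top-left cell is x.
        return comb(row1, x) * comb(row2, col1 - x) / total

    observed = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one.
    p = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= observed * (1 + 1e-9))
    return min(1.0, p)


def bonferroni(pvalues):
    """Bonferroni correction: multiply each p-value by the number of tests."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up counts for one mutation: rows = mutation present / absent,
# columns = treated / naive sequences.
p_raw = fisher_exact_two_sided(3, 0, 0, 3)

# Made-up raw p-values for three mutations.
raw = [0.01, 0.04, 0.03]
p_bonferroni = bonferroni(raw)
p_bh = benjamini_hochberg(raw)
```

Bonferroni is the most conservative of the corrections; Benjamini-Hochberg controls the false discovery rate instead, so its adjusted p-values are never larger than the Bonferroni ones.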

data_encoder.py

This script creates the one-hot encoded dataset from HIVDB files and an additional metadata file.
To run it, you need PrettyRTAA_naive.tsv and PrettyRTAA_treated.tsv, which are generated by submitting the naive.fa and treated.fa FASTA alignments to the HIVDB sequence program. The HIVDB program also outputs ResistanceSummary_naive.tsv and ResistanceSummary_treated.tsv, which the script needs as well.
You can specify starting and ending positions when running the script.
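To make the encoding concrete, the sketch below one-hot encodes a few toy aligned sequences between a start and an end position. It does not read HIVDB files, and the function name `one_hot_encode`, the `position_aminoacid` feature naming and the toy sequences are assumptions made for this example rather than the script's actual behavior.

```python
def one_hot_encode(sequences, start=1, end=None):
    """One-hot encode aligned amino-acid sequences between the 1-based
    positions start and end (inclusive).

    Returns (feature_names, rows), with one 0/1 row per sequence and one
    column per (position, amino acid) pair observed in the input.
    """
    end = end or len(sequences[0])
    features = sorted(
        {(pos, seq[pos - 1]) for seq in sequences for pos in range(start, end + 1)}
    )
    names = [f"{pos}_{aa}" for pos, aa in features]
    rows = [
        [1 if seq[pos - 1] == aa else 0 for pos, aa in features]
        for seq in sequences
    ]
    return names, rows


# Three toy sequences, encoded on positions 2 to 3 only.
names, rows = one_hot_encode(["MKV", "MRV", "MKI"], start=2, end=3)
```

Each row contains exactly one 1 per encoded position, so restricting the start and end positions directly limits the number of feature columns.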

data files

These files are in utils_hiv/data and are used by submodules.

DRM files

NRTI.tab and NNRTI.tab are local copies of HIVDB files (1, 2).
mutation_characteristic.tab is used by the DRM_utils submodule and contains known DRMs with their type (NRTI, NNRTI, Other) and their SDRM status. It was obtained through the HIVDB program and hand-curated. The accessory/primary role of each mutation was determined from the HIVDB program comments.

consensus.fa

This file contains the reference sequences for the main HIV-1 subtypes present in our datasets. These sequences were obtained from the Los Alamos HIV sequence database and are used to determine which features to remove when encoding sequences.

fisher_p_values.tsv

This file contains the results of Fisher exact tests for all mutations in the datasets with respect to treatment status or DRM presence/absence, with raw and multiple-testing-corrected p-values. These p-values are used to build our "Fisher classifiers".

dependencies

This module depends on Python and the following Python packages:

python 3.7.6
pandas 0.25.3
scikit-learn 0.20.3
biopython 1.74
statsmodels 0.9.0
category_encoders 1.3.0
scipy 1.4.1
numpy 1.18.1
