A-robust-method-for-detecting-positive-selection-on-regulatory-sequences-

We developed a method to detect positive selection on transcription factor binding sites (TFBSs) based on changes in binding affinity. The method compares the binding affinity change observed during evolution to a null distribution. The effect of substitutions on binding affinity can be accurately predicted by deltaSVM (Lee et al. 2015), a machine learning based method that predicts the effects of regulatory variants de novo from sequence.

Procedure for detecting positive selection

1). Training of the gapped k-mer support vector machine (gkm-SVM)

First, we defined a positive training set and a corresponding negative training set. The positive training set consists of ChIP-seq narrow peaks of a transcription factor. The negative training set is an equal number of sequences randomly sampled from the genome, matched to the positive training set in length, GC content and repeat fraction. The negative training set was generated with “genNullSeqs”, a function of the gkmSVM R package (Ghandi et al. 2016). We then trained a gkm-SVM with default parameters except -l=10 (meaning that 10-mers are used as features to distinguish the positive and negative training sets). The classification performance of the trained gkm-SVM was measured with receiver operating characteristic (ROC) curves under fivefold cross-validation. Both training and cross-validation were done with the “gkmtrain” function of “LS-GKM: a new gkm-SVM software for large-scale datasets” (Lee 2016). For details, please check https://github.com/Dongwon-Lee/lsgkm.
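To make the cross-validation measure concrete, here is a minimal Python sketch of how an area under the ROC curve (AUC) could be computed from SVM scores and labels, averaged over held-out folds. It is an illustration only and is not part of this repository; the function names `roc_auc` and `mean_cv_auc`, and the way scores, labels and fold assignments are passed in, are assumptions.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Labels equal to 1 are positives; anything else is a negative.
    Ties between a positive and a negative count as half a correct pair.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y != 1]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def mean_cv_auc(scores, labels, folds):
    """Average the AUC over the held-out folds of a cross-validation run."""
    aucs = []
    for fold in sorted(set(folds)):
        idx = [i for i, f in enumerate(folds) if f == fold]
        aucs.append(roc_auc([scores[i] for i in idx],
                            [labels[i] for i in idx]))
    return sum(aucs) / len(aucs)
```

The pairwise loop is quadratic in the number of sequences, which is fine for a sketch; a rank-based formula would be preferable for full ChIP-seq peak sets.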

2). Generate SVM weights of all possible 10-mers based on the trained gkm-SVM

The SVM weights of all possible 10-mers were generated by using the “gkmpredict” function of “LS-GKM”.
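Scoring every 10-mer requires a sequence file that lists them all. The following Python sketch shows one way such a FASTA file could be produced; it is not a script from this repository, and the function names and the choice to use the k-mer itself as the FASTA header are assumptions.

```python
from itertools import product


def all_kmers(k, alphabet="ACGT"):
    """Yield every k-mer over the alphabet in lexicographic order."""
    for letters in product(alphabet, repeat=k):
        yield "".join(letters)


def write_kmer_fasta(path, k):
    """Write all k-mers to a FASTA file and return how many were written."""
    count = 0
    with open(path, "w") as handle:
        for kmer in all_kmers(k):
            # Use the k-mer as its own identifier so scores map back easily.
            handle.write(f">{kmer}\n{kmer}\n")
            count += 1
    return count
```

For k = 10 this writes 4^10 = 1,048,576 sequences. The resulting file would then be scored against the trained model with "gkmpredict"; see the LS-GKM documentation for the exact command line.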

3). Infer ancestor sequence

The ancestor sequence was inferred from sequence alignment with a sister species and an outgroup.
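As an illustration of how a sister species and an outgroup can resolve the ancestral state, the sketch below applies a simple per-column parsimony rule to three aligned sequences. This rule is an assumption made for the example and is not necessarily the inference procedure used in the paper; the function name `infer_ancestor` is hypothetical.

```python
def infer_ancestor(focal, sister, outgroup, unknown="N"):
    """Infer the focal/sister ancestor column by column using parsimony.

    - If focal and sister agree, the ancestor takes that character.
    - If they differ and the outgroup matches one of them, the ancestor
      takes the outgroup character.
    - Otherwise the column is marked as unknown.
    Gap characters are compared like any other character.
    """
    if not (len(focal) == len(sister) == len(outgroup)):
        raise ValueError("aligned sequences must have equal length")
    ancestor = []
    for f, s, o in zip(focal.upper(), sister.upper(), outgroup.upper()):
        if f == s:
            ancestor.append(f)
        elif o == f or o == s:
            ancestor.append(o)
        else:
            ancestor.append(unknown)
    return "".join(ancestor)
```

Positions where the focal sequence differs from the inferred ancestor are the substitutions evaluated in the next step.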

4). Infer positive selection

Given the SVM weights of all possible 10-mers and both the ancestral and focal sequences, we inferred signals of positive selection with "testPosSelec.pl". This script is in the "scripts" folder and was modified from "deltasvm.pl", a script contributed by Lee et al. (2015) that calculates deltaSVM scores.
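The logic of comparing an observed affinity change to a null distribution can be sketched in Python as follows. This is not a port of "testPosSelec.pl". It assumes that the null is built by placing the same number of substitutions at random positions of the ancestral sequence, and that a k-mer missing from the weight table is looked up by its reverse complement; all function names, the `n_perm` and `seed` parameters, and the toy weights in the usage note are hypothetical.

```python
import random

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def seq_score(seq, weights, k):
    """Sum the SVM weights of every k-mer in the sequence."""
    total = 0.0
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in weights:
            total += weights[kmer]
        else:
            # Assumption: the table may hold only one strand of each k-mer.
            total += weights.get(revcomp(kmer), 0.0)
    return total


def delta_svm(ancestor, focal, weights, k):
    """Score change from ancestor to focal; unchanged k-mers cancel out."""
    return seq_score(focal, weights, k) - seq_score(ancestor, weights, k)


def random_mutant(ancestor, n_subs, rng):
    """Copy of the ancestor with n_subs random substitutions."""
    seq = list(ancestor)
    for pos in rng.sample(range(len(seq)), n_subs):
        seq[pos] = rng.choice([b for b in "ACGT" if b != seq[pos]])
    return "".join(seq)


def selection_pvalue(ancestor, focal, weights, k, n_perm=1000, seed=0):
    """Return (observed deltaSVM, one-sided p-value against a random null)."""
    n_subs = sum(a != f for a, f in zip(ancestor, focal))
    observed = delta_svm(ancestor, focal, weights, k)
    rng = random.Random(seed)
    null = [delta_svm(ancestor, random_mutant(ancestor, n_subs, rng), weights, k)
            for _ in range(n_perm)]
    # Fraction of random outcomes at least as large as the observed change.
    p_value = sum(d >= observed for d in null) / n_perm
    return observed, p_value
```

With toy weights such as `{"AA": 1.0, "CC": -1.0}` and `k=2`, calling `selection_pvalue("CCCC", "AAAA", weights, 2)` returns the observed deltaSVM together with the fraction of random mutants that reach it. A small p-value indicates that the observed substitutions increased the predicted binding affinity more than expected by chance.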

Scripts used to generate all figures in the paper

Please check "selection_analysis.R" in the "scripts" folder

Data used to generate all figures in the paper

Please check the "data" folder

References

Ghandi M, Mohammad-Noori M, Ghareghani N, Lee D, Garraway L, Beer MA. 2016. gkmSVM: an R package for gapped-kmer SVM. Bioinformatics 32:2205–2207.

Lee D. 2016. LS-GKM: a new gkm-SVM for large-scale datasets. Bioinformatics 32:2196–2198.

Lee D, Gorkin DU, Baker M, Strober BJ, Asoni AL, McCallion AS, Beer MA. 2015. A method to predict the impact of regulatory variants from DNA sequence. Nat. Genet. 47:955–961.
