We developed a method to detect positive selection on transcription factor binding sites (TFBSs) based on changes in binding affinity during evolution. The method compares the observed change in binding affinity to a null distribution. The effects of substitutions on binding affinity can be accurately predicted by deltaSVM (Lee et al. 2015), a machine learning based method that predicts the effects of regulatory variants de novo from sequence.
- Procedure for detecting positive selection
1). Training of the gapped k-mer support vector machine (gkm-SVM)
First, we defined a positive training set and a corresponding negative training set. The positive training set consists of ChIP-seq narrow peaks of a transcription factor. The negative training set consists of an equal number of sequences randomly sampled from the genome and matched to the positive training set in length, GC content and repeat fraction. The negative training set was generated with “genNullSeqs”, a function of the gkm-SVM R package (Ghandi et al. 2016). We then trained a gkm-SVM with default parameters except -l=10 (meaning that 10-mers are used as features to distinguish the positive and negative training sets). The classification performance of the trained gkm-SVM was measured with receiver operating characteristic (ROC) curves under fivefold cross-validation. Training and cross-validation were performed with the “gkmtrain” function of “LS-GKM: a new gkm-SVM software for large-scale datasets” (Lee 2016). For details, please check https://github.com/Dongwon-Lee/lsgkm.
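As a rough illustration of the training step, the sketch below only assembles the two "gkmtrain" command lines (one for training, one for fivefold cross-validation) without running them. The file names "positive.fa", "negative.fa" and the output prefixes are placeholders, and the "-l" (word length) and "-x" (cross-validation folds) flags are written as we understand the LS-GKM usage text; please confirm them against the help output of your installed version.

```python
import shlex


def gkmtrain_command(pos_fasta, neg_fasta, out_prefix, word_length=10, cv_folds=None):
    """Build an argument list for LS-GKM's gkmtrain.

    All file names are placeholders. The flags -l and -x are assumptions
    based on the LS-GKM usage text and should be checked locally.
    """
    cmd = ["gkmtrain", "-l", str(word_length)]
    if cv_folds is not None:
        # Cross-validation mode: evaluates the classifier instead of
        # producing the final model.
        cmd += ["-x", str(cv_folds)]
    cmd += [pos_fasta, neg_fasta, out_prefix]
    return cmd


# Train the final model on all peaks and matched negative sequences.
train_cmd = gkmtrain_command("positive.fa", "negative.fa", "tf_model")

# Separate run to measure performance with fivefold cross-validation.
cv_cmd = gkmtrain_command("positive.fa", "negative.fa", "tf_model_cv", cv_folds=5)

print(shlex.join(train_cmd))
print(shlex.join(cv_cmd))
```

The commands are returned as lists so that they can be passed to `subprocess.run` directly once the paths point to real files.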
2). Generate SVM weights of all possible 10-mers based on the trained gkm-SVM
The SVM weights of all possible 10-mers were generated by using the “gkmpredict” function of “LS-GKM”.
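To score every 10-mer, the k-mers first have to be written to a FASTA file that "gkmpredict" can read. The sketch below shows one possible way to do this and to read the scores back. It assumes that each k-mer is used as its own FASTA identifier and that the score file has two whitespace-separated columns (identifier, score); both are assumptions for illustration, and the function names are hypothetical. The demo uses k=3 and a hand-written toy score file to stay small.

```python
import os
import tempfile
from itertools import product


def all_kmers(k, alphabet="ACGT"):
    """Yield every possible k-mer over the alphabet in lexicographic order."""
    for letters in product(alphabet, repeat=k):
        yield "".join(letters)


def write_kmer_fasta(path, k):
    """Write all k-mers to a FASTA file, using the k-mer itself as the record id."""
    n_written = 0
    with open(path, "w") as handle:
        for kmer in all_kmers(k):
            handle.write(f">{kmer}\n{kmer}\n")
            n_written += 1
    return n_written


def read_weights(path):
    """Read a two-column score file (id, score) into a dict (assumed format)."""
    weights = {}
    with open(path) as handle:
        for line in handle:
            fields = line.split()
            if len(fields) >= 2:
                weights[fields[0]] = float(fields[1])
    return weights


# Small demo with k=3; the real analysis would use k=10 (4**10 k-mers).
with tempfile.TemporaryDirectory() as tmp:
    fasta_path = os.path.join(tmp, "kmers_k3.fa")
    n_written = write_kmer_fasta(fasta_path, k=3)

    # Toy stand-in for a score file; these values are made up.
    toy_output = os.path.join(tmp, "toy_scores.txt")
    with open(toy_output, "w") as handle:
        handle.write("AAA\t0.5\nAAC\t-1.25\n")
    toy_weights = read_weights(toy_output)

print(n_written, toy_weights)
```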
3). Infer ancestor sequence
The ancestral sequence was inferred from a sequence alignment of the focal species with a sister species and an outgroup.
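The reconstruction method is not specified here, so the following is only a minimal sketch of one common approach, site-wise parsimony on three aligned sequences. The function name and the choice to mark unresolved or gapped columns as "N" are assumptions for illustration, not a description of the pipeline used in the paper.

```python
GAP = "-"


def infer_ancestor(focal, sister, outgroup):
    """Site-wise parsimony ancestor of focal and sister, polarized by the outgroup.

    Assumes the three sequences are already aligned to equal length.
    Columns that cannot be resolved, or that contain a gap, become "N".
    """
    if not (len(focal) == len(sister) == len(outgroup)):
        raise ValueError("aligned sequences must have the same length")

    ancestor = []
    for f, s, o in zip(focal.upper(), sister.upper(), outgroup.upper()):
        if GAP in (f, s, o):
            ancestor.append("N")   # gapped column: left unresolved
        elif f == s:
            ancestor.append(f)     # focal and sister agree
        elif o == f or o == s:
            ancestor.append(o)     # outgroup supports one of the two states
        else:
            ancestor.append("N")   # three different states: ambiguous
    return "".join(ancestor)


# Toy alignment: the focal lineage carries a substitution at position 4.
example_ancestor = infer_ancestor("ACGTA", "ACGAA", "ACGAC")
print(example_ancestor)
```

In this toy alignment the sister and outgroup share "A" at the fourth position, so the "T" in the focal sequence is treated as derived.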
4). Infer positive selection
Given the SVM weights of all possible 10-mers and both the ancestral and focal sequences, we inferred signals of positive selection with "testPosSelec.pl". This script is in the "scripts" folder and was modified from "deltasvm.pl", a script contributed by Lee et al. (2015) that calculates deltaSVM scores.
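The logic of the test can be sketched in a few lines of Python. This is not a port of "testPosSelec.pl": it assumes that the weight table contains every k-mer on both strands, that the null distribution is built by placing the same number of random substitutions on the ancestral sequence, and that the p-value is the fraction of null deltaSVM values at least as large as the observed one (with a +1 correction). All function names are hypothetical, and the toy example uses k=2 with made-up weights.

```python
import random


def sequence_score(seq, weights, k):
    """Sum the SVM weights of all k-mers in a sequence (missing k-mers count as 0)."""
    return sum(weights.get(seq[i:i + k], 0.0) for i in range(len(seq) - k + 1))


def delta_svm(ancestor, focal, weights, k):
    """deltaSVM: score of the focal sequence minus score of the ancestor."""
    return sequence_score(focal, weights, k) - sequence_score(ancestor, weights, k)


def random_mutant(ancestor, n_subs, rng):
    """Apply n_subs random substitutions at distinct positions of the ancestor."""
    seq = list(ancestor)
    for pos in rng.sample(range(len(seq)), n_subs):
        seq[pos] = rng.choice([base for base in "ACGT" if base != seq[pos]])
    return "".join(seq)


def selection_p_value(ancestor, focal, weights, k, n_perm=1000, seed=0):
    """Compare the observed deltaSVM to a null from random substitutions (assumed null)."""
    rng = random.Random(seed)
    n_subs = sum(a != f for a, f in zip(ancestor, focal))
    observed = delta_svm(ancestor, focal, weights, k)
    hits = 0
    for _ in range(n_perm):
        null_delta = delta_svm(ancestor, random_mutant(ancestor, n_subs, rng), weights, k)
        if null_delta >= observed:
            hits += 1
    # +1 correction so the estimated p-value is never exactly zero.
    return observed, (hits + 1) / (n_perm + 1)


# Toy example: only the 2-mer "GG" carries weight, and the focal sequence gained it.
toy_weights = {"GG": 1.0}
observed, p_value = selection_p_value("ACAC", "AGGC", toy_weights, k=2, n_perm=200)
print(observed, p_value)
```

A small p-value here means that random substitutions rarely increase the predicted binding affinity as much as the observed ones, which is the signal interpreted as positive selection for increased binding. The same comparison with `<=` would test for decreased binding.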
- Scripts used to generate all figures in the paper
Please check "selection_analysis.R" in the "scripts" folder
- Data used to generate all figures in the paper
Please check the "data" folder
- References
Ghandi M, Mohammad-Noori M, Ghareghani N, Lee D, Garraway L, Beer MA. 2016. gkmSVM: an R package for gapped-kmer SVM. Bioinformatics 32:2205–2207.
Lee D. 2016. LS-GKM: a new gkm-SVM for large-scale datasets. Bioinformatics 32:2196–2198.
Lee D, Gorkin DU, Baker M, Strober BJ, Asoni AL, McCallion AS, Beer MA. 2015. A method to predict the impact of regulatory variants from DNA sequence. Nat. Genet. 47:955–961.