eGADA (enhanced Genomic Alteration Detection Algorithm) is an enhanced version of GADA (Pique-Regi 2007, Pique-Regi 2008).
It is a fast segmentation algorithm utilizing the Sparse Bayesian Learning (or Relevance Vector Machine) technique from Tipping 2001. It can be applied to array intensity data, sequencing coverage data, or any sequential data that displays characteristics of step-wise functions.
eGADA (Huang 2023) vs GADA (Pique-Regi 2007):
- Use a customized Red-Black tree to expedite the final backward elimination step.
- Code in C++, not C.
- Use Boost libraries extensively.
- More friendly help and commandline-argument processing.
- More friendly input and output formats.
- A dynamic library eGADA.so (packaged via Boost.Python) that offers API to Python.
- Minor changes/bugfixes.
- libboost-program-options-dev: Boost program-options dev library
- libboost-iostreams-dev: Boost iostreams dev library
- libboost-python-dev: Boost python dev library
cd src;
make
./src/eGADA -i ../data/input.txt -o ../data/output2.tsv.gz
- The input is a single-column plain text file. It can also read gzipped file (.gz).
- The output is a 4-column tsv file (both .tsv and .tsv.gz are supported.):
- Start is the starting index (1-based) of the segment.
- Stop is the ending index (1-based and inclusive) of the segment.
- Length is the number of data points/bins/probes included in the segment.
- Ampl (Amplitude) is the average amplitude/coverage of the segment.
# eGADA: enhanced Genome Alteration Detection Algorithm
# Authors: Yu Huang polyactis@gmail.com, Roger Pique-Regi piquereg@usc.edu
# Parameters: a=0.2,T=5,MinSegLen=0,sigma2=0.207949,BaseAmp=0, convergenceDelta=1e-08, maxNoOfIterations=50000, convergenceMaxAlpha=1e+08, convergenceB=1e-20.
# Reading M=80000 probes in input file
# Overall mean 0.00135799
# Sigma^2=0.207949
# Convergence: delta=9.88473e-09 after 924 EM iterations.
# Found 1915 breakpoints after SBL
# Kept 442 breakpoints after BE
Start Stop Length Ampl
1 25 25 0.881578
26 75 50 -1.08218
76 125 50 1.04644
126 175 50 -0.973855
176 226 51 0.912763
227 275 49 -0.954001
276 325 50 0.963878
326 375 50 -1.16553
376 427 52 0.948964
...
- Type
eGADA
oreGADA -h
for more help.
yh@fusilier:~/src/eGADA/src$ ./eGADA
program name is ./eGADA.
Usage:
./eGADA -i INPUTFNAME -o OUTPUTFNAME [OPTIONS]
-h [ --help ] Produce this help message.
-T [ --TBackElim ] arg (=5) Minimal T-stat during the backward
elimination of breakpoints. T-stat =
(mean1-mean2)/stddev between two
adjacent segments.
-a [ --aAlpha ] arg (=0.5) SBL hyper prior parameter for a
breakpoint. It is the shape parameter
of the Gamma distribution. Higher
(lower) value means less (more)
breakpoints.
-M [ --MinSegLen ] arg (=0) Minimal length required for any
segment.
--BaseAmp arg (=0) The amplitude for the copy-neutral
state. It is used if the -c option
value is non-zero and the algorithm
need to classify segments into
normal/gain/loss.
-s [ --sigma2 ] arg (=-1) Variance of input data values. If
negative, it will be estimated by the
algorithm. We recommend it to be
estimated by the algorithm (~ trimmed
mean.
-c [ --SelectClassifySegments ] arg (=0)
Non-zero value to classify segments
into normal/gain/loss state.
--SelectEstimateBaseAmp arg (=1) Non-zero value to estimate BaseAmp from
data, rather than user-supplied.
--convergenceDelta arg (=1e-08) A delta number controlling convergence
in the EM algorithm
--maxNoOfIterations arg (=50000) The maximum number of iterations before
the EM convergence algorithm is
stopped.
--convergenceMaxAlpha arg (=100000000)
One convergence related number.
--convergenceB arg (=9.9999999999999995e-21)
one convergence related number.
-b [ --debug ] Toggle debug/verbose mode for more
status output
-r [ --report ] Toggle report mode
--reportIntervalDuringBE arg (=100000)
How often to report any break point
removed during backward elimination.
-i [ --inputFname ] arg Input file path. It could be specified
as an option or positional argument. If
the suffix is .gz, the software will
unzip it upon reading. It is a
single-column text file with no header.
-o [ --outputFname ] arg Output filepath. If the suffix is .gz,
the software will zip the output
automatically.
Examples:
./eGADA -i /tmp/input.tsv.gz -o /tmp/output.tsv.gz -M 10 --convergenceDelta 0.001
./src/testGADA.py -i ./data/input.txt -o ./data/output_a0.5T4M5.tsv
- https://hub.docker.com/repository/docker/polyactis/egada
- Binary and testGADA.py are deposited in /opt/eGADA.
- Run
docker pull polyactis/egada
orsingularity pull docker://polyactis/egada
.
- Huang YS. eGADA: enhanced Genomic Alteration Detection Algorithm. bioRxiv. 2023
- Tipping ME. Sparse Bayesian learning and the relevance vector machine. Journal of machine learning research. 2001;1(Jun):211-44.
- Pique-Regi R, Tsau ES, Ortega A, Seeger R, Asgharzadeh S. Wavelet footprints and sparse bayesian learning for DNA copy number change analysis. In 2007 IEEE International Conference on Acoustics, Speech and Signal Processing-ICASSP'07 2007 Apr 15 (Vol. 1, pp. I-353). IEEE.
- Pique-Regi R, Monso-Varona J, Ortega A, Seeger RC, Triche TJ, Asgharzadeh S. Sparse representation and Bayesian detection of genome copy number alterations from microarray data. https://github.com/rpique/GADA. Bioinformatics. 2008 Feb 1;24(3):309-18.
- https://github.com/isglobal-brge/R-GADA: R implementation of GADA.