PATO: high PerformAnce TriplexatOr is a high-performance tool for the fast and efficient detection of triple helices and triplex features in nucleotide sequences. PATO is a modern alternative to Triplexator and TDF and works as a drop-in replacement that accelerates triplex analyses on multicore computers. Moreover, PATO's efficiency allows a more exhaustive search of the triplex-forming solution space, so it achieves higher prediction accuracy in far less time than other state-of-the-art tools.
Download the source code from this repository, either with Git or as a copy from GitHub, and let CMake compile PATO for you:
$ cmake -B build . && cmake --build build
Note that macOS users must explicitly specify an OpenMP-enabled compiler to
compile PATO. For instance, to use g++-12 (available via Homebrew), execute:
$ cmake -B build -D CMAKE_CXX_COMPILER=g++-12 . && cmake --build build
Now that PATO has been compiled, execute the application as follows:
$ ./build/tools/PATO/PATO [options] {-ss tfo_file | -ds tts_file | -ss tfo_file -ds tts_file}
Execute ./build/tools/PATO/PATO --help for a detailed list of execution modes, options, and flags.
To predict Triplex-forming Oligonucleotides (TFOs): run PATO with a single-stranded sequence file to detect regions that may form triplexes:
$ ./build/tools/PATO/PATO -ss single_stranded.fa
This will generate a file containing TFO regions.
To predict Triplex Target Sites (TTSs): run PATO with a double-stranded sequence file to detect regions in the sequences that may serve as targets for triplex formation:
$ ./build/tools/PATO/PATO -ds double_stranded.fa
This will generate a file containing TTS regions.
To predict potential triplexes: match TFO regions from a single-stranded sequence file with TTS regions from a double-stranded sequence file:
$ ./build/tools/PATO/PATO -ss single_stranded.fa -ds double_stranded.fa
This will produce a file containing all individual triple helices found between the sequences, and another file listing the interaction strength between each sequence pair.
To select candidate triplexes, sort the interactions between each sequence pair
according to Total (rel) and study the strongest triplexes.
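The sorting step can be sketched as a small shell function. This is only an illustration under stated assumptions: it assumes the interaction file is tab-separated and starts with a header line containing a column named exactly Total (rel). The function name rank_by_total_rel is hypothetical and not part of PATO.

```shell
# rank_by_total_rel FILE [N]
# Prints the header line of FILE followed by its N strongest rows (default 10),
# ordered by the "Total (rel)" column, highest value first.
# Assumption: FILE is tab-separated and its first line is a header that
# contains a column named exactly "Total (rel)".
rank_by_total_rel() {
    file="$1"
    top="${2:-10}"
    tab="$(printf '\t')"

    # Find the 1-based position of the "Total (rel)" column in the header.
    col="$(head -n 1 "$file" | tr '\t' '\n' | grep -n -x -F 'Total (rel)' | cut -d: -f1)"
    if [ -z "$col" ]; then
        echo "column 'Total (rel)' not found in $file" >&2
        return 1
    fi

    head -n 1 "$file"
    # -g compares general numeric values (decimals, exponents); -r puts the
    # largest first. LC_ALL=C forces "." as the decimal separator.
    tail -n +2 "$file" | LC_ALL=C sort -t "$tab" -k "${col},${col}" -g -r | head -n "$top"
}
```

Calling rank_by_total_rel with the path to the interaction file and, optionally, the number of rows to keep prints the strongest sequence pairs first. If the file layout differs from the assumption above, adjust the separator and the column name accordingly.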
PATO uses OpenMP to parallelize its triplex search algorithms. The OpenMP
runtime will automatically spawn as many threads as there are available CPU
cores. To reduce the number of threads spawned by the application, explicitly
set the OMP_NUM_THREADS environment variable to a value greater than 0. For
instance, to run PATO with 4 threads, execute:
$ OMP_NUM_THREADS=4 ./build/tools/PATO/PATO ...
To reduce the memory footprint of PATO, one can set the maximum number of
sequences that may be processed simultaneously by the triplex search
algorithms. This is done by setting the -cs or --chunk-size option to a value
greater than 0 (128 by default). For instance, to process a dataset in chunks
of 32 sequences, execute:
$ ./build/tools/PATO/PATO --chunk-size 32 ...
To give an upper bound of the memory consumption of PATO: high PerformAnce TriplexatOr, one can use the following formula:
where
It is possible to further reduce the memory usage of the application by
disabling the filtering of low-complexity regions. This can be done by setting
the -fr or --filter-repeats option to false. In such a case, sequences should
be filtered before being passed to PATO (the Ensembl genome browser provides
filtered sequences), and the formula becomes:
In general, one can't go wrong by setting the number of simultaneous sequences equal to the number of threads that PATO is going to use. However, if the sequences of a dataset are very long, it may be necessary to reduce the number of simultaneous sequences to avoid running out of memory.
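As a rough illustration of this rule of thumb, the following wrapper starts PATO with the same value for the thread count and the chunk size. The names run_pato_matched and PATO_BIN are hypothetical helpers introduced for this sketch. Only the binary path, the OMP_NUM_THREADS variable, and the --chunk-size option come from the instructions above.

```shell
# Path taken from the build instructions above; set PATO_BIN to override it
# if the binary lives elsewhere (PATO_BIN is a name made up for this sketch).
PATO_BIN="${PATO_BIN:-./build/tools/PATO/PATO}"

# run_pato_matched THREADS [PATO arguments...]
# Runs PATO with THREADS OpenMP threads and a chunk size of THREADS sequences.
run_pato_matched() {
    threads="$1"
    # Both OMP_NUM_THREADS and --chunk-size expect a value greater than 0.
    case "$threads" in
        ''|*[!0-9]*|0)
            echo "usage: run_pato_matched THREADS [PATO arguments...]" >&2
            return 1
            ;;
    esac
    shift
    OMP_NUM_THREADS="$threads" "$PATO_BIN" --chunk-size "$threads" "$@"
}
```

For example, run_pato_matched 4 -ss single_stranded.fa -ds double_stranded.fa would run the triplex search with 4 threads on chunks of 4 sequences. For datasets with very long sequences, a smaller chunk size than the thread count may still be needed.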
If you are unsure about the number of simultaneous sequences to use, you can
set the -cs or --chunk-size option to 1. Although this may hurt parallelism by
a small amount, it will allow you to run PATO on any dataset without having to
worry about the memory footprint of the application.
If you use PATO in your research, please cite our work using the following reference:
@article{amatria2023pato,
title={PATO: genome-wide prediction of {lncRNA--DNA} triple helices},
author={Amatria-Barral, I{\~n}aki and Gonz{\'a}lez-Dom{\'\i}nguez, Jorge and Touri{\~n}o, Juan},
journal={Bioinformatics},
volume={39},
number={3},
pages={btad134},
year={2023}
}
PATO is free software, distributed under the MIT License.