Given a minimum segment length and m sequences of length n drawn from an alphabet of size σ, create a segmentation in O(mn log σ) time and use various matching strategies to join the segment texts to generate founder sequences. Please see releases for pre-built binaries.
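To make the terminology concrete, the following minimal Python sketch splits a few equal-length sequences at given boundaries and lists the distinct segment texts in each segment. It is an illustration only: the function name `distinct_segment_texts` and the hand-picked boundaries are assumptions, and it does not implement the O(mn log σ) segmentation algorithm used by the tool.

```python
def distinct_segment_texts(sequences, boundaries, min_length):
    """Split equal-length sequences at the given boundaries and return,
    for each segment, the sorted list of distinct substrings."""
    n = len(sequences[0])
    if any(len(s) != n for s in sequences):
        raise ValueError("all sequences must have the same length")
    cuts = [0, *boundaries, n]
    segments = []
    for start, end in zip(cuts, cuts[1:]):
        # Enforce the minimum segment length on every segment.
        if end - start < min_length:
            raise ValueError(f"segment [{start}, {end}) is shorter than {min_length}")
        segments.append(sorted({s[start:end] for s in sequences}))
    return segments


# Hypothetical example data; the boundaries are chosen by hand.
sequences = ["AACCGT", "AACCGA", "TTCCGT", "TTCCGA"]
segments = distinct_segment_texts(sequences, boundaries=[2, 5], min_length=1)
for texts in segments:
    print(len(texts), texts)
```

With these boundaries no segment contains more than two distinct texts.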
If you use the software in an academic setting we kindly ask you to cite the following paper:
@article{NorriCKM19,
author = {Tuukka Norri and
Bastien Cazaux and
Dmitry Kosolobov and
Veli M{\"{a}}kinen},
title = {Linear time minimum segmentation enables scalable founder reconstruction},
journal = {Algorithms for Molecular Biology},
volume = {14},
number = {1},
pages = {12:1--12:15},
year = {2019},
url = {https://doi.org/10.1186/s13015-019-0147-6},
doi = {10.1186/s13015-019-0147-6},
timestamp = {Mon, 29 Jul 2019 15:58:48 +0200}
}
On Linux, the following tools and libraries are required:
- A recent version of Clang and libclang. C++17 support is required. Building the tools has been tested with Clang 6.0.
- GNU gengetopt (tested with version 2.22.6)
- CMake
- Boost
- On Linux, libdispatch dependencies including Ninja and SystemTap development files are required.
For installing libdispatch dependencies, the package list in the Building and installing for Linux section in the libdispatch installation guide can be helpful. On Linux, libdispatch itself is built as part of our build process, so it does not need to be installed. On macOS, the operating system libraries are used instead.
In short:
- git clone --recursive https://github.com/tsnorri/founder-sequences.git
- cd founder-sequences
- cp linux-static.local.mk local.mk
- Edit local.mk.
- make -j12
In detail:
- Clone the repository: git clone --recursive https://github.com/tsnorri/founder-sequences.git
- Change the working directory: cd founder-sequences
- Create the file local.mk. The file linux-static.local.mk is provided as an example and may be copied: cp linux-static.local.mk local.mk
- Edit local.mk in the repository root to override build variables. Useful variables include CC, CXX, and GENGETOPT for the C compiler, the C++ compiler, and gengetopt respectively. BOOST_ROOT is used to determine the location of Boost headers and libraries. BOOST_LIBS and LIBDISPATCH_LIBS are passed to the linker. See common.mk for additional variables.
- Run make with a suitable number of parallel jobs, e.g. make -j12
Useful make targets include:
- all: Build everything.
- clean: Remove build products except for dependencies (in the lib folder).
- clean-all: Remove all build products.
The package contains founder_sequences as well as some auxiliary tools.
founder_sequences takes as its input a text file that contains a list of sequence file paths. A FASTA file with short lines (less than 1 kb) may be used instead. The tool generates a segmentation in which no substring is shorter than the value given with --segment-length-bound. It then joins the segments using the method specified with --segment-joining and writes the founder sequences, one sequence per line, to the path given with --output-founders. In addition, the segments may be written to a separate file with --output-segments.
founder_sequences --input=input-list.txt --segment-length-bound=10 --output-segments=segments.txt --output-founders=founders.txt
input-list.txt should contain the paths of the sequence files, one path per line. Each sequence file should contain a single sequence without a terminating newline. The segment length bound specifies the minimum segment length.
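The input layout described above can be prepared with a short script. The following Python sketch is one possible way to do it; the helper name `write_inputs` and the `seq-N.txt` file naming are assumptions made for illustration and are not part of the package.

```python
from pathlib import Path
import tempfile


def write_inputs(sequences, directory):
    """Write each sequence to its own file without a trailing newline and
    return the path of a list file that names them, one path per line."""
    directory = Path(directory)
    paths = []
    for i, seq in enumerate(sequences):
        path = directory / f"seq-{i}.txt"  # hypothetical file naming
        path.write_text(seq)               # write_text appends no newline
        paths.append(path)
    list_path = directory / "input-list.txt"
    list_path.write_text("".join(f"{p}\n" for p in paths))
    return list_path


# Hypothetical example data written to a temporary directory.
workdir = tempfile.mkdtemp()
list_path = write_inputs(["AACCGT", "AACCGA", "TTCCGT"], workdir)
print(list_path.read_text(), end="")
```

The resulting list file can then be passed to founder_sequences with --input.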
Reads aligned texts from the file paths in a given list. Outputs the reduced texts to files created in the current directory. The identity columns are written to the standard output as a sequence of zeros and ones, where a one indicates identity.
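As an illustration of what reducing the texts means, the Python sketch below marks each column in which all aligned texts agree with a one and drops those columns from every text. The function name `reduce_texts` is an assumption, and the sketch works on in-memory strings instead of files, so it only mirrors the idea and not the tool's actual behaviour or output format.

```python
def reduce_texts(texts):
    """Return (reduced_texts, mask). The mask holds '1' for columns in which
    all texts agree and '0' elsewhere; reduced texts keep only '0' columns."""
    n = len(texts[0])
    if any(len(t) != n for t in texts):
        raise ValueError("aligned texts must have equal length")
    # zip(*texts) iterates over the alignment column by column.
    mask = "".join("1" if len(set(col)) == 1 else "0" for col in zip(*texts))
    reduced = ["".join(c for c, m in zip(t, mask) if m == "0") for t in texts]
    return reduced, mask


# Hypothetical example data.
reduced, mask = reduce_texts(["AACCGT", "AACCGA", "TTCCGT"])
print(mask)
print(reduced)
```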
Given a set of founder sequences, a reference sequence and a list of identity columns, outputs the founder sequences with the identity columns included.
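The reverse operation can be sketched in the same way. The Python example below assumes the identity columns are given as a string of zeros and ones (one meaning identity) and copies the reference character into every identity column; the function name `restore_identity_columns` and the in-memory interface are hypothetical.

```python
def restore_identity_columns(founders, reference, mask):
    """Rebuild full-length founders: identity columns ('1') are taken from
    the reference, the remaining columns from the reduced founder."""
    if len(reference) != len(mask):
        raise ValueError("reference and mask must have equal length")
    variable = mask.count("0")
    restored = []
    for founder in founders:
        if len(founder) != variable:
            raise ValueError("founder length must equal the number of '0' columns")
        chars = iter(founder)
        restored.append(
            "".join(ref if m == "1" else next(chars) for ref, m in zip(reference, mask))
        )
    return restored


# Hypothetical example data.
full = restore_identity_columns(["AAT", "TTA"], reference="AACCGT", mask="001110")
print(full)
```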
Matches sequences to founder sequences and outputs statistics. Uses a greedy algorithm to find the longest match in the set of founders.
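A greedy longest-match strategy of this kind can be sketched as follows. The Python example assumes that founders and sequences are aligned position by position and that every position is matched by at least one founder; the function name `greedy_match` and the reported statistic (the number of founder switches) are assumptions, since the statistics the tool prints are not described here.

```python
def greedy_match(sequence, founders):
    """Cover `sequence` left to right; at each step pick the founder whose
    match starting at the current position is longest.
    Returns a list of (founder_index, start, end) with end exclusive."""
    pieces = []
    pos = 0
    while pos < len(sequence):
        best_idx, best_len = None, 0
        for idx, founder in enumerate(founders):
            length = 0
            while (pos + length < len(sequence)
                   and pos + length < len(founder)
                   and founder[pos + length] == sequence[pos + length]):
                length += 1
            if length > best_len:
                best_idx, best_len = idx, length
        if best_idx is None:
            # Assumption: an unmatched position is treated as an error.
            raise ValueError(f"no founder matches position {pos}")
        pieces.append((best_idx, pos, pos + best_len))
        pos += best_len
    return pieces


# Hypothetical example data.
founders = ["AACCGA", "TTCCGT"]
pieces = greedy_match("AACCGT", founders)
print(pieces)
print("founder switches:", len(pieces) - 1)
```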