Succinct Colored de Bruijn Graph
Colorgram requirese five third party packages:
- KMC tool - https://github.com/refresh-bio/KMC
- STXXL - https://github.com/stxxl/stxxl
- SDSL-Lite - https://github.com/simongog/sdsl-lite
- Sparsepp - https://github.com/greg7mdp/sparsepp
- Google Test - https://github.com/google/googletest
You can simply run ./create_environment.sh
that creates the 3rd_party folder, clones and complies everything there automatically.
To create the succinct colored de Bruijn graph run colorgram_tool.sh
with the following parameters:
./colorgram_tool.sh -k=32 <file-list> <kmc-files-dir> <out-filename>
OR
./colorgram_tool.sh -k=32 <fasta-file> <kmc-files-dir> <out-filename> -m
In the first case you have to specify a <file-list>
that contains file names (with absolute paths) where each file is in FASTA-format.
In this case each line in the <file-list>
correspond to one color. The <kmc-files-dir>
should be an arbitrary empty (or non-existent)
directory that is the temporary working directory storing the KMC files output by the KMC tool. Finally the <out-filename>
should be the name of your desired output file. Colorgram will create 3 different files: out-filename.dbg (storing the succinct
de Bruijn graph), out-filename.x (storing the label vector) and out-filename.ct (storing the color table).
The second case is when you specify the -m
argument. Here you have to give one FASTA-format file <fasta-file>
. The
number of colors will be the the number of reads in <fasta-file>
.
Note that in both cases the FASTA-files should be valid meaning that each read should contain only ACTG
characters and
for technical reasons each read should be stored on ONE line. For examples you can check the Testing part.
If you want to use the features of Colorgram, you have to manually compile colorgram-stats
program. By default when you
run the colorgram_tool.sh
it creates a build
folder and compiles the programs there. You can simply run
build/colorgram-stats <out-filename>
where you have to specify the same file name that you used for the building (there
must exist out-filename.dbg, out-filename.x and out-filename.ct files).
Note that Bubble Calling is not yet added.
Colorgram was tested with k<=63 values - note that KMC tool has a bug that sometimes close to k=64 it generates corrupted k-mer values.
You can simply run
./colorgram_tool.sh -k=4 tests/test_lst.txt tests/edges tests/out_tests
AND
./colorgram_tool.sh -k=4 tests/test_kmers_all.fna tests/edges2 tests/out_tests2 -m
Then you can cd to build
and run ./succinct-dbg-tests
.