add support for fasta inputs
shubhamchandak94 committed Jan 3, 2020
1 parent 02370c4 commit 3a01ec2
Showing 13 changed files with 478 additions and 46 deletions.
51 changes: 31 additions & 20 deletions README.md
@@ -72,49 +72,51 @@ Allowed options:
  -c [ --compress ] compress
  -d [ --decompress ] decompress
  --decompress-range arg --decompress-range start end
- (optional) decompress only reads (or read
- pairs for PE datasets) from start to end
- (both inclusive) (1 <= start <= end <=
- num_reads (or num_read_pairs for PE)). If -r
- was specified during compression, the range
- of reads does not correspond to the original
+ (optional) decompress only reads (or read
+ pairs for PE datasets) from start to end
+ (both inclusive) (1 <= start <= end <=
+ num_reads (or num_read_pairs for PE)). If -r
+ was specified during compression, the range
+ of reads does not correspond to the original
  order of reads in the FASTQ file.
  -i [ --input-file ] arg input file name (two files for paired end)
- -o [ --output-file ] arg output file name (for paired end
+ -o [ --output-file ] arg output file name (for paired end
  decompression, if only one file is specified,
  two output files will be created by suffixing
  .1 and .2.)
- -w [ --working-dir ] arg (=.) directory to create temporary files (default
+ -w [ --working-dir ] arg (=.) directory to create temporary files (default
  current directory)
  -t [ --num-threads ] arg (=8) number of threads (default 8)
- -r [ --allow-read-reordering ] do not retain read order during compression
+ -r [ --allow-read-reordering ] do not retain read order during compression
  (paired reads still remain paired)
- --no-quality do not retain quality values during
+ --no-quality do not retain quality values during
  compression
- --no-ids do not retain read identifiers during
+ --no-ids do not retain read identifiers during
  compression
  -q [ --quality-opts ] arg quality mode: possible modes are
  1. -q lossless (default)
- 2. -q qvz qv_ratio (QVZ lossy compression,
- parameter qv_ratio roughly corresponds to
+ 2. -q qvz qv_ratio (QVZ lossy compression,
+ parameter qv_ratio roughly corresponds to
  bits used per quality value)
  3. -q ill_bin (Illumina 8-level binning)
- 4. -q binary thr high low (binary (2-level)
- thresholding, quality binned to high if >=
+ 4. -q binary thr high low (binary (2-level)
+ thresholding, quality binned to high if >=
  thr and to low if < thr)
- -l [ --long ] Use for compression of arbitrarily long read
- lengths. Can also provide better compression
- for reads with significant number of indels.
- -r disabled in this mode. For Illumina short
+ -l [ --long ] Use for compression of arbitrarily long read
+ lengths. Can also provide better compression
+ for reads with significant number of indels.
+ -r disabled in this mode. For Illumina short
  reads, compression is better without -l flag.
  -g [ --gzipped_fastq ] enable if compression input is gzipped fastq
  or to output gzipped fastq during
  decompression
+ --fasta-input enable if compression input is fasta file
+ (i.e., no qualities)
  ```
  Note that the SPRING compressed files are tar archives consisting of the different compressed streams, although we recommend using the `.spring` extension as in the examples shown below.

  ### Resource usage
- For the memory and CPU performance for SPRING, please see the paper and the associated supplementary material. Note that SPRING uses some temporary disk space, and can fail if the disk space is not sufficient. Assuming that qualities and ids are not being discarded and SPRING is operating in the short read mode, the additional temporary disk usage is around 10-30% of the original uncompressed file (on the lower end when quality values are from newer Illumina machines and are more compressible) when -r flag is not specified (i.e., default lossless mode). When -r flag is specified, SPRING writes all the quality values and read ids to a temporary file leading to significantly higher temporary disk usage - closer to 70-80% of the original file size. Note that these figures are approximate and include the space needed for the final compressed file.
+ For the memory and CPU performance for SPRING, please see the paper and the associated supplementary material. Note that SPRING uses some temporary disk space, and can fail if the disk space is not sufficient. Assuming that qualities and ids are not being discarded and SPRING is operating in the short read mode, the additional temporary disk usage is around 10-30% of the original uncompressed file (on the lower end when quality values are from newer Illumina machines and are more compressible) when -r flag is not specified (i.e., default lossless mode). When -r flag is specified, SPRING writes all the quality values and read ids to a temporary file leading to significantly higher temporary disk usage - closer to 70-80% of the original file size. Note that these figures are approximate and include the space needed for the final compressed file.

  ### Example Usage of SPRING
  This section contains several examples for SPRING compression and decompression with various modes and options. The compressed SPRING file uses the `.spring` extension as a convention.
@@ -183,3 +185,12 @@ Decompressing (paired end) to file_1.fastq and file_2.fastq, only decompress pai
  ```bash
  ./spring -d -i file.spring -o file_1.fastq file_2.fastq --decompress-range 4000000 8000000
  ```
+ Compressing file_1.fasta and file_2.fasta (fasta files without qualities) losslessly using default 8 threads (Lossless).
+ ```bash
+ ./spring -c -i file_1.fasta file_2.fasta -o file.spring --fasta-input
+ ```
+
+ Decompressing (paired end) to file_1.fasta and file_2.fasta (previous example contd.).
+ ```bash
+ ./spring -d -i file.spring -o file_1.fasta file_2.fasta
+ ```
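
The `--fasta-input` switch documented above implies that no quality values are kept, which the `src/spring.cpp` hunk of this commit enforces by forcing `preserve_quality` to false. A minimal sketch of that flag resolution follows. The `CliFlags` and `PreserveSettings` structs and the `resolve_preserve_settings` helper are hypothetical names introduced for illustration only; SPRING itself passes the flags as separate `bool` arguments.

```cpp
#include <cassert>

// Hypothetical bundle of the command-line switches (not a SPRING type).
struct CliFlags {
  bool allow_reordering = false;  // -r / --allow-read-reordering
  bool no_quality = false;        // --no-quality
  bool no_ids = false;            // --no-ids
  bool fasta_input = false;       // --fasta-input
};

// Hypothetical result type mirroring preserve_order / preserve_id /
// preserve_quality from the spring.cpp hunk.
struct PreserveSettings {
  bool order;
  bool id;
  bool quality;
};

// Derive what the compressor should retain. FASTA input carries no
// quality line, so it overrides the quality setting unconditionally.
inline PreserveSettings resolve_preserve_settings(const CliFlags &f) {
  PreserveSettings p;
  p.order = !f.allow_reordering;
  p.id = !f.no_ids;
  p.quality = !f.no_quality;
  if (f.fasta_input) p.quality = false;
  return p;
}
```

Under these assumptions, a FASTA run keeps read order and identifiers but never qualities, regardless of whether `--no-quality` was also passed.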
8 changes: 5 additions & 3 deletions src/main.cpp
@@ -41,7 +41,7 @@ int main(int argc, char** argv) {
    namespace po = boost::program_options;
    bool help_flag = false, compress_flag = false, decompress_flag = false,
         pairing_only_flag = false, no_quality_flag = false, no_ids_flag = false,
-        long_flag = false, gzip_flag = false;
+        long_flag = false, gzip_flag = false, fasta_flag = false;
    std::vector<std::string> infile_vec, outfile_vec, quality_opts;
    std::vector<uint64_t> decompress_range_vec;
    std::string working_dir;
@@ -89,7 +89,9 @@ int main(int argc, char** argv) {
        "reads, compression is better without -l flag.")(
        "gzipped-fastq,g", po::bool_switch(&gzip_flag),
        "enable if compression input is gzipped fastq or to output gzipped fastq "
-       "during decompression");
+       "during decompression")(
+       "fasta-input", po::bool_switch(&fasta_flag),
+       "enable if compression input is fasta file (i.e., no qualities)");
    po::variables_map vm;
    po::store(po::parse_command_line(argc, argv, desc), vm);
    po::notify(vm);
@@ -131,7 +133,7 @@ int main(int argc, char** argv) {
    if (compress_flag)
      spring::compress(temp_dir, infile_vec, outfile_vec, num_thr,
                       pairing_only_flag, no_quality_flag, no_ids_flag,
-                      quality_opts, long_flag, gzip_flag);
+                      quality_opts, long_flag, gzip_flag, fasta_flag);
    else
      spring::decompress(temp_dir, infile_vec, outfile_vec, num_thr,
                         decompress_range_vec, gzip_flag);
4 changes: 2 additions & 2 deletions src/preprocess.cpp
@@ -33,7 +33,7 @@ namespace spring {

  void preprocess(const std::string &infile_1, const std::string &infile_2,
                  const std::string &temp_dir, compression_params &cp,
-                 const bool &gzip_flag) {
+                 const bool &gzip_flag, const bool &fasta_flag) {
    std::string infile[2] = {infile_1, infile_2};
    std::string outfileclean[2];
    std::string outfileN[2];
@@ -158,7 +158,7 @@ void preprocess(const std::string &infile_1, const std::string &infile_2,
  done[j] = false;
  std::string *id_array = (j == 0) ? id_array_1 : id_array_2;
  uint32_t num_reads_read = read_fastq_block(
-     fin[j], id_array, read_array, quality_array, num_reads_per_step);
+     fin[j], id_array, read_array, quality_array, num_reads_per_step, fasta_flag);
  if (num_reads_read < num_reads_per_step) done[j] = true;
  if (num_reads_read == 0) continue;
  if (num_reads[0] + num_reads[1] + num_reads_read > MAX_NUM_READS) {
2 changes: 1 addition & 1 deletion src/preprocess.h
@@ -22,7 +22,7 @@ namespace spring {

  void preprocess(const std::string &infile_1, const std::string &infile_2,
                  const std::string &temp_dir, compression_params &cp,
-                 const bool &gzip_flag);
+                 const bool &gzip_flag, const bool &fasta_flag);

  } // namespace spring

6 changes: 4 additions & 2 deletions src/spring.cpp
@@ -44,7 +44,7 @@ void compress(const std::string &temp_dir,
                const bool &pairing_only_flag, const bool &no_quality_flag,
                const bool &no_ids_flag,
                const std::vector<std::string> &quality_opts,
-               const bool &long_flag, const bool &gzip_flag) {
+               const bool &long_flag, const bool &gzip_flag, const bool &fasta_flag) {
    //
    // Ensure that omp parallel regions are executed with the requested
    // #threads.
@@ -60,6 +60,8 @@ void compress(const std::string &temp_dir,
    preserve_order = !pairing_only_flag;
    preserve_id = !no_ids_flag;
    preserve_quality = !no_quality_flag;
+   if (fasta_flag)
+     preserve_quality = false;
    switch (infile_vec.size()) {
      case 0:
        throw std::runtime_error("No input file specified");
@@ -135,7 +137,7 @@ void compress(const std::string &temp_dir,

    std::cout << "Preprocessing ...\n";
    auto preprocess_start = std::chrono::steady_clock::now();
-   preprocess(infile_1, infile_2, temp_dir, cp, gzip_flag);
+   preprocess(infile_1, infile_2, temp_dir, cp, gzip_flag, fasta_flag);
    auto preprocess_end = std::chrono::steady_clock::now();
    std::cout << "Preprocessing done!\n";
    std::cout << "Time for this step: "
2 changes: 1 addition & 1 deletion src/spring.h
@@ -26,7 +26,7 @@ void compress(const std::string &temp_dir,
                const bool &pairing_only_flag, const bool &no_quality_flag,
                const bool &no_ids_flag,
                const std::vector<std::string> &quality_opts,
-               const bool &long_flag, const bool &gzip_flag);
+               const bool &long_flag, const bool &gzip_flag, const bool &fasta_flag);

  void decompress(const std::string &temp_dir,
                  const std::vector<std::string> &infile_vec,
16 changes: 9 additions & 7 deletions src/util.cpp
@@ -30,22 +30,24 @@ namespace spring {

  uint32_t read_fastq_block(std::istream *fin, std::string *id_array,
                            std::string *read_array, std::string *quality_array,
-                           const uint32_t &num_reads) {
+                           const uint32_t &num_reads, const bool &fasta_flag) {
    uint32_t num_done = 0;
    std::string comment;
    for (; num_done < num_reads; num_done++) {
      if (!std::getline(*fin, id_array[num_done])) break;
      remove_CR_from_end(id_array[num_done]);
      if (!std::getline(*fin, read_array[num_done]))
        throw std::runtime_error(
-           "Invalid FASTQ file. Number of lines not multiple of 4");
+           "Invalid FASTQ(A) file. Number of lines not multiple of 4(2)");
      remove_CR_from_end(read_array[num_done]);
+     if (fasta_flag)
+       continue;
      if (!std::getline(*fin, comment))
        throw std::runtime_error(
-           "Invalid FASTQ file. Number of lines not multiple of 4");
+           "Invalid FASTQ(A) file. Number of lines not multiple of 4(2)");
      if (!std::getline(*fin, quality_array[num_done]))
        throw std::runtime_error(
-           "Invalid FASTQ file. Number of lines not multiple of 4");
+           "Invalid FASTQ(A) file. Number of lines not multiple of 4(2)");
      remove_CR_from_end(quality_array[num_done]);
    }
    return num_done;
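
The hunk above lets one reader handle both formats: every record starts with an identifier line and a sequence line, and the comment and quality lines are consumed only when the input is FASTQ. The following standalone sketch illustrates that idea outside SPRING. `Record`, `strip_cr`, and `read_block` are hypothetical names, the sketch returns a `std::vector` rather than filling caller-owned arrays, and, like the code in the diff, it assumes each sequence occupies exactly one line (multi-line FASTA is not handled).

```cpp
#include <cassert>
#include <cstdint>
#include <istream>
#include <sstream>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical record type; quality stays empty for FASTA input.
struct Record {
  std::string id, read, quality;
};

// Drop a trailing carriage return left over from CRLF line endings.
inline void strip_cr(std::string &s) {
  if (!s.empty() && s.back() == '\r') s.pop_back();
}

// Read up to max_records records: 2 lines each for FASTA, 4 for FASTQ.
inline std::vector<Record> read_block(std::istream &in, uint32_t max_records,
                                      bool fasta) {
  std::vector<Record> out;
  std::string comment;
  while (out.size() < max_records) {
    Record r;
    if (!std::getline(in, r.id)) break;  // clean end of input
    strip_cr(r.id);
    if (!std::getline(in, r.read))
      throw std::runtime_error("record is missing its sequence line");
    strip_cr(r.read);
    if (!fasta) {
      // FASTQ only: skip the '+' comment line, then read the qualities.
      if (!std::getline(in, comment) || !std::getline(in, r.quality))
        throw std::runtime_error("FASTQ record is missing its quality lines");
      strip_cr(r.quality);
    }
    out.push_back(std::move(r));
  }
  return out;
}
```

Calling `read_block` on a `std::istringstream` holding `">r1\nACGT\n"` with `fasta` set to true yields one record with an empty quality string, while the same call on a stream that ends after the identifier line throws.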
@@ -276,14 +278,14 @@ void write_dna_in_bits(const std::string &read, std::ofstream &fout) {
    for (int i = 0; i < readlen / 4; i++) {
      bitarray[pos_in_bitarray] = 0;
      for (int j = 0; j < 4; j++)
-       bitarray[pos_in_bitarray] |= (dna2int[(uint8_t)read[4 * i + j]]<<(2*j));
+       bitarray[pos_in_bitarray] |= (dna2int[(uint8_t)read[4 * i + j]]<<(2*j));
      pos_in_bitarray++;
    }
    if (readlen % 4 != 0) {
      int i = readlen / 4;
      bitarray[pos_in_bitarray] = 0;
      for (int j = 0; j < readlen % 4; j++)
-       bitarray[pos_in_bitarray] |= (dna2int[(uint8_t)read[4 * i + j]]<<(2*j));
+       bitarray[pos_in_bitarray] |= (dna2int[(uint8_t)read[4 * i + j]]<<(2*j));
      pos_in_bitarray++;
    }
    fout.write((char *)&bitarray[0], pos_in_bitarray);
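
The loop in the hunk above packs four bases into each byte, placing base `j` of a group at bit offset `2*j`, with a partial final byte when the read length is not a multiple of 4. A minimal sketch of the same packing scheme, together with its inverse, is shown below. The `dna2int` table is not part of this diff, so the mapping A=0, C=1, G=2, T=3 is an assumption made for illustration, and `base_to_2bit`, `pack_2bit`, and `unpack_2bit` are hypothetical helper names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Assumed 2-bit code per base (illustrative; not taken from SPRING).
inline uint8_t base_to_2bit(char c) {
  switch (c) {
    case 'A': return 0;
    case 'C': return 1;
    case 'G': return 2;
    case 'T': return 3;
    default: throw std::invalid_argument("base outside A/C/G/T");
  }
}

// Pack 4 bases per byte; base k lands at bit offset 2*(k % 4) of byte k/4.
inline std::vector<uint8_t> pack_2bit(const std::string &read) {
  std::vector<uint8_t> out((read.size() + 3) / 4, 0);
  for (std::size_t k = 0; k < read.size(); ++k)
    out[k / 4] |= static_cast<uint8_t>(base_to_2bit(read[k]) << (2 * (k % 4)));
  return out;
}

// Inverse of pack_2bit; the read length must be supplied by the caller
// because a partial final byte does not record how many bases it holds.
inline std::string unpack_2bit(const std::vector<uint8_t> &bytes,
                               std::size_t len) {
  static const char bases[4] = {'A', 'C', 'G', 'T'};
  std::string read(len, 'A');
  for (std::size_t k = 0; k < len; ++k)
    read[k] = bases[(bytes[k / 4] >> (2 * (k % 4))) & 3];
  return read;
}
```

With this assumed mapping, `pack_2bit("ACGT")` produces the single byte `0xE4`, and `unpack_2bit(pack_2bit(s), s.size())` returns `s` for any string of A, C, G, and T.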
@@ -330,7 +332,7 @@ void write_dnaN_in_bits(const std::string &read, std::ofstream &fout) {
    for (int i = 0; i < readlen / 2; i++) {
      bitarray[pos_in_bitarray] = 0;
      for (int j = 0; j < 2; j++)
-       bitarray[pos_in_bitarray] |= (dna2int[(uint8_t)read[2 * i + j]]<<(4*j));
+       bitarray[pos_in_bitarray] |= (dna2int[(uint8_t)read[2 * i + j]]<<(4*j));
      pos_in_bitarray++;
    }
    if (readlen % 2 != 0) {
2 changes: 1 addition & 1 deletion src/util.h
@@ -52,7 +52,7 @@ struct compression_params {

  uint32_t read_fastq_block(std::istream *fin, std::string *id_array,
                            std::string *read_array, std::string *quality_array,
-                           const uint32_t &num_reads);
+                           const uint32_t &num_reads, const bool &fasta_flag);

  void write_fastq_block(std::ofstream &fout, std::string *id_array,
                         std::string *read_array, std::string *quality_array,