This repository accompanies the study "Sequence analysis and decoding with extra low-quality reads for DNA data storage", which was submitted to Bioinformatics in 2024.
Here, we provide the source code and sequencing data.
(The current version was revised on Nov. 5, 2024.)
We use pass filter (PF) reads and non-pass filter (NPF) reads of Illumina NGS sequencing.
- PF: pass the chastity filter with an identified index pattern
- NPF: fail to pass the filter
NPF reads are not provided as FASTQ files in Illumina NGS sequencing.
Therefore, we obtained the raw sequencing data from the Illumina sequencer and performed base-calling on the NPF reads ourselves.
The detailed Illumina sequencing settings are described in supplementary.docx and the sequencing cycle is 151-6-151 (R1-index-R2).
Based on MiSeq configurations, we obtained the following raw sequencing data: cif, filter, and locs files.
- *.cif (./dataset/raw/cif/): contains RTA image analysis results for one cycle and one tile.
- *.filter (./dataset/raw/filter/): contains chastity filter results for one tile.
- *.locs (./dataset/raw/locs/): contains cluster coordinates for one file.
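As an illustration of how the chastity filter results separate PF from NPF clusters, the following minimal sketch reads one *.filter file. It is not part of this repository. It assumes the version 3 filter layout (a 4-byte zero field, a 4-byte version, a 4-byte cluster count, then one byte per cluster whose lowest bit is the pass flag); check this against your own files before relying on it. The function name read_filter_file is hypothetical.

```python
import struct


def read_filter_file(path):
    """Return one boolean per cluster (True = passed the chastity filter).

    Assumes the version 3 layout: three little-endian uint32 header fields
    (zero, version, cluster count) followed by one byte per cluster.
    """
    with open(path, "rb") as fh:
        header = fh.read(12)
        if len(header) != 12:
            raise ValueError("file too short for a filter header")
        _zero, _version, num_clusters = struct.unpack("<III", header)
        flags = fh.read(num_clusters)
    if len(flags) != num_clusters:
        raise ValueError("truncated filter file")
    # Bit 0 of each byte is the pass/fail flag.
    return [bool(b & 1) for b in flags]
```

Clusters flagged False correspond to NPF reads; clusters flagged True passed the filter but may still carry an invalid index.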
We conducted base-calling to generate FASTQ files from cif data using AYB with default options.
Since the raw data includes not only PF and NPF reads but also reads with an invalid index, we classified the reads using the FASTQ files produced by Illumina sequencing.
The detailed method is described in the README in "./dataset/".
- AYB-basecalled FASTQ (./dataset/AYB_fastq/)
- Illumina-basecalled FASTQ (./dataset/Illumina_fastq/)
We also provide a test set (FASTQ files including PF and NPF reads) for trying our method.
- testset (./dataset/)
- Python (3.7+)
- Matlab (with Communications Toolbox)
- C (gcc 7.5.0+)
- Edit distance-based clustering: Starcode (to be located in ./src/utils/starcode/)
- Sequence alignment: MUSCLE (version 5.0.1428) (to be located in ./src/utils/MUSCLE/)
- Paired-end read merging: PEAR (version 0.9.11) (to be located in ./src/utils/PEAR/)
- <seed_num>: Base seed of random generator (unsigned int)
- <sample_num>: Random sampling number (unsigned int)
- <trial_num>: Decoding trial index (unsigned int)
- <use_NPF>: 0 - use only PF reads, 1 - use PF + NPF reads (0 or 1)
- <len_org>: Original length of an oligo sequence (unsigned int)
- <tau_e>: Edit distance threshold of Starcode (unsigned int)
- <tau_adj>: Edit distance threshold of tailored edit distance-based clustering (unsigned int)
- <tau_sub>: Substitution threshold of tailored edit distance-based clustering (unsigned int)
- <tau_del>: Deletion threshold of tailored edit distance-based clustering (unsigned int)
- <tau_ins>: Insertion threshold of tailored edit distance-based clustering (unsigned int)
- <len_min>: Minimum length of AL reads (unsigned int)
- <len_max>: Maximum length of AL reads (unsigned int)
bash sampling.sh <seed_num> <sample_num> <trial_num>
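The role of the three sampling arguments can be pictured with a small sketch of seeded random sampling. This is not the code behind sampling.sh. In particular, how the script combines <seed_num> and <trial_num> is not stated here, so the sketch simply assumes that each trial uses the seed seed_num + trial_num. The function name sample_reads is hypothetical.

```python
import random


def sample_reads(records, seed_num, sample_num, trial_num):
    """Draw sample_num reads without replacement, reproducibly per trial."""
    if sample_num > len(records):
        raise ValueError("sample_num exceeds the number of available reads")
    # Assumption: one independent generator per trial, seeded by seed + trial.
    rng = random.Random(seed_num + trial_num)
    return rng.sample(records, sample_num)
```

Calling the function twice with the same arguments returns the same reads, which is the property a fixed base seed is meant to provide.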
We implemented Erlich's method.
You can run it with bash erlich.sh and the following options.
bash erlich.sh <seed_num> <sample_num> <trial_num> <use_NPF> <len_org>
You can also run our proposed method with bash prop.sh and the options below.
bash prop.sh <seed_num> <sample_num> <trial_num> <use_NPF> <tau_e> <tau_sub> <tau_del> <tau_ins> <tau_adj> <len_org> <len_min> <len_max>
To set only the edit distance threshold <tau_adj> for tailored edit distance-based clustering, run prop.sh with 0 for <tau_sub>, <tau_del>, and <tau_ins>, as below.
bash prop.sh <seed_num> <sample_num> <trial_num> <use_NPF> <tau_e> 0 0 0 <tau_adj> <len_org> <len_min> <len_max>
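Because prop.sh takes twelve positional arguments, it is easy to swap two of them. The sketch below is a hypothetical helper, not part of this repository. It builds the argument list in the order shown above and checks that every value is a non-negative integer and that use_NPF is 0 or 1.

```python
# Positional order of the prop.sh arguments, as listed in this README.
PROP_ARG_ORDER = [
    "seed_num", "sample_num", "trial_num", "use_NPF",
    "tau_e", "tau_sub", "tau_del", "tau_ins", "tau_adj",
    "len_org", "len_min", "len_max",
]


def build_prop_command(**params):
    """Return the argv list for prop.sh from keyword parameters."""
    missing = [name for name in PROP_ARG_ORDER if name not in params]
    unknown = [name for name in params if name not in PROP_ARG_ORDER]
    if missing or unknown:
        raise ValueError("missing: %s, unknown: %s" % (missing, unknown))
    for name in PROP_ARG_ORDER:
        value = params[name]
        # All options are documented as unsigned integers.
        if isinstance(value, bool) or not isinstance(value, int) or value < 0:
            raise ValueError("%s must be a non-negative integer" % name)
    if params["use_NPF"] not in (0, 1):
        raise ValueError("use_NPF must be 0 or 1")
    return ["bash", "prop.sh"] + [str(params[n]) for n in PROP_ARG_ORDER]
```

The returned list can be passed to subprocess.run from the directory that contains prop.sh. Setting tau_sub, tau_del, and tau_ins to 0 reproduces the second form of the command above.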
E-mail: wldus8677@gmail.com
Homepage: CICL