ngsComposer is an automated pipeline for demultiplexing and quality-filtering Next-Generation Sequencing (NGS) reads.
For questions, bugs, and suggestions, please contact: bolukolu@utk.edu.
Developers: Ryan G. Kuster (UTK, TN) and Bode A. Olukolu (UTK, TN)
- Full start-to-finish pipeline for various NGS library types
- Few dependencies (Python3 and R)
- Easy to learn and designed for biologists
- Supports variable length barcodes and dual-indexing.
- Trims buffer sequences and quality filters on a read-by-read basis
- Accepts project directory of multiple libraries
Currently, ngsComposer is only available for unix-based systems (i.e. macOS and linux).
Clone or download the Git repository to your desired folder
git clone https://github.com/bodeolukolu/ngsComposer.git
Dependencies:
- Python3 version 3.5 or above
- R, R-ggplot2
- pigz (not required, but recommended for parallel gzip and gunzip i.e. faster)
For help troubleshooting installation, see the troubleshooting section
Set up your project directory containing the following:
- A folder named "samples", which contains fastq file(s). Multiple libraries or demultiplexed fastq files can be included.
- A file named "barcodes_lib1.txt", which contains barcodes and associated sample IDs. For additional libraries with different sample IDs specifiy "barcodes_lib2.txt", "barcodes_lib3.txt", ... "lib1_R1" and "lib1_R2" are variables in "config.sh" for R1 and R2 fastq file names, respectively, and correspond to "barcodes_lib1.txt". For barcodes/barcode pairs that is not assigned to any sample, indicate the sample id as NA or na.
- 2 files containing adapter sequences of R1/P5/forward (<adapters.R1.txt>) and R2/P7/reverse (<adapters.R2.txt>) reads.
- config.sh (see "Configuration" below for detailed instructions on creating this file) .
From command line, run ngsComposer as shown below:
$ bash <path_to_ngsComposer_directory>/ngsComposer <path_to_project_directory>
If this is the first time running the pipeline, you may need to wait for R to install the appropriate packages and dependencies.
Several example datasets are included in the "examples" directory. Users are encouraged to examine and run these small projects to assist in understanding pipeline functionality.
The order of steps in the ngsComposer pipeline are outlined in the following figure:
The steps implemented are first specified in a configuration file.
Using a text editor, save a file containing any of the following variables as a bash script called 'config.sh' and include it in your project directory.
General parameters
Variable | Default | Usage | Input | required/Optional |
---|---|---|---|---|
threads | total-2 | choose maximum number of subprocesses that can run simultaneously | integer | optional |
walkaway | True | run from beginning to end without pausing at qc steps | True or False | optional |
cluster | False | run on compute cluster node (default: slurm) or workstation | True or False | optional |
samples_alt_dir | False | input files stored in different different from project directory | True or False | optional |
rm_transit | True | remove each transitional file folder to save space | True or False | optional |
Input files
Variable | Default | Usage | Input | required/Optional |
---|---|---|---|---|
lib1_R1 | na | input fastq file name for R1/P5/forward reads | string | required |
lib1_R2 | na | input fastq file name for R2/P7/reverse reads | string | optional |
lib1_bc | na | name of file containing barcodes | string | required |
lib2_R1 | na | additional input fastq file name for R1/P5/forward reads | string | required |
lib2_R2 | na | additional input fastq file name for R2/P7/reverse reads | string | optional |
lib2_bc | na | additional name of file containing barcodes | string | required |
Tool Parameters
Variable | Default | Usage | Input | required/Optional |
---|---|---|---|---|
front_trim | 0 | number of bases in buffer sequence to trim | integer | optional |
mismatch | 1 | number of mismatches (hamming distance) allowed in barcodes | integer | optional |
R1_motif | na | motif filtering for R1/P5/forward reads | string or list of comma-separated strings | optional |
R2_motif | na | motif filtering for R2/P7/reverse reads | string or list of comma-separated strings | optional |
end_score | 20 | end-trim once entire window >= this Q score | integer | optional |
window | 10 | size of window to test for >= end_trim | integer | optional |
min_len | 64 | minimum read length to retain after end-trimming and adapter removal | integer | optional |
adapter_match | 12 | number of base matches to identify adapters | integer | optional |
q_min | 20 | Q score minimum (Phred value 0-40) applied to q_percent variable | integer | optional |
q_percent | 80 | percentage of basses in read >= q_min Q scores | integer | optional |
Visualizations
Variable | Default | Usage | Input | required/Optional |
---|---|---|---|---|
QC_demultiplexed | na | produce Quality score plot | summary and/or full | optional |
QC_motif_validated | na | produce Quality score plot | summary and/or full | optional |
QC_end_trimmed | na | produce Quality score plot | summary and/or full | optional |
QC_adapter_removed | na | produce Quality score plot | summary and/or full | optional |
QC_final | na | produce Quality score plot | summary and/or full | optional |
**Note: na indicates that variable is user-defined. Analytical will be skipped if set to na.
An example configuration file may look like this:
config.sh
#General_parameters
###################################################
threads=24
walkaway=True
cluster=True
samples_alt_dir=False
rm_transit=True
#Input_files
###################################################
lib1_R1=test1_R1.fastq.gz
lib1_R2=test1_R2.fastq.gz
lib1_bc=barcodes_lib1.txt
lib2_R1=test2_R1.fastq.gz
lib2_R2=test2_R2.fastq.gz
lib2_bc=barcodes_lib2.txt
#Tool_parameters
###################################################
front_trim=6
mismatch=1
R1_motif=TGCATA,TGCATC,TGCATT
R2_motif=CATG
end_score=20
window=10
min_len=64
adapter_match=12
trim_homopolymer=10
q_min=20
q_percent=80
#Visualizations
###################################################
QC_demultiplexed=summary,full
QC_motif_validated=summary,full
QC_end_trimmed=summary,full
QC_adapter_removed=summary,full
QC_final=summary,full
In the above example, the maximum number of subprocesses spawned will be 24 (threads = 24). The pipeline will pause after relevant steps (walkaway = False) so users can view qc plots and have the option of modifying or bypassing the step. To save disk space, transitional directories will be removed (rm_transit = True) and only the final filtered data and any qc stats created in the pipeline will remain. Regardless of if walkaway if True or False, pipeline will ask if initial_qc should be generated.
A buffer sequence of length 6 (front_trim = 6) will be trimmed before demultiplexing, which will allow mismatch at a hamming distance of 1 (mismatch=1). For variable length barcodes, the same number of proximal bases (based on the minimum barcode length) are used for demultiplexing, while the additional distal bases in barcodes are trimmed off.
In this case, samples were double-digested with AluI and HaeIII and A-tailed before adapter ligation (R1_motif=TCC,TCT and R2_motif=TCC,TCT). Only reads containing these motifs will pass to subsequent steps. For A-tailed libraries, an A can be appended to the R1_motif and R2_motif strings.
Automatic end-trimming will be performed based on Q score. Here, groups of bases are considered within a moving window of 10 bases at a time (window=10) until that window consists only of the desired Q score at or above 20 (end_score=20). It is at this point that the read is trimmed. Reads that are less than 64 bp will be discarded (min_len=64)
Only reads that have a Q score of 20 (q_min=20) acrosss at least 95 percent of the read (q_percent=80) will pass to subsequent steps. If a R1 read or an R2 read passes while its partner fails, it will be placed into a single-end read subfolder and the failing read will be discarded.
Alternatively, a configuration file may only need to include necessary components for a run:
conf.py
#Input_files
###################################################
lib1_R1=test1_R1.fastq.gz
lib1_bc=barcodes_lib1.txt
*Since most of the parameters are hard-coded in an intuitive manner, by specifying only the fastq file name (at least single-end data) and associated barcode (only required for demultiplexing), the pipelines determines the other parameters as some stated below:
- threads: computes available number of cores (n) and uses n-2 threads
- defaults: walkaway=True, cluster=False, samples_alt_dir=False, rm_transit=True, front_trim=0, mismatch=1, no motif filtering, end_score=20, min_len=64, adapter_match=12, q_min=20, q_percent=80, only summary final QC, and initial QC will be determined based on a prompt before submitting job.
Optionally, one or more barcode files may be included in the project directory for demultiplexing. The following files are required at minimum:
- barcodes_1.txt
- index.txt
Naming conventions: "index.txt" is required, the barcodes file can be named as desired (see "Index file for directing multiple barcode files")
The barcodes file is a tab or space delimited file with no spaces in sample names (or, copy directly from your favorite spreadsheet program into a text file). Forward barcodes begin each row and reverse barcodes begin each column with the desired sample names indicated in the interior of the matrix. For example, the following would be required for a dual-indexed library:
barcodes_1.txt
A C G T
A sample1 sample5 sample6 sample10
C sample2 sample5 sample7 sample10
G sample3 sample5 sample8 sample10
T sample4 sample5 sample9 sample10
Note that in the example above the reverse barcode "C" corresponds with multiple identical sample names (sample5). While not common practice, ngsComposer accomodates repeated sample names and concatenates accordingly.
If reverse barcodes do not require demultiplexing, the barcode file can be set up as follows with "NA" or any other text used as a header in the first row:
barcodes_1.txt
NA
A sample1
C sample2
G sample3
T sample4
Optionally, 'adapters.R2.txt' and 'adapters.R1.txt' may be included in the project directory for recognition and removal of adapters. The 'adapters.R2.txt' file contains the adapters expected to appear in the R1 reads. Adapter sequences should be newline-separated and be in 5' to 3' orientation. If libraries are barcoded, users are encouraged to provide adapter sequences that contain the corresponding barcodes expected in the opposing end of the read's adapter.
adapters.R2.txt
GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTCGCTCAGTTC
GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTTATCTGACCT
GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTATATGAGACG
GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTCTTATGGAAT
GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTTAATCTCGTC
GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTGCGCGATGTT
GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTAGAGCACTAG
GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTTGCCTTGATC
GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTCTACTCAGTC
GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTTCGTCTGACT
GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTGAACATACGG
GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTCCTATGACTC
GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTTAATGGCAAG
GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTGTGCCGCTTC
GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTCGGCAATGGA
GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTGCCGTAACCG
GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCTAACCATTCTC
Each of the above sample adapters is presented in 5' to 3' orientation and shares a common 'GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT' adapter sequence followed by expected barcodes. Adapter sequences may also include restriction motifs for greater detection, but these sequences will also be removed. Porifera.py creates all reverse-complements before detection.
When paired end data is used, as above, 'adapters.R1.txt' and 'adapters.R2.txt' must be provided. Adapters are tested for the inclusion of barcodes and only those combinations of R1/R2 barcodes leading to a given sample will be used to search for adapters quickly and with a lower false positive rate.
All tools available in the ngsComposer pipeline can be called individually from the command line. Please see the ngsComposer Standalone Tools page for usage.
- GBSapp: automated pipeline for variant calling and filtering.
- Qmatey: Quantitative Metagenomic Alignment and Taxonomic Exact-matching.
- ngsComposer: an automated pipeline for empirically based NGS data quality filtering. Kuster et al. 2021
This package has been developed as part of the Genomic Tools for Sweetpotato Improvement project (GT4SP) and SweetGAINS, both funded by Bill & Melinda Gates Foundation.
To view Python version, from the terminal type:
$ python3 --version
If python3 is not found, you can try one of the python3 releases from the Python Software Foundation downloads page.
Alternatively, a package manager is an easy way to install Python from the terminal. For Ubuntu, Python can be installed directly using apt (replace 'X' with an existing version in the apt repository):
$ sudo apt-get update
$ sudo apt-get install python3.X
...or with homebrew on macOS using:
brew install python3
After installation please check that the newest version is present in your current environment (i.e.; $PATH).
To view R version, from the terminal type:
$ R --version
To install the newest version of R, see the releases available at the Comprehensive R Archive Network downloads page.
For Ubuntu, R can be installed directly using apt:
$ sudo apt update
$ sudo apt install r-base
...or with homebrew on macOS using:
brew install r
Notes on ggplot2 installation:
ngsComposer requires the R package ggplot2 and its dependencies. ngsComposer will attempt to automatically download these packages to the local ngsComposer repo (/ngsComposer/tools/helpers/R_packages).
The installation of ggplot2 and dependencies may take some time during the first use. If package installation fails, manual installation within R may be necessary. It may be beneficial to install R packages as root and if using macOS, ensure Xcode toolkit is up to date.
Versioning will follow major.minor.patch semantic versioning format.