Juno-Typing

Typing tools (7-locus MLST and serotyping) for different bacterial genera/species.

Pipeline information

Author(s): Alejandra Hernández Segura, Roxanne Wolthuis, Karim Hajji
Organization: Rijksinstituut voor Volksgezondheid en Milieu (RIVM)
Department: Infektieziekteonderzoek, Diagnostiek en Laboratorium Surveillance (IDS), Bacteriologie (BPD)
Start date: 01 - 03 - 2021
Commissioned by: Maaike van den Beld

About this project

The goal of this pipeline is to perform bacterial typing (7-locus MLST and serotyping). It takes 2 types of files per sample as input:

Two ‘.fastq’ files (paired-end sequencing) derived from short-read sequencing. They should be already filtered and trimmed (for instance, with the Juno-pipeline).
An assembly from the same sample in the form of a single ‘.fasta’ file.

Importantly, the Juno-Typing pipeline works directly on output generated from the Juno-assembly pipeline.

The Juno-Typing pipeline will then perform the following steps:

Choose the appropriate 7-locus MLST scheme and, where applicable, a serotyper. The supported species for the 7-locus MLST can be found in the database generated by the Center for Genomic Epidemiology from the Technical University of Denmark.
7-locus MLST by using the MLST tool.
If appropriate for the genus/species, the samples will be serotyped. The currently supported species are:
- Salmonella serotyper by using the SeqSero2 tool.
- E. coli serotyper by using the SerotypeFinder tool.
- S. pneumoniae serotyper by using the Seroba tool.
- Shigella serotyper by using the ShigaTyper tool.
- Neisseria serotyper by using the Capsule Characterization Neisseria tool.
Predict rRNA sequences using Barrnap and extract 16S sequences.
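
The choice of serotyper described in the steps above can be pictured as a simple lookup. The following is a minimal sketch, not the pipeline's own code: the table `SEROTYPERS` and the function `choose_serotyper` are hypothetical names, and treating Salmonella, Shigella and Neisseria as genus-wide entries is an assumption based on how the list above is worded.

```python
from typing import Optional

# Genus/species -> serotyping tool, transcribed from the list above.
# Keys are lower case; a species of None means "any species of this genus"
# (an assumption, since the list names only the genus for those entries).
SEROTYPERS = {
    ("salmonella", None): "SeqSero2",
    ("escherichia", "coli"): "SerotypeFinder",
    ("streptococcus", "pneumoniae"): "Seroba",
    ("shigella", None): "ShigaTyper",
    ("neisseria", None): "Capsule Characterization Neisseria tool",
}


def choose_serotyper(genus: str, species: str) -> Optional[str]:
    """Return the serotyper for a genus/species pair, or None if unsupported."""
    genus, species = genus.strip().lower(), species.strip().lower()
    # Try the exact genus/species pair first, then a genus-wide entry.
    return SEROTYPERS.get((genus, species)) or SEROTYPERS.get((genus, None))


print(choose_serotyper("Salmonella", "enterica"))
print(choose_serotyper("Escherichia", "coli"))
print(choose_serotyper("Klebsiella", "pneumoniae"))
```

A sample whose genus/species has no entry simply gets no serotyper, which matches the "if appropriate for the genus/species" wording above.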

Prerequisites

Linux + conda: a Linux-like environment with at least 'miniconda' installed.
Python 3.7.6.

Installation

Clone the repository:

git clone https://github.com/RIVM-bioinformatics/juno-typing.git

Alternatively, you can download it manually as a zip file (you will need to unzip it then).

Enter the directory with the pipeline and install the master environment:

cd juno-typing
conda env create -f envs/juno_typing.yaml

Parameters & Usage

Command for help

-h, --help Shows the help of the pipeline

Required parameters

-i, --input Directory with the input (fasta) files. The fasta files should be all in this directory (no subdirectories) and have the extension '.fasta'.

Optional parameters

-s --species Species for ALL samples (it assumes all of them are the same species). It should be two words (e.g. Salmonella enterica or Escherichia coli). If a metadata file is also provided, the -s argument takes precedence and is used instead. The species is used to choose the 7-locus MLST scheme and the appropriate serotyper (if any).
-m --metadata Relative or absolute path to a csv file with at least three columns: 'sample' (the file name without the [_S##]_R1.fastq.gz suffix), 'genus' and 'species'. The column names must be entirely in lower case, without any capital letters. If no metadata file is given and the input directory contains the file '<input_dir>/identify_species/top1_species_multireport.csv' (as produced by the Juno-assembly pipeline), that file is used as metadata. If a species is provided with -s, it overrides the metadata when choosing the MLST scheme and the serotyper. Example metadata file:

sample	genus	species
sample1	salmonella	enterica

Note: The fastq files corresponding to this sample would probably be named something like sample1_S1_R1_0001.fastq.gz and sample1_S1_R2_0001.fastq.gz, and the fasta file sample1.fasta. Also note that the column titles of the metadata.csv file are all in lower case.
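
To make the naming rule and the column requirements concrete, here is a minimal sketch of how a metadata file could be checked before starting a run. It is not part of the pipeline: `sample_name_from_fastq` and `read_metadata` are hypothetical helpers, the regular expression is one possible reading of the "[_S##]_R1.fastq.gz" rule, and the file is assumed to be comma-delimited.

```python
import csv
import re
import tempfile
from pathlib import Path

REQUIRED_COLUMNS = {"sample", "genus", "species"}

# Optional "_S<number>", then "_R1" or "_R2", an optional numeric suffix
# and finally ".fastq.gz" (an assumed interpretation of the naming rule).
FASTQ_SUFFIX = re.compile(r"(_S\d+)?_R[12](_\d+)?\.fastq\.gz$")


def sample_name_from_fastq(filename):
    """Strip the read-specific suffix from a fastq file name."""
    return FASTQ_SUFFIX.sub("", Path(filename).name)


def read_metadata(path):
    """Read the metadata file into {sample: (genus, species)}, checking its columns."""
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle)
        missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
        if missing:
            raise ValueError(f"metadata file lacks column(s): {sorted(missing)}")
        return {row["sample"]: (row["genus"], row["species"]) for row in reader}


# Demonstration with a temporary file holding the example row from above.
with tempfile.TemporaryDirectory() as tmp:
    metadata_path = Path(tmp) / "metadata.csv"
    metadata_path.write_text("sample,genus,species\nsample1,salmonella,enterica\n")
    metadata = read_metadata(str(metadata_path))
    name = sample_name_from_fastq("sample1_S1_R1_0001.fastq.gz")
    print(name, metadata[name])
```

Because the column check is case-sensitive, a header such as 'Sample,Genus,Species' is rejected, which mirrors the lower-case requirement stated above.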

-o --output Directory where the output of the pipeline will be collected (it will be created if it does not exist). The default behavior is to create a folder called 'output' within the pipeline directory.
-d --db_dir Directory where the databases used by this pipeline will be downloaded, or where they are expected to be present (it will be created if it does not exist). Default is '/mnt/db/juno/typing_db' (internal RIVM path to the databases of the Juno pipelines). It is advisable to provide your own path if you are not working inside the RIVM Linux environment.
--serotypefinder_mincov Minimum coverage (ranging from 0-1) used by SerotypeFinder to identify the appropriate alleles. Default is 0.6.
--serotypefinder_identity Identity threshold to be used for identifying alleles by SerotypeFinder (ranging from 0-1). Default is 0.85.
--seroba_mincov Minimum coverage (ranging from 0-100) used by Seroba to identify the appropriate alleles. Default is 20.
--seroba_kmersize K-mer size to be used for building the Seroba database. If you have already downloaded the Seroba database and built it with a different k-mer size, you have to either delete it first or use the --update flag together with this option. Default is 71.
-c --cores Maximum number of cores to be used to run the pipeline. Defaults to 300 (it assumes you work in an HPC cluster).
-l --local If this flag is present, the pipeline will be run locally (not attempting to send the jobs to a cluster). Keep in mind that if you use this flag, you also need to adjust the number of cores (for instance, to 2) to avoid crashes. The default is to assume that you are working on a cluster, because the pipeline was developed in an environment where that is the case.
-q --queue If you are running the pipeline in a cluster, you need to provide the name of the queue. It defaults to 'bio' (default queue at the RIVM).
-n --dryrun, -u --unlock and --rerunincomplete are all parameters passed to Snakemake. If you want the explanation of these parameters, please refer to the Snakemake documentation.
--update If this flag is present, the databases will be re-downloaded even if they are present already.
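
The threshold options above use two different scales (0-1 for SerotypeFinder, 0-100 for Seroba), which is easy to mix up. The sketch below shows one way such range checks could be expressed with `argparse`. It is illustrative only and makes no claim about how juno_typing.py validates its arguments: `fraction`, `percentage` and `build_parser` are hypothetical names, and only the option names, ranges and defaults are taken from the list above.

```python
import argparse


def fraction(value):
    """argparse type: a number between 0 and 1 (inclusive)."""
    number = float(value)
    if not 0.0 <= number <= 1.0:
        raise argparse.ArgumentTypeError(f"{value} is outside the range 0-1")
    return number


def percentage(value):
    """argparse type: a number between 0 and 100 (inclusive)."""
    number = float(value)
    if not 0.0 <= number <= 100.0:
        raise argparse.ArgumentTypeError(f"{value} is outside the range 0-100")
    return number


def build_parser():
    """Parser covering only the threshold options, with the documented defaults."""
    parser = argparse.ArgumentParser(description="Threshold options (illustrative).")
    parser.add_argument("--serotypefinder_mincov", type=fraction, default=0.6)
    parser.add_argument("--serotypefinder_identity", type=fraction, default=0.85)
    parser.add_argument("--seroba_mincov", type=percentage, default=20)
    parser.add_argument("--seroba_kmersize", type=int, default=71)
    return parser


args = build_parser().parse_args(["--serotypefinder_mincov", "0.8"])
print(args.serotypefinder_mincov, args.seroba_mincov)
```

With checks like these, passing 60 to --serotypefinder_mincov (meant as 0.6) is rejected up front instead of silently producing no hits.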

The base command to run this program.

python juno_typing.py -i [dir/to/input_directory]

An example of how to run the pipeline.

python juno_typing.py -i my_input_files -o my_results --db_dir my_db_dir --metadata path/to/my/metadata.csv --local --cores 2
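
When the pipeline is launched from another script, the same command can be assembled programmatically. The following sketch uses only options documented above; `build_command` is a hypothetical helper, and calling the interpreter as `python` from inside the pipeline directory is an assumption.

```python
import shlex


def build_command(input_dir, output_dir=None, db_dir=None, metadata=None,
                  local=False, cores=None, dryrun=False):
    """Assemble the argument list for juno_typing.py from the documented options."""
    command = ["python", "juno_typing.py", "-i", input_dir]
    if output_dir:
        command += ["-o", output_dir]
    if db_dir:
        command += ["--db_dir", db_dir]
    if metadata:
        command += ["--metadata", metadata]
    if local:
        command.append("--local")
    if cores is not None:
        command += ["--cores", str(cores)]
    if dryrun:
        command.append("--dryrun")
    return command


command = build_command("my_input_files", output_dir="my_results",
                        db_dir="my_db_dir", metadata="path/to/my/metadata.csv",
                        local=True, cores=2)
# Print the command in a form that can be pasted into a shell.
print(" ".join(shlex.quote(part) for part in command))
# To actually launch it: subprocess.run(command, check=True)
```

Setting `dryrun=True` appends --dryrun, which is passed on to Snakemake and is a convenient way to check a configuration before a real run.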

Explanation of the output

log: Log files with output and error files from each Snakemake rule/step that is performed.
audit_trail: Information about the versions of software and databases used.
output per sample: The pipeline will create one subfolder per step performed (identify_species, mlst7, serotype, 16s). Each of these subfolders will in turn contain one subfolder per sample. To understand the output, please refer to the manuals of each individual tool. Inside the serotype folder, a .csv file is generated for each serotyper that has run, summarizing the results of all the samples (serotype_multireport.csv, serotype_multireport1.csv, serotype_multireport2.csv, serotype_multireport3.csv).
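
The summary files can be gathered after a run with a few lines of Python. This is a minimal sketch under stated assumptions: the reports are taken to sit directly inside '<output_dir>/serotype' and to be comma-delimited, `find_serotype_reports` and `read_report` are hypothetical helpers, and the columns in the demonstration file are invented because the real ones depend on the serotyper.

```python
import csv
import tempfile
from pathlib import Path


def find_serotype_reports(output_dir):
    """Return the serotype summary files in <output_dir>/serotype, sorted by name."""
    return sorted(Path(output_dir, "serotype").glob("serotype_multireport*.csv"))


def read_report(path):
    """Load one summary file as a list of row dictionaries."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle))


# Demonstration on a temporary directory that mimics the output layout.
with tempfile.TemporaryDirectory() as tmp:
    serotype_dir = Path(tmp, "serotype")
    serotype_dir.mkdir()
    # Hypothetical content: the real column names depend on the serotyper.
    (serotype_dir / "serotype_multireport.csv").write_text(
        "sample,result\nsample1,example\n"
    )
    for report in find_serotype_reports(tmp):
        print(report.name, len(read_report(report)), "row(s)")
```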

Issues

All default values have been chosen to work with the RIVM Linux environment. Therefore, they might not be applicable to other environments (although the pipeline should work if the appropriate arguments/parameters are given).
Any issue can be reported in the Issues section of this repository.

Future ideas for this pipeline

Do 16S typing

License

This pipeline is licensed under the AGPL3 license. Detailed information can be found inside the 'LICENSE' file in this repository.

Contact

Contact person: Roxanne Wolthuis
Email roxanne.wolthuis@rivm.nl

For 16S extraction related questions:

Contact person: Karim Hajji
Email karim.hajji@rivm.nl
