- This GitHub repository contains scripts developed within the scope of the ANR ASTER project. It aims to determine whether long-read sequencing with Oxford Nanopore Technologies (ONT) is suitable in the context of targeted metagenomics (16S amplicons).
It is composed of 4 Python scripts (`0-solve_SILVA.py`, `1-assign_pipeline.py`, `2-prim_analysis.py`, `3-second_analysis.py`) and 2 R scripts (`make_stackbar.R`, `make_radar_plot.R`). All 6 are in the base directory, and the Python modules are stored in the `src/` directory.
- Usage help can be displayed for most Python scripts with the `-h | --help` option (except for `0-solve_SILVA.py`). Example:

```
Taxonomic determination statistics computation

Usage: 2-prim_analysis.py (-i <inFile>) (-c <confFile>) (-l <taxoCut>) [-w <writeMap>] [-o <outDir>]

Options:
  -h --help                 help
  --version                 version of the script
  -i --inFile=input_file    input file
  -c --confFile=conf_file   path to the configuration file
  -l --taxoCut=taxo_cutoff  cutoff for the taxonomic level
  -w --writeMap=write_map   flag to either write (reads-vs-OTUs) mapping file [default: F]
  -o --outDir=out_dir       /path/to/output_csv [default: ./]
```
- The Python (3.6.0) packages required are:
  - matplotlib 2.2.3 or higher
  - numpy 1.15.2 or higher
  - biopython 1.72 or higher
  - docopt 0.6.2 or higher
  - pandas 0.22.0 or higher
  - sklearn 0.17 or higher
  - pysam 0.15.2 or higher

They can all be installed at once using the supplied `requirements.txt` file with pip:

```
pip3 install -r requirements.txt
```

(or simply `pip install -r requirements.txt`, if your `pip` runs Python 3)
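For reference, a `requirements.txt` matching the versions above would look roughly like this (a sketch based on the list above, not necessarily the shipped file verbatim; note that `sklearn` is distributed on PyPI as `scikit-learn`):

```
matplotlib>=2.2.3
numpy>=1.15.2
biopython>=1.72
docopt>=0.6.2
pandas>=0.22.0
scikit-learn>=0.17
pysam>=0.15.2
```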
- The R libraries required are:
  - ggplot2
  - dplyr
  - reshape2
  - fmsb
Once Centrifuge is properly installed, you can build a custom Centrifuge index with the following commands:
- Download the NCBI "taxonomy" database dump files (`names.dmp` and `nodes.dmp`):

```
centrifuge-download -v -o ./ taxonomy
```

(these can be reused for all Centrifuge index builds)

- Generate a proper `seqid2taxid` file (mapping sequence headers to their taxids), making sure that the taxids present are consistent with the downloaded NCBI taxonomy (format sketched below)
- Build the custom index using both dump files and the `seqid2taxid` file:

```
centrifuge-build --conversion-table seqid2taxid --taxonomy-tree nodes.dmp --name-table names.dmp ref_database.fa outPrefix
```
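For reference, the file passed to `--conversion-table` is a plain tab-separated mapping, one sequence per line: first the sequence header exactly as it appears in `ref_database.fa`, then the NCBI taxid (the headers below are placeholders):

```
my_16S_seq_1	1423
my_16S_seq_2	287
```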
Use of the taxonomic module:
Most scripts use the `ncbi_taxdump_utils.py` Python module, which needs 2 dump files to be initialized (`names.dmp` and `nodes.dmp`). These 2 files (which can be retrieved from the NCBI FTP site) need to be put in a `dump_files/` folder at the root directory (or in the same folder from which you run one of the scripts). The hard-coded line for this parameter is at line 19 of the `src/taxo_eval.py` module.
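A minimal sketch of that initialization, assuming the bundled module matches the publicly available `ncbi_taxdump_utils` (a `NCBI_TaxonomyFoo` class with `load_nodes_dmp`/`load_names_dmp` loaders); method names may differ in this repository's copy, so adjust to the actual API in `src/`:

```python
# Minimal sketch, assuming the common ncbi_taxdump_utils API.
from src.ncbi_taxdump_utils import NCBI_TaxonomyFoo

taxfoo = NCBI_TaxonomyFoo()
taxfoo.load_nodes_dmp("dump_files/nodes.dmp")  # taxid -> parent taxid and rank
taxfoo.load_names_dmp("dump_files/names.dmp")  # taxid -> scientific name

# e.g. query the lineage of Bacillus subtilis (taxid 1423)
print(taxfoo.get_lineage(1423))
```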
Modification of the `pipeline.conf` file:

Most scripts need the location of certain files (for example, the Centrifuge index) to work. All these hard-written paths must be put into the `pipeline.conf` configuration file (CSV format). This file has to be adapted to your own locations before running any Python script. The generic patterns to specify a path are detailed within this `.conf` file.
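As a purely hypothetical illustration (the real keys and path patterns are documented inside the shipped `pipeline.conf`), entries boil down to `name,path` CSV lines such as:

```
Centrifuge_index,/path/to/centrifuge_idx/outPrefix
seqid2taxid,/path/to/seqid2taxid
```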
- The `0-solve_SILVA.py` script is dedicated to solving all issues linked to taxonomy. It therefore contains several functions, which have to be called by modifying the 'main' directly.
- For example, to run the function producing information about a given database:

```python
# MAIN:
if __name__ == "__main__":
    stats_base('SILVA')
```

And then: `./0-solve_SILVA.py` (if you are in the root directory of the repository)
This script proceeds to taxonomic determination with either Centrifuge or Minimap2 against different possible databases (ZYMO, RRN, SILVA, NCBI_16S, etc.). It aims to simplify the nomenclature of the output file (SAM for Minimap2 or CSV for Centrifuge), to make the output file easily manipulated by the `2-prim_analysis.py` script.
List of features added within the name:

- The database used, for example: 'your_fav_name*_toRrn*.*' (with the RRN database). The name given for the database must not contain any '_'; ideally, use a 'camelCase' style, something like 'pCompressed'.
- If Centrifuge is used: '*centri_*your_fav_name_toRrn.csv'
Each taxonomic determination produces at least a file containing assignations (.SAM or .CSV) and a .log file. An output directory can be specified with the `-o | --outDir` option. Example of command:

```
./1-assign_pipeline.py -i my_fav_file.fq -c pipeline.conf -P minimap2 -t 20 -d silva -o ../3-deter_minimap2_second/
```
- This script can currently handle 4 different input files:
  - SAM file produced by Minimap2 (SAM files produced with another program are not guaranteed to be handled)
  - TSV file produced by Centrifuge
  - CSV file produced by WIMP (ONT commercial tool)
  - CSV file produced by EPI2ME-BLAST (ONT commercial tool)
The extension of the file matters here, as the treatment will be different. With a SAM file, the script will write a CSV file after the SAM parsing (destination given by the `-o | --outDir` option). That way, the next time, the script will look directly (still in 'outDir') for this written CSV, and time is saved. This CSV is minimalist enough (only reads with the list of their target seqIDs from mapping) to be reused with another `seqid2taxid`.
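As a purely hypothetical illustration (the actual column names are whatever the script writes), this intermediate CSV boils down to one read per line with the target seqIDs it mapped to:

```
readID,targets
read_0001,seqID_A
read_0002,seqID_A;seqID_C
```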
Example of command:

```
2-prim_analysis.py -i /path/to/your_fav_file.sam -c pipeline.conf -l genus -o ./my_CSVs/
```
To handle multi-hits, the default method is 'minor_rm+LCA', but 3 other methods ('topOne', 'majoVoting', 'LCA') can be used by setting the proper line in the hard-coded section around line 650 of the `2-prim_analysis.py` script (see the sketch below).
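As a rough illustration of what the three simple strategies do (a sketch, not the repository's actual implementation; the helper and its `lineages` argument are hypothetical, and 'minor_rm+LCA' presumably discards minority hits before taking the LCA):

```python
from collections import Counter

def resolve_multihit(taxids, lineages, method="LCA"):
    """Illustrative resolution of a read hitting several taxids.

    taxids   -- taxids hit by the read, best hit first
    lineages -- dict: taxid -> root-to-leaf tuple of ancestor taxids
    """
    if method == "topOne":
        # Keep only the single best hit.
        return taxids[0]
    if method == "majoVoting":
        # Keep the most frequent taxid among all hits.
        return Counter(taxids).most_common(1)[0][0]
    if method == "LCA":
        # Lowest common ancestor: deepest taxid shared by every lineage.
        lca = None
        for level in zip(*(lineages[t] for t in taxids)):
            if len(set(level)) != 1:
                break
            lca = level[0]
        return lca
    raise ValueError("unknown method: %s" % method)
```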
In the same way, you can apply a different filter on Centrifuge results, based instead on a ratio 'hitLength/queryLength' > 40% (currently, the cutoff on 'hitLength' is fixed at 50, as described in Centrifuge's paper). The concerned section of code is around line 100 of the script; both variants are sketched below.
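A minimal sketch of the two filters, assuming the Centrifuge classification TSV (whose standard per-read output includes `hitLength` and `queryLength` columns) has been loaded into a pandas DataFrame:

```python
import pandas as pd

# Load a Centrifuge classification table.
df = pd.read_csv("your_fav_file.tsv", sep="\t")

# Default filter: absolute cutoff on the hit length (fixed at 50).
kept = df[df["hitLength"] >= 50]

# Alternative filter: relative cutoff, hitLength/queryLength > 40%.
kept = df[df["hitLength"] / df["queryLength"] > 0.40]
```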
This script also performs a 'correction' with the RRN database for B. intestinalis (converted into B. subtilis), around line 460.
- This script basically performs relative abundance calculations. It takes as input a TSV produced using several scripts from QIIME1. Once you have activated a proper Anaconda environment containing QIIME1, the classic pipeline to obtain such a file can be run with:

```
./input_second.sh /path/to/your_fav_SAM.sam /path/output/directory
```

(usage for this Bash script can be found within it)
- This script has to be run from the root directory, to have access to the `metadat_files/` folder. This folder contains 'taxonomic metadata files', i.e. 2-column TSV files associating each taxid from the 'complete_lineage' with its SILVA-formatted taxonomy. Such a 'taxo_metadata' file can be generated from a seqid2taxid file, using the `write_metadat_file(/path/to/seqid2taxid, name_DB)` function of the `./0-solve_SILVA.py` script (format sketched below).
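As an illustration (the lineage strings are shortened here), such a 'taxo_metadata' TSV simply pairs each taxid with a SILVA-style semicolon-separated lineage:

```
1423	Bacteria;Firmicutes;Bacilli;Bacillales;Bacillaceae;Bacillus
287	Bacteria;Proteobacteria;Gammaproteobacteria;Pseudomonadales;Pseudomonadaceae;Pseudomonas
```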
Both scripts can be run giving a CSV as command-line argument: `./make_stackbar.R well_formatted_file.csv`. Examples of input CSV files can be found in `examples/`:

- Files `cusco2018_abund_vs-different-DB.csv` and `sampl500K_Minimap2_to-different-DB_SPECIES.csv` (for `make_stackbar.R`)
- File `16Skit_run1_sampl500K_Minimap2_raw-vs-filt_toZymo_SPECIES.csv` (for `make_radar_plot.R`)
, lines can represent either runs (e.g. '16Skit_run2' and
'16Skit_run2') or conditions ('toMockDB'). The most important is the header,
which has to be formatted as follow: "species-Name_runID" ('abundances mode') or
"metric1_cond1" ('metrics mode'). The key point here being the underscore, that
must be present only once in each column name, to properly separate each 'category'.
In 'metrics mode', this script accepts 2 different metrics (normally FDR and recall).
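As an illustration only (species, run, and condition names are made up), the two header styles would look like:

```
abundances mode:  bacillusSubtilis_run1,escherichiaColi_run1,listeriaMonocytogenes_run1
metrics mode:     FDR_toZymo,recall_toZymo,FDR_toRrn,recall_toRrn
```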
The script `make_radar_plot.R` has been coded on the same principle, except that it aims to generate a 'radar plot', displaying a unique value (e.g. a relative abundance) for each species present (in our case: the 8 bacteria from the Zymo mock community + a 'misassigned' category) in columns, and for several runs/conditions (lines).
- Example of stdout produced by the `2-prim_analysis.py` script:
- Example of R plots that can be produced using the R scripts:
- General pipeline to trim adapters (Porechop), detect chimeras (yacrd), and proceed to taxonomic determination (either Minimap2 or Centrifuge) against different possible databases (ZYMO, RRN, SILVA, or p_compressed, with Centrifuge only)
- Evaluation of performances after taxonomic assignation
- Computation of several metagenomic metrics
If you detect any problems or bugs, feel free to contact me.