diff --git a/README.md b/README.md
index f3784ac2..cdd8706d 100644
--- a/README.md
+++ b/README.md
@@ -24,25 +24,25 @@ On release, automated continuous integration tests run the pipeline on a full-si
## Pipeline summary
-![nf-core/rnaseq metro map](docs/images/nf-core-metatdenovo_metro_map.png)
+![nf-core/metatdenovo metro map](docs/images/metatdenovo.png)
1. Read QC ([`FastQC`](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/))
2. Present QC for raw reads ([`MultiQC`](http://multiqc.info/))
-3. Quality trimming and adapters removal for raw reads [`Trimm Galore!`](https://www.bioinformatics.babraham.ac.uk/projects/trim_galore/)
-4. Filter sequences with [`BBduk`](https://jgi.doe.gov/data-and-tools/software-tools/bbtools/bb-tools-user-guide/bbduk-guide/)
-5. Normalize the sequencing depth with [`BBnorm`](https://jgi.doe.gov/data-and-tools/software-tools/bbtools/bb-tools-user-guide/bbnorm-guide/)
-6. Merge trimmed, pair-end reads ( [`Seqtk`](https://github.com/lh3/seqtk))
+3. Quality trimming and adapter removal for raw reads ([`Trim Galore!`](https://www.bioinformatics.babraham.ac.uk/projects/trim_galore/))
+4. Optional: Filter sequences with [`BBduk`](https://jgi.doe.gov/data-and-tools/software-tools/bbtools/bb-tools-user-guide/bbduk-guide/)
+5. Optional: Normalize the sequencing depth with [`BBnorm`](https://jgi.doe.gov/data-and-tools/software-tools/bbtools/bb-tools-user-guide/bbnorm-guide/)
+6. Merge trimmed, paired-end reads ([`Seqtk`](https://github.com/lh3/seqtk))
7. Choice of de novo assembly programs:
- 1. [`RNAspade`](https://cab.spbu.ru/software/rnaspades/) suggested for Eukaryotes de novo assembly
+ 1. [`RNAspades`](https://cab.spbu.ru/software/rnaspades/) suggested for Eukaryotes de novo assembly
  2. [`Megahit`](https://github.com/voutcn/megahit) suggested for Prokaryotes de novo assembly
8. Choice of orf caller:
  1. [`TransDecoder`](https://github.com/TransDecoder/TransDecoder) suggested for Eukaryotes
  2. [`Prokka`](https://github.com/tseemann/prokka) suggested for Prokaryotes
  3. [`Prodigal`](https://github.com/hyattpd/Prodigal) suggested for Prokaryotes
9. Quantification of genes identified in assemblies:
- 1. generate index of assembly [`BBmap index`](https://sourceforge.net/projects/bbmap/)
- 2. Mapping cleaned reads to the assembly for quantification [`BBmap`](https://sourceforge.net/projects/bbmap/)
- 3. Get raw counts per each gene present in the assembly [`Featurecounts`](http://subread.sourceforge.net) -> TSV table with collected featurecounts output
+ 1. Generate index of assembly ([`BBmap index`](https://sourceforge.net/projects/bbmap/))
+ 2. Map cleaned reads to the assembly for quantification ([`BBmap`](https://sourceforge.net/projects/bbmap/))
+ 3. Get raw counts for each gene present in the assembly ([`Featurecounts`](http://subread.sourceforge.net)) -> TSV table with collected featurecounts output
10. Functional annotation:
  1. [`Eggnog`](https://github.com/eggnogdb/eggnog-mapper) -> Reformat TSV output "eggnog table"
  2. [`KOfamscan`](https://github.com/takaram/kofam_scan)
@@ -50,7 +50,7 @@ On release, automated continuous integration tests run the pipeline on a full-si
11. Taxonomic annotation:
  1. [`EUKulele`](https://github.com/AlexanderLabWHOI/EUKulele) -> Reformat TSV output "Reformat_tax.R"
  2. [`CAT`](https://github.com/dutilh/CAT)
-12. Summary statistics table. Collect_stats.R
+12. Summary statistics table. "Collect_stats.R"
## Usage
@@ -71,8 +71,6 @@ CONTROL_REP1,AEG588A1_S1_L002_R1_001.fastq.gz,AEG588A1_S1_L002_R2_001.fastq.gz
Each row represents a fastq file (single-end) or a pair of fastq files (paired end).
--->
-
Now, you can run the pipeline using:
```bash
@@ -97,6 +95,9 @@ To see the results of an example test run with a full size dataset refer to the
For more details about the output files and reports, please refer to the [output documentation](https://nf-co.re/metatdenovo/output).
+_Note_ the `summary_tables` directory under the output directory.
+This will contain tsv tables that we have made especially for further analysis in tools like R or Python.
+
## Credits
nf-core/metatdenovo was originally written by Danilo Di Leo (@danilodileo), Emelie Nilsson (@emnilsson) & Daniel Lundin (@erikrikarddaniel).
diff --git a/docs/images/metat_diagram.svg b/docs/images/metat_v6.svg
similarity index 63%
rename from docs/images/metat_diagram.svg
rename to docs/images/metat_v6.svg
index 349f674d..34abdb94 100644
--- a/docs/images/metat_diagram.svg
+++ b/docs/images/metat_v6.svg
[SVG markup changes not reproduced: the pipeline metro-map figure was redrawn. Its labels cover the stages Pre-processing, Assembly, Mapping, Gene annotation, Quantification, Taxonomy & functional annotation and Pipeline summary, and the tools Input check, Trim Galore!, BBduk, BBnorm, Seqtk, Megahit, RNAspades, Prodigal, Prokka, TransDecoder, BBmap, FeatureCounts, EggNOG-mapper, KOfamScan, Hmmsearch, Hmmrank, EUKulele, MultiQC and Collect statistics.]
diff --git a/docs/images/mqc_fastqc_adapter.png b/docs/images/mqc_fastqc_adapter.png
deleted file mode 100755
index 361d0e47..00000000
Binary files a/docs/images/mqc_fastqc_adapter.png and /dev/null differ
diff --git a/docs/images/mqc_fastqc_counts.png b/docs/images/mqc_fastqc_counts.png
deleted file mode 100755
index cb39ebb8..00000000
Binary files a/docs/images/mqc_fastqc_counts.png and /dev/null differ
diff --git a/docs/images/mqc_fastqc_quality.png b/docs/images/mqc_fastqc_quality.png
deleted file mode 100755
index a4b89bf5..00000000
Binary files a/docs/images/mqc_fastqc_quality.png and /dev/null differ
diff --git a/docs/images/nf-core-metatdenovo_metro_map.png b/docs/images/nf-core-metatdenovo_metro_map.png
deleted file mode 100644
index ee6edf6c..00000000
Binary files a/docs/images/nf-core-metatdenovo_metro_map.png and /dev/null differ
diff --git a/docs/output.md b/docs/output.md
index e7bd4322..0bad51f6 100644
--- a/docs/output.md
+++ b/docs/output.md
@@ -4,52 +4,43 @@
This document describes the output produced by the pipeline.
-The directories listed below will be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory.
+The directories listed below will be created in the results directory after the pipeline has finished.
+All paths are relative to the top-level results directory.
## Pipeline overview -The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes data using the following steps: - -- [Summary tables folder](#summary-tables-folder) - Final tables that can be imported directly in R -- [Preprocessing](#preprocessing) - - [FastQC](#fastqc) - Read quality control - - [Trim galore!](#trimgalore) - Primer trimming - - [MultiQC](#multiqc) - Aggregate report describing results - - [BBduk](#bbduk) - Filter out sequences from samples based on a fasta file (optional) - - [BBnorm](#bbnorm) - Normalize the reads in the samples for a better assembly output (optional) -- [Assembly step](#assembly-step) - Generate contigs with an assembler program - - [Megahit](#megahit) - Output from Megahit assembly (default) - - [RNASpades](#rnaspades) - Output from Spades assembly (optional) -- [Orf Caller step](#orf-caller-step) - Generate amino acids fasta file with an orf caller program - - [Prodigal](#prodigal) - Output from Prodigal (default) - - [Prokka](#prokka) - Output from Prokka (optional) - - [TransDecoder](#transdecoder) - Output from transdecoder (optional) -- [Functional and taxonomical annotation](#functional-and-taxonomical-annotation) - Predict the function and the taxonomy of the amino acids fasta file - - [Hmmrsearch](#Hmmrsearch) - Analysis made with Hmmr profiles - - [EggNOG](#eggnog) - Run EggNOG-mapper on amino acids fasta file - - [KOfamSCAN](#kofamscan) - Run KOfamSCAN on amino acids fasta file - - [EUKulele](#eukulele) - Run taxonomical annotation on amino acids fasta file -- [Pipeline information](#pipeline-information) - Report metrics generated during the workflow execution - -### Summary tables folder - -A summary report for all statistics results in tsv format. The report gives a general overview of the analysis, includes featureCounts output, taxonomical and functional annotation tables. - -
-Output file
-
-- `summary_tables/`
-  - `overall_stats.tsv`: statistics summary report.
-  - `*counts.tsv`: summary table for featureCounts outputs
-  - `*.tsv`: several tables based on the different combinations of the pipeline. From taxonomical to functional annotation (optional)
-
+The pipeline is built using [Nextflow](https://www.nextflow.io/) and the results are organized as follows:
+
+- [Original output](#original-output)
+  - [Preprocessing](#preprocessing)
+    - [FastQC](#fastqc) - Read quality control
+    - [Trim galore!](#trim-galore) - Primer trimming
+    - [MultiQC](#multiqc) - Aggregate report describing results
+    - [BBduk](#bbduk) - Filter out sequences from samples that match sequences in a user-provided fasta file (optional)
+    - [BBnorm](#bbnorm) - Normalize the reads in the samples to use fewer resources for assembly (optional)
+  - [Assembly step](#assembly-step) - Generate contigs with an assembler program
+    - [Megahit](#megahit) - Output from Megahit assembly (default)
+    - [RNASpades](#rnaspades) - Output from Spades assembly (optional)
+  - [ORF Caller step](#orf-caller-step) - Identify protein-coding genes (ORFs) with an ORF caller
+    - [Prodigal](#prodigal) - Output from Prodigal (default)
+    - [Prokka](#prokka) - Output from Prokka (optional)
+    - [TransDecoder](#transdecoder) - Output from TransDecoder (optional)
+  - [Functional and taxonomical annotation](#functional-and-taxonomical-annotation) - Predict the function and the taxonomy of ORFs
+    - [EggNOG](#eggnog) - Output from EggNOG-mapper (default; optional)
+    - [KOfamSCAN](#kofamscan) - Output from KOfamSCAN (optional)
+    - [EUKulele](#eukulele) - Output from EUKulele taxonomy annotation (default; optional)
+    - [Hmmsearch](#hmmsearch) - Output from HMMER run with user-supplied HMM profiles (optional)
+- [Custom metatdenovo output](#metatdenovo-output)
+  - [Summary tables folder](#summary-tables) - Tab-separated tables ready for further analysis in tools like R and Python
+  - [Pipeline information](#pipeline-information) - Report metrics generated during the workflow execution
+
## Original output
### Preprocessing
#### FastQC
-[FastQC](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) gives general quality metrics about your sequenced reads. It provides information about the quality score distribution across your reads, per base sequence content (%A/T/G/C), adapter contamination and overrepresented sequences. For further reading and documentation see the [FastQC help pages](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/). FastQC runs in Trim galore! therefore its output can be found in Trimgalore's folder.
+[FastQC](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) gives general quality metrics about your sequenced reads. It provides information about the quality score distribution across your reads, per base sequence content (%A/T/G/C), adapter contamination and overrepresented sequences. For further reading and documentation see the [FastQC help pages](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/). FastQC is run as part of Trim Galore!, therefore its output can be found in the Trim Galore! folder.
Output files
@@ -61,13 +52,13 @@ A summary report for all statistics results in tsv format. The report gives a ge
#### Trim galore!
-[Trimgalore](https://github.com/FelixKrueger/TrimGalore) is trimming primer sequences from sequencing reads. Primer sequences are non-biological sequences that often introduce point mutations that do not reflect sample sequences. This is especially true for degenerated PCR primer. If primer trimming would be omitted, artifactual amplicon sequence variants might be computed by the denoising tool or sequences might be lost due to become labelled as PCR chimera.
+[Trimgalore](https://github.com/FelixKrueger/TrimGalore) trims primer sequences from sequencing reads. Primer sequences are non-biological sequences that often introduce point mutations that do not reflect sample sequences. This is especially true for degenerate PCR primers. If primer trimming were omitted, artifactual amplicon sequence variants might be computed by the denoising tool, or sequences might be lost because they become labelled as PCR chimeras.
Output files
- `trimgalore/`: directory containing log files with retained reads, trimming percentage, etc. for each sample.
-  - `*trimming_report.txt`: Report of read numbers that pass trimgalore.
+  - `*trimming_report.txt`: report of read numbers that pass trimgalore.
@@ -106,8 +97,10 @@ BBduk is built-in tool from BBmap
#### BBnorm
-[BBnorm](https://jgi.doe.gov/data-and-tools/software-tools/bbtools/bb-tools-user-guide/bbduk-guide/) is a tool from BBmap that allows to reduce the coverage of highly abundant sequences and remove the sequences that are below a threshold, and can be useful if the data set is too large to assemble but also potentially improve an assembly. N.B. the digital normalization is done only for the assembly and the non-normalized sequences will be used for quantification
-BBnorm is built-in tool from BBmap
+[BBnorm](https://jgi.doe.gov/data-and-tools/software-tools/bbtools/bb-tools-user-guide/bbnorm-guide/) is a tool from BBmap that allows you to reduce the coverage of highly abundant sequence kmers and remove sequences representing kmers that are below a threshold.
+It can be useful if the data set is too large to assemble, but it can also potentially improve an assembly.
+N.B. the digital normalization is done only for the assembly and the non-normalized sequences will be used for quantification.
+BBnorm is a BBmap tool.
Output files @@ -121,20 +114,21 @@ BBnorm is built-in tool from BBmap #### Megahit -[Megahit](https://github.com/voutcn/megahit) is used to assemble the cleaned and trimmed FastQ reads to create the reference genome. +[Megahit](https://github.com/voutcn/megahit) is used to assemble the cleaned and trimmed FastQ reads into contigs. +
Output file + - `megahit/megahit_out/` - - `*.log`: it is a log file of Megahit run. - - `megahit_assembly.contigs.fa.gz`: Reference genome created by Megahit. - - `intermediate_contigs`: Folder that contains the intermediate steps of Megahit run. + - `*.log`: log file of Megahit run. + - `megahit_assembly.contigs.fa.gz`: reference genome created by Megahit. + - `intermediate_contigs`: folder that contains the intermediate steps of Megahit run.
#### RNASpades -Optionally, you can use [RNASpades](https://cab.spbu.ru/software/rnaspades/) to assemble your reference genome. -NB: we reccomend to use this assembler for eukaryotes rathern then prokaryotes. +Optionally, you can use [RNASpades](https://cab.spbu.ru/software/rnaspades/) to assemble reads into contigs.
Output files @@ -142,14 +136,15 @@ NB: we reccomend to use this assembler for eukaryotes rathern then prokaryotes. - `rnaspades/` - `rnaspades.assembly.gfa.gz`: gfa file output from rnaspades - `rnaspades.spades.log`: log file output from rnaspades run - - `rnaspades.transcripts.fa.gz`: Reference genome created by RNASpades -
+ - `rnaspades.transcripts.fa.gz`: reference genome created by RNASpades + +
-### Orf caller step +### ORF caller step #### Prodigal -As default, you can use [Prodigal](https://github.com/hyattpd/Prodigal) to find ORFs on your reference genome. +As default, [Prodigal](https://github.com/hyattpd/Prodigal) is used to identify ORFs in the assembly.
Output files
@@ -163,8 +158,9 @@ As default, you can use [Prodigal](https://github.com/hyattpd/Prodigal) to find
#### Prokka
-As one alternative, you can use [Prokka](https://github.com/tseemann/prokka) to find ORFs on your reference genome.
-NB: Prodigal and Prokka are reccomended for prokaryotic samples
+As one alternative, you can use [Prokka](https://github.com/tseemann/prokka) to identify ORFs in the assembly.
+In addition to calling ORFs (done with Prodigal), Prokka will filter the ORFs to retain only high-quality ones and will functionally annotate them.
+NB: Prodigal or Prokka are recommended for prokaryotic samples.
Output files
@@ -178,8 +174,8 @@ NB: Prodigal and Prokka are reccomended for prokaryotic samples
#### TransDecoder
-Another alternative is [TransDecoder](https://github.com/sghignone/TransDecoder) to find ORFs on your reference genome.
-TransDecoder is reccomended for Eukaryotic samples
+Another alternative is to use [TransDecoder](https://github.com/sghignone/TransDecoder) to find ORFs in the assembly.
+N.B. TransDecoder is recommended for eukaryotic samples.
Output files @@ -193,65 +189,87 @@ TransDecoder is reccomended for Eukaryotic samples ### Functional and taxonomical annotation -#### Hmmrsearch +#### EggNOG -You can run [Hmmsearch](https://www.ebi.ac.uk/Tools/hmmer/search/hmmsearch) scan on the reference amino acids fasta file by giving hmm profiles to the pipeline. +[EggNOG-mapper](https://github.com/eggnogdb/eggnog-mapper) will perform an analysis to assign functions to the ORFs.
Output files -- `hmmer/` - - `*.tbl.gz`: +- `eggnog/` + - `*.emapper.annotations.gz`: a file with the results from the annotation phase, see the [EggNOG-mapper documentation](https://github.com/eggnogdb/eggnog-mapper/wiki/). + - `*.emapper.hits.gz`: a file with the results from the search phase, from HMMER, Diamond or MMseqs2. + - `*.emapper.seed_orthologs.gz`: a file with the results from parsing the hits. Each row links a query with a seed ortholog. This file has the same format independently of which searcher was used, except that it can be in short format (4 fields), or full.
-Automatically, the pipline will run Hmmrank in order to find the best rank for each ORFs of your reference file. +#### KOfamScan + +[KOfamScan](https://github.com/takaram/kofam_scan) will perform an analysis to assign KEGG orthologs to ORFs.
Output files -- `hmmrank/` - - `*.tsv.gz`: tab separeted file with the ranked ORFs for each HMM profile. +- `kofamscan/` + - `*.kofamscan_output.tsv.gz`: kofamscan output.
-#### EggNOG +#### EUKulele -[EggNOG-mapper](https://github.com/eggnogdb/eggnog-mapper) will perform an analysis to assign a function to the ORFs +[EUKulele](https://github.com/AlexanderLabWHOI/EUKulele) will perform an analysis to assign taxonomy to the ORFs. +A number of databases are supported: MMETSP, PhyloDB and GTDB. +GTDB currently only works as a user provided database, i.e. data must be downloaded before running nf-core/metatdenovo.
Output files -- `eggnog/` - - `*.emapper.annotations.gz`: A file with the results from the annotation phase. Therefore, each row represents the annotation reported for a given query. - - `*.emapper.hits.gz`: A file with the results from the search phase, from HMMER, Diamond or MMseqs2. - - `*.emapper.seed_orthologs.gz`: A file with the results from parsing the hits. Each row links a query with a seed ortholog. This file has the same format independently of which searcher was used, except that it can be in short format (4 fields), or full. +- `eukulele/assembler.orfcaller/mets_full/diamond/` + - `*.diamond.out.gz`: Diamond output +- `eukulele/assembler.orfcaller/taxonomy_estimation/` +- `*-estimated-taxonomy.out.gz`: EUKulele output
-#### KOfamScan +#### Hmmsearch -[KOfamScan](https://github.com/takaram/kofam_scan) will perform an analysis to assign a function to the ORFs +You can run [hmmsearch](https://www.ebi.ac.uk/Tools/hmmer/search/hmmsearch) on ORFs using a set of HMM profiles provided to the pipeline (see the `--hmmdir`, `--hmmpatern` and `--hmmfiles` parameters).
Output files
-- `kofamscan/`
-  - `*.kofamscan_output.tsv.gz`: kofamscan output.
+- `hmmer/`
+  - `*.tbl.gz`: tabular hmmsearch results for the provided HMM profiles.
-#### EUKulele - -[EUKulele](https://github.com/AlexanderLabWHOI/EUKulele) will perform an analysis to assign a taxonomy to the ORFs +After the search, hits for each ORF and HMM will be summarised and ranked based on scores for the hits (see also output in [summary tables](#summary-tables)).
Output files
-- `eukulele/assembler.orfcaller/mets_full/diamond/`
-  - `*.diamond.out.gz`: Diamond output
-- `eukulele/assembler.orfcaller/taxonomy_estimation/`
-- `*-estimated-taxonomy.out.gz`: EUKulele output
+- `hmmrank/`
+  - `*.tsv.gz`: tab-separated file with the ranked ORFs for each HMM profile.
+
+## Metatdenovo output
+
+### Summary tables
+
+Consistently named and formatted output tables in tsv format, ready for further analysis.
+Filenames start with the assembly program and ORF caller, to allow reruns of the pipeline with different parameter settings without overwriting output files.
+A quick way to inspect these tables from the command line is sketched after the file list below.
+Output file + +- `summary_tables/` + - `{assembler}.{orf_caller}.overall_stats.tsv.gz`: overall statistics from the pipeline, e.g. number of reads, number of called ORFs, number of reads mapping back to contigs/ORFs etc. + - `{assembler}.{orf_caller}.counts.tsv.gz`: read counts per ORF and sample. + - `{assembler}.{orf_caller}.emapper.tsv.gz`: reformatted output from EggNOG-mapper. + - `{assembler}.{orf_caller}.{db}_eukulele.tsv.gz`: taxonomic annotation per ORF for specific database. + - `{assembler}.{orf_caller}.prokka-annotations.tsv.gz`: reformatted annotation output from Prokka. + - `{assembler}.{orf_caller}.hmmrank.tsv.gz`: ranked summary table from HMMER results.
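As mentioned above, the gzipped tables can be inspected directly from the command line; a minimal sketch is shown below (the `megahit.prodigal` file name prefix and the `results` output directory are assumptions that depend on the assembler, ORF caller and `--outdir` used for your run):

```bash
# Look at the header and the first few rows of the overall statistics table
zcat results/summary_tables/megahit.prodigal.overall_stats.tsv.gz | head

# Count the number of ORF-by-sample rows in the counts table (excluding the header line)
zcat results/summary_tables/megahit.prodigal.counts.tsv.gz | tail -n +2 | wc -l
```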
@@ -261,9 +279,9 @@ Automatically, the pipline will run Hmmrank in order to find the best rank for e Output files - `pipeline_info/` - - Reports generated by Nextflow: `execution_report.html`, `execution_timeline.html`, `execution_trace.txt` and `pipeline_dag.dot`/`pipeline_dag.svg`. - - Reports generated by the pipeline: `pipeline_report.html`, `pipeline_report.txt` and `software_versions.yml`. The `pipeline_report*` files will only be present if the `--email` / `--email_on_fail` parameter's are used when running the pipeline. - - Reformatted samplesheet files used as input to the pipeline: `samplesheet.valid.csv`. - - Parameters used by the pipeline run: `params.json`. + - reports generated by Nextflow: `execution_report.html`, `execution_timeline.html`, `execution_trace.txt` and `pipeline_dag.dot`/`pipeline_dag.svg`. + - reports generated by the pipeline: `pipeline_report.html`, `pipeline_report.txt` and `software_versions.yml`. The `pipeline_report*` files will only be present if the `--email` / `--email_on_fail` parameter's are used when running the pipeline. + - reformatted samplesheet files used as input to the pipeline: `samplesheet.valid.csv`. + - parameters used by the pipeline run: `params.json`.
diff --git a/docs/usage.md b/docs/usage.md
index 194b0b49..67f5c1db 100644
--- a/docs/usage.md
+++ b/docs/usage.md
@@ -1,45 +1,55 @@
# nf-core/metatdenovo: Usage
-## :warning: Please read this documentation on the nf-core website: [https://nf-co.re/metatdenovo/usage] (the link is not working) (https://nf-co.re/metatdenovo/usage)
+## :warning: Please read this documentation on the nf-core website: [https://nf-co.re/metatdenovo/usage](https://nf-co.re/metatdenovo/usage)
> _Documentation of pipeline parameters is generated automatically from the pipeline schema and can no longer be found in markdown files._
## Introduction
-## Samplesheet input
+Metatdenovo is a workflow primarily designed for annotation of metatranscriptomes for which reference genomes are not available.
+The approach is to first create an assembly, then call genes and finally quantify and annotate the genes.
+Since the workflow includes gene callers, annotation tools and databases for both prokaryotes and eukaryotes, it should be suitable for both
+organism groups, and mixed communities can be handled by trying different gene callers and comparing the results.
-You will need to create a samplesheet with information about the samples you would like to analyse before running the pipeline. Use this parameter to specify its location. It must be a comma-separated file with 3 columns, and a header row as shown in the examples below
+While the rationale for writing the workflow was metatranscriptomes, there is nothing in the workflow that precludes its use for single organisms rather than
+communities, or for genomes rather than transcriptomes.
+Instead, the workflow should be usable for any project in which a de novo assembly followed by quantification and annotation is suitable.
+
+## Running the workflow
+
+### Quickstart
+
+A typical command for running the workflow is:
```bash
---input '[path to samplesheet file]'
+nextflow run nf-core/metatdenovo -profile docker --outdir results/ --input samples.csv
```
-### Multiple runs of the same sample
-The `sample` identifiers have to be the same when you have re-sequenced the same sample more than once e.g. to increase sequencing depth. The pipeline will concatenate the raw reads before performing any downstream analysis. Below is an example for the same sample sequenced across 3 lanes:
+### Samplesheet input
-```console
-sample,fastq_1,fastq_2
-CONTROL_REP1,AEG588A1_S1_L002_R1_001.fastq.gz,AEG588A1_S1_L002_R2_001.fastq.gz
-CONTROL_REP1,AEG588A1_S1_L003_R1_001.fastq.gz,AEG588A1_S1_L003_R2_001.fastq.gz
-CONTROL_REP1,AEG588A1_S1_L004_R1_001.fastq.gz,AEG588A1_S1_L004_R2_001.fastq.gz
+You will need to create a samplesheet with information about the samples you would like to analyse before running the pipeline. Use this parameter to specify its location. It must be a comma-separated file with 3 columns, and a header row as shown in the examples below
```bash
---input '[path to samplesheet file]'
```
-### Full samplesheet
+#### Full samplesheet
+
+
+
-The pipeline will auto-detect whether a sample is single- or paired-end using the information provided in the samplesheet. The samplesheet can have as many columns as you desire, however, there is a strict requirement for the first 3 columns to match those defined in the table below.
+
-A final samplesheet file consisting of both single- and paired-end data may look something like the one below. This is for 6 samples, where `TREATMENT_REP3` has been sequenced twice.
+A final samplesheet file consisting of samples taken at time 0 and 24 in triplicate may look like the one below. ```console sample,fastq_1,fastq_2 -CONTROL_REP1,AEG588A1_S1_L002_R1_001.fastq.gz,AEG588A1_S1_L002_R2_001.fastq.gz -CONTROL_REP2,AEG588A2_S2_L002_R1_001.fastq.gz,AEG588A2_S2_L002_R2_001.fastq.gz -CONTROL_REP3,AEG588A3_S3_L002_R1_001.fastq.gz,AEG588A3_S3_L002_R2_001.fastq.gz -TREATMENT_REP1,AEG588A4_S4_L003_R1_001.fastq.gz, -TREATMENT_REP2,AEG588A5_S5_L003_R1_001.fastq.gz, -TREATMENT_REP3,AEG588A6_S6_L003_R1_001.fastq.gz, -TREATMENT_REP3,AEG588A6_S6_L004_R1_001.fastq.gz, +T0a,AEG588A1_S1_L002_R1_001.fastq.gz,AEG588A1_S1_L002_R2_001.fastq.gz +T0b,AEG588A2_S2_L002_R1_001.fastq.gz,AEG588A2_S2_L002_R2_001.fastq.gz +T0c,AEG588A3_S3_L002_R1_001.fastq.gz,AEG588A3_S3_L002_R2_001.fastq.gz +T24a,AEG588A4_S1_L002_R1_001.fastq.gz,AEG588A4_S1_L002_R2_001.fastq.gz +T24b,AEG588A5_S2_L002_R1_001.fastq.gz,AEG588A5_S2_L002_R2_001.fastq.gz +T24c,AEG588A6_S3_L002_R1_001.fastq.gz,AEG588A6_S3_L002_R2_001.fastq.gz ``` | Column | Description | @@ -50,105 +60,102 @@ TREATMENT_REP3,AEG588A6_S6_L004_R1_001.fastq.gz, An [example samplesheet](../assets/samplesheet.csv) has been provided with the pipeline. -## Filter/remove sequences from the samples (e.g. rRNA sequences with SILVA database) +#### Multiple runs of the same sample + +The `sample` identifiers have to be the same when you have re-sequenced the same sample more than once e.g. to increase sequencing depth. The pipeline will concatenate the raw reads before performing any downstream analysis. Below is an example for the same sample sequenced across 3 lanes: + +```console +sample,fastq_1,fastq_2 +T0a,AEG588A1_S1_L002_R1_001.fastq.gz,AEG588A1_S1_L002_R2_001.fastq.gz +T0a,AEG588A1_S1_L003_R1_001.fastq.gz,AEG588A1_S1_L003_R2_001.fastq.gz +T0a,AEG588A1_S1_L004_R1_001.fastq.gz,AEG588A1_S1_L004_R2_001.fastq.gz +``` + +### Filter/remove sequences from the samples (e.g. rRNA sequences with SILVA database) The pipeline can remove potential contaminants using the BBduk program. Specify a fasta file, gzipped or not, with the --sequence_filter sequences.fasta parameter. For further documentation, see the [BBduk official website](https://jgi.doe.gov/data-and-tools/software-tools/bbtools/bb-tools-user-guide/bbduk-guide/). -## Digital normalization +### Digital normalization -Metatdenovo can perform "digital normalization" on the reads BEFORE the assembly. +Metatdenovo can perform "digital normalization" of the reads before the assembly. This will reduce coverage of highly abundant sequences and remove sequences that are below a threshold, and can be useful if the data set is too large to assemble but also potentially improve an assembly. -N.B. the digital normalization is done only for the assembly and the non-normalized sequences will be used for quantification. -There is one option for digital normalization in the pipeline: - -- bbnorm (`--bbnorm`) +N.B. the digital normalization is done only for the assembly and the full set of sequences will be used for quantification. +To turn on digital normalization, use the `--bbnorm` parameter and, if required, adjust the `--bbnorm_target` and `--bbnorm_min` parameters. > Please, check the [bbnorm](https://jgi.doe.gov/data-and-tools/software-tools/bbtools/bb-tools-user-guide/bbnorm-guide/) documentation for further information about these programs and how digital normalization works. Remember to check [Parameters](https://nf-co.re/metatdenovo/parameters) page for the all options that can be used for this step. 
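As an illustration, a run that both filters out contaminating sequences and applies digital normalization before the assembly might look like the sketch below (the `silva_rrna.fasta` file name and the target/minimum depth values are placeholders, not pipeline defaults):

```bash
nextflow run nf-core/metatdenovo -profile docker \
    --input samplesheet.csv \
    --outdir results \
    --sequence_filter silva_rrna.fasta \
    --bbnorm --bbnorm_target 100 --bbnorm_min 2
```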
-## Assembler options
-
-By default, the pipeline uses Megahit (i.e. `--assembler megahit`) to assemble the cleaned and trimmed FastQ reads to create the reference genome.
-Megahit is fast and it requires a not a lot of memory to run, typically is suggested to be used with prokaryotic samples.
-The pipeline allows you to choose another assembler RNAspades, (i.e. `--assembler rnaspades` ), that is usually suggested to use for eukaryotic samples.
-You can also choose to input contigs from an assembly that you made outside the pipeline using the `--assembly file.fna` (where `file.fna` is the name of a fasta file with contigs) option.
-
-> N.B. you can use `Megahit` for eukaryotic samples too, we just suggest what is the best option according to our experience (literature?).
-
-## Orf caller options
+### Assembler options
-By default, the pipeline uses prodigal (i.e. `--orf_caller prodigal` ) to generate the genome feature file (.gff) and to generate gene structure from the assembly.
+By default, the pipeline uses Megahit (`--assembler megahit`) to assemble the cleaned and trimmed reads to create the reference contigs.
+Megahit is fast and it does not require a lot of memory to run, making it ideal for large sets of samples.
+The workflow also supports RNAspades (`--assembler rnaspades`) as an alternative.
-Other orf caller options for running the pipeline are:
-
-- Prokka (`--orf_caller prokka`)
+You can also choose to input contigs from an assembly that you made outside the pipeline using the `--assembly file.fna` (where `file.fna` is the name of a fasta file with contigs) option.
-- Transdecoder (`--orf_caller transdecoder`)
+### ORF caller options
-> N.B. Prokka and prodigal are suggested to run with prokaryotes while transdecoder is specific for eukaryotes.
+By default, the pipeline uses prodigal (`--orf_caller prodigal`) to call genes/ORFs from the assembly.
+This is suitable for prokaryotes, as is the Prokka alternative (`--orf_caller prokka`).
+The latter uses Prodigal internally, making it suitable for prokaryotic genes.
+It also performs functional annotation of ORFs.
-## Taxonomical annotation options
+For eukaryotic genes, we recommend using Transdecoder (`--orf_caller transdecoder`) to call ORFs.
-Metatdenovo uses `EUKulele` as the main program for taxonomy annotation. `EUKulele` can be run with different reference datasets. The default dataset is PhyloDB (i.e. `--eukulele_db phylodb` ) which works for mixed communities of prokaryotes and eukaryotes.
+### Taxonomic annotation options
-Other databases options for running the pipeline are:
+Metatdenovo uses EUKulele as the main program for taxonomy annotation.
+EUKulele can be run with different reference datasets.
+The default dataset is PhyloDB (`--eukulele_db phylodb`) which works for mixed communities of prokaryotes and eukaryotes.
+Other database options for running the pipeline are MMETSP (`--eukulele_db mmetsp`; for marine protists) and GTDB (`--eukulele_db gtdb`; for prokaryotes
+[under development]).
-- MMETSP (`--eukulele_db mmetsp`)
+Options:
-- GTDB (`--eukulele_db gtdb`) [under development]
+- PhyloDB: default, covers both prokaryotes and eukaryotes
+- MMETSP: marine protists
+- GTDB: prokaryotes, both bacteria and archaea
-PhyloDB and GTDB are recommended for prokaryotic datasets and MMETSP for eukaryotes, although PhyoDB can be also recognize eukaryotes and can be used for this purpose.
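For example, a run that uses RNAspades for the assembly, TransDecoder for ORF calling and the MMETSP database for taxonomic annotation could look like the following sketch (all parameters shown are described above):

```bash
nextflow run nf-core/metatdenovo -profile docker \
    --input samplesheet.csv \
    --outdir results \
    --assembler rnaspades \
    --orf_caller transdecoder \
    --eukulele_db mmetsp
```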
+You can also provide your own database, see the [EUKulele documentation](https://eukulele.readthedocs.io/en/latest/#).
-If you already have these databases ready in your working directory, you can point to the folder so the pipeline will not download the database (e.g. `--eukulele_dbpath your/path/database/`). N.B. When you are using a custom database, don't specify the `--eukulele_db` option. The pipeline will provide a default name for the database to avoid that EUKulele will try to download a new database.
+Databases are automatically downloaded by the workflow, but if you already have them available you can use the `--eukulele_dbpath path/to/db` parameter, pointing
+to the root directory of the EUKulele databases.
+(The default for this parameter is `eukulele`.)
> Please, check the [EUKulele documentation](https://eukulele.readthedocs.io/en/latest/#) for more information about the databases.
-An alternative to EUKulele is the CAT program. In contrast to EUKulele that annotates open reading frames (ORFs), CAT annotates the contigs from the assembly.
+
-These options are:
+### Functional annotation options
-- [Eggnog](https://github.com/eggnogdb/eggnog-mapper/wiki) (`--eggnog_dbpath`)
+Besides the functional annotation that the gene caller Prokka gives (see above), there are two general-purpose functional annotation programs available
+in the workflow: [eggNOG-mapper](http://eggnog-mapper.embl.de/) and [KofamScan](https://github.com/takaram/kofam_scan).
+Both are suitable for both prokaryotic and eukaryotic genes and both are run by default, but can be skipped using the `--skip_eggnog` and
+`--skip_kofamscan` options, respectively.
+The tools use large databases which are downloaded automatically, but paths can be provided by the user through the `--eggnog_dbpath directory` and
+`--kofam_dir dir` parameters, respectively.
-- [hmmsearch](http://eddylab.org/software/hmmer/Userguide.pdf) (`--hmmdir` or `--hmmfiles`)
-
-- [kofamscan](https://github.com/takaram/kofam_scan) (`--kofam_dir`)
-
-All the options can run at the same time (e.g. `nextflow run main.nf -profile test,docker --eggnog --hmmdir hmms/ `) but each program has its own options that you will need to read carefully before running the pipeline.
-You can find more information about the different options in the [parameters page](https://nf-co.re/metatdenovo/parameters).
-For details about individual programs used, see their respective home pages.
-
-If an Eggnog, kofam, or EUKulele database is already available, they can be specified with the above commands to skip the automatic download that the pipeline performs.
-
-If you don't want run eggNOG-mapper, you will need to add the flag `--skip_eggnog`, otherwise metatdenovo will run the program automatically.
+A more targeted annotation option offered by the workflow is the possibility for the user to provide a set of
+[HMMER HMM profiles](http://eddylab.org/software/hmmer/Userguide.pdf) through the `--hmmdir dir` or `--hmmfiles file0.hmm,file1.hmm,...,filen.hmm` parameters.
+Each HMM file will be used to search the amino acid sequences of the ORF set and the results will be summarized in a tab-separated file in which each
+ORF-HMM combination will be ranked according to score and E-value.
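For instance, a run that skips both general-purpose annotators and instead searches a couple of your own HMM profiles could look like this sketch (the profile file names are placeholders):

```bash
nextflow run nf-core/metatdenovo -profile docker \
    --input samplesheet.csv \
    --outdir results \
    --skip_eggnog --skip_kofamscan \
    --hmmfiles pmoA.hmm,amoA.hmm
```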
## Example pipeline command with some common features -````nextflow -nextflow run lnuc-eemis/metatdenovo -profile docker --input samplesheet.csv --assembler rnaspades --orf_caller transdecoder --eggnog - -In this example, we are running metatdenovo with `rnaspades` as assembler, `transdecoder` as ORF caller and `eggnog` for functional annotation. - -## Running the pipeline - -The typical command for running the pipeline is as follows: - ```bash -nextflow run nf-core/metatdenovo --input ./samplesheet.csv --outdir ./results -profile docker -```` +nextflow run nf-core/metatdenovo -profile docker --input samplesheet.csv --assembler rnaspades --orf_caller prokka --eggnog --eukulele_db gtdb +``` -This will launch the pipeline with the `docker` configuration profile. See below for more information about profiles. +In this example, we are running metatdenovo with `rnaspades` as assembler, `prokka` as ORF caller, `eggnog` for functional annotation and EUKulele with the GTDB database for taxonomic annotation. Note that the pipeline will create the following files in your working directory: @@ -176,9 +183,11 @@ nextflow run nf-core/metatdenovo -profile docker -params-file params.yaml with `params.yaml` containing: ```yaml -input: './samplesheet.csv' -outdir: './results/' -genome: 'GRCh37' +input: 'samplesheet.csv' +assembler: 'rnaspades' +orf_caller: 'prokka' +eggnog: true +eukulele_db: 'gtdb' <...> ```