I built this Snakemake pipeline to showcase how FASTQ files can be taken all the way to a set of good-quality, dereplicated MAGs.
The general steps are:
- Quality-check the reads using FastQC.
- Assemble the FASTQ reads using SPAdes.
- Bin the MAGs using metaWRAP.
- Refine the MAGs using DAS Tool.
- Dereplicate the MAGs (if relevant) using dRep.
- Determine MAG quality using CheckM.
- Select only MIMAG quality-standard MAGs for further analyses (e.g. >50% complete, <10% contamination).
- Assign taxonomy to this MAG set using GTDB-Tk.
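The quality-selection step above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline's own code: it assumes the quality report has been exported as a tab-separated table, and the column names `bin_id`, `completeness` and `contamination` are hypothetical. The example rows are made up.

```python
import csv
import io

# Thresholds taken from the step list above (>50% complete, <10% contamination).
MIN_COMPLETENESS = 50.0
MAX_CONTAMINATION = 10.0


def select_quality_bins(table, min_completeness=MIN_COMPLETENESS,
                        max_contamination=MAX_CONTAMINATION):
    """Return the bin IDs that pass both thresholds.

    `table` is a file-like object holding a tab-separated table with a header.
    The column names "bin_id", "completeness" and "contamination" are
    hypothetical; adjust them to match your own quality report.
    """
    reader = csv.DictReader(table, delimiter="\t")
    passing = []
    for row in reader:
        completeness = float(row["completeness"])
        contamination = float(row["contamination"])
        # Strict inequalities, matching ">50%" and "<10%".
        if completeness > min_completeness and contamination < max_contamination:
            passing.append(row["bin_id"])
    return passing


# Made-up example values, for illustration only.
example = io.StringIO(
    "bin_id\tcompleteness\tcontamination\n"
    "bin.1\t95.2\t1.3\n"
    "bin.2\t48.0\t0.5\n"
    "bin.3\t72.4\t12.9\n"
    "bin.4\t50.1\t9.9\n"
)
print(select_quality_bins(example))  # ['bin.1', 'bin.4']
```

Keeping the thresholds as keyword arguments makes it easy to tighten them later, for example to select only high-quality MAGs.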
An example of the results folder layout: https://github.com/patriciatran/MAG_pipeline/blob/main/example_results_folder.txt
- results/{sample}/final_bin_set/*.fasta: all the final bins in FASTA format
- results/{sample}/taxonomy_final_bin_set.tsv: final GTDB-Tk taxonomic assignment for the final bin set of MAGs
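As a rough sketch of how this layout could be summarized, the snippet below counts the files in each sample's `final_bin_set` folder. It is not part of the pipeline. The function name `count_final_bins` is hypothetical, and the default glob pattern `*.fasta` is an assumption about the bin file extension.

```python
from pathlib import Path


def count_final_bins(results_dir, pattern="*.fasta"):
    """Map each sample name to the number of bins in its final_bin_set folder.

    Assumes the layout results/{sample}/final_bin_set/ listed above.
    `pattern` is an assumption; change it if the bins use another extension.
    """
    counts = {}
    for sample_dir in sorted(Path(results_dir).iterdir()):
        bin_dir = sample_dir / "final_bin_set"
        # Skip anything that is not a sample folder with a final bin set.
        if bin_dir.is_dir():
            counts[sample_dir.name] = sum(1 for _ in bin_dir.glob(pattern))
    return counts


if __name__ == "__main__":
    results = Path("results")
    if results.is_dir():
        # Print one tab-separated line per sample: name, number of bins.
        for sample, n_bins in count_final_bins(results).items():
            print(f"{sample}\t{n_bins}")
```

Run from the directory that contains `results/`, this prints one line per sample with its bin count.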
April 17, 2023
- Pipeline works without errors!
- Next step: improve the documentation and distribute the pipeline as a package.
- Add ways to report final information, e.g. run time of the pipeline and how many MAGs are in the final bin set for each sample.
- Add ways to visualize final information, e.g. a bar plot of taxonomies across samples.
This pipeline exists because of the folks who make these programs available. Please cite their work:
- SPAdes: https://github.com/ablab/spades
- metaWRAP: https://github.com/bxlab/metaWRAP
- MetaBAT 1 and MetaBAT 2: https://bitbucket.org/berkeleylab/metabat
- DAS Tool: https://github.com/cmks/DAS_Tool
- Snakemake: https://snakemake.readthedocs.io/en/stable/
- conda/Anaconda: https://docs.anaconda.com/anaconda/user-guide/faq/
- MIMAG Standards: https://www.nature.com/articles/nbt.3893
Tisza MJ et al., "A catalog of tens of thousands of viruses from human metagenomes reveals hidden associations with chronic diseases.", Proc Natl Acad Sci U S A, 2021 Jun 8;118(23)