The `pacmill` python package is a bioinformatics pipeline that is developed to process microbial 16S amplicon sequencing data and produce PDF reports detailing taxonomical assignments along with other statistics.

pacmill version 0.5.4

The pacmill python package is a bioinformatics pipeline that is developed to process microbial 16S amplicon sequencing data. It is specialized in the analysis of long reads such as those provided by PacBio sequencers.

Prerequisites

Since pacmill is written in python, it is compatible with all operating systems: Linux, macOS and Windows. The only prerequisite is python3 (which is often installed by default) along with the pip3 package manager.

To check if you have python3 installed, type the following on your terminal:

$ python3 -V

If you do not have python3 installed, please refer to the section obtaining python3.

To check if you have pip3 installed, type the following on your terminal:

$ pip3 -V

If you do not have pip3 installed, please refer to the section obtaining pip3.

Installing

To install the pacmill package, simply type the following commands on your terminal:

$ pip3 install --user pacmill

Alternatively, if you want to install it for all users of the system:

$ sudo pip3 install pacmill

These commands will also automatically install all the other python modules on which pacmill depends.
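To confirm that the installation succeeded, one option is to ask Python which version of the package it can see. The snippet below is a minimal sketch that relies only on the standard library (`importlib.metadata`, Python 3.8 or later); the helper name `installed_version` is ours and is not part of pacmill.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version("pacmill")
    if version is None:
        print("pacmill does not appear to be installed for this interpreter.")
    else:
        print(f"pacmill {version} is installed.")
```

If the package is reported as missing right after a `--user` install, the `pip3` and `python3` commands may be pointing at different interpreters.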

External programs

The pacmill pipeline also depends on several shell commands being available. The following executables should be present in your $PATH environment variable:

  • fastQValidator, fastqc, barrnap, vsearch, mothur, xelatex, fastq-dump

If any of these required external programs are missing, you will be prompted to install them and given easy instructions to do so.
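If you prefer to check for these executables yourself before starting a run, the sketch below looks up each name from the list above on your `$PATH` using the standard library function `shutil.which`. The helper name `missing_executables` is hypothetical and this is not how pacmill itself performs the check.

```python
import shutil

# Executable names taken from the list above.
REQUIRED_EXECUTABLES = [
    "fastQValidator",
    "fastqc",
    "barrnap",
    "vsearch",
    "mothur",
    "xelatex",
    "fastq-dump",
]


def missing_executables(names=REQUIRED_EXECUTABLES):
    """Return the names that cannot be found on the current $PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_executables()
    if missing:
        print("Not found in $PATH: " + ", ".join(missing))
    else:
        print("All required external programs were found.")
```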

Usage

Metadata

The first thing to do when starting a new analysis is to fill in a metadata file that details all there is to know about the biological samples being processed.

An empty template for such a file is found under this repository at pacmill/metadata/metadata_blank.xlsx. You can make a copy of this file for every new project.
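Copying the template can be done by hand or scripted. The following is a small sketch using only the standard library; the function name `copy_metadata_template`, the default file name `metadata.xlsx` and the project directory layout are assumptions made for illustration, not conventions required by pacmill.

```python
import shutil
from pathlib import Path


def copy_metadata_template(template, project_dir, filename="metadata.xlsx"):
    """Copy the blank metadata template into `project_dir`.

    Refuses to overwrite an existing file so that a metadata sheet
    that has already been filled in is never lost.
    """
    destination = Path(project_dir) / filename
    if destination.exists():
        raise FileExistsError(f"{destination} already exists")
    destination.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(Path(template), destination)
    return destination


# Example call, run from the root of a clone of this repository:
# copy_metadata_template("pacmill/metadata/metadata_blank.xlsx", "my_project")
```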

In addition, another file named pacmill/metadata/metadata_example.xlsx shows typical values that the fields are supposed to take, along with a short description of each entry. An excerpt of this file is shown below:

Loading the project

Below are some examples that illustrate the various ways of using this package.

# This example is not completed yet. TODO.

Customizing report headers

To change the text that appears inside the header of the PDF reports generated, you can adjust these three environment variables to your liking. Credit is appreciated where credit is due, but the software has a very permissive license that lets you decide what is best.

$ export PACMILL_HEADER="From the \textbf{pacmill} project"
$ export PACMILL_SUBHEADER="Written by consultants at \url{www.sinclair.bio}"
$ export PACMILL_LINK="Hosted at \url{www.github.com/xapple/pacmill}"
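The same three variables can also be set from a Python script instead of the shell. The sketch below assumes that pacmill reads them from the process environment after they have been set, which we have not verified, so set them before importing or running the pipeline. The helper `set_report_headers` is hypothetical. Raw strings (`r"..."`) are used so that LaTeX commands such as `\textbf` are not mangled: in an ordinary Python string, `\t` would become a tab character.

```python
import os


def set_report_headers(header, subheader, link, env=None):
    """Store the three report header strings in `env` (os.environ by default)."""
    env = os.environ if env is None else env
    env["PACMILL_HEADER"] = header
    env["PACMILL_SUBHEADER"] = subheader
    env["PACMILL_LINK"] = link
    return env


if __name__ == "__main__":
    # Raw strings keep the LaTeX backslashes intact.
    set_report_headers(
        r"From the \textbf{pacmill} project",
        r"Written by consultants at \url{www.sinclair.bio}",
        r"Hosted at \url{www.github.com/xapple/pacmill}",
    )
    print(os.environ["PACMILL_HEADER"])
```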

Demo project

In order to test and evaluate the pipeline, we have provided a demonstration project ready to be processed. This lets users see what type of outputs are generated by pacmill without having to bring their own DNA sequence data. Five samples are included and are taken from the following publication:

  • "Confident phylogenetic identification of uncultured prokaryotes through long read amplicon sequencing of the 16S-ITS-23S rRNA operon."
  • Joran Martijn, [many others], Thijs Ettema.
  • Science for Life Laboratory, Uppsala University
  • https://doi.org/10.1111/1462-2920.14636

The samples are publicly accessible on SRA and are described as follows:

  • mock: Genomic DNA from 38 phylogenetically distinct and diverse bacteria and archaea.
  • p19: Sediment sample obtained from hot spring Radiata Pool, Ngatamariki, New Zealand.
  • pm3: Sediment sample taken from 1.25m below the sea floor using a gravity core at Aarhus Bay, Denmark.
  • sala: Black biofilm that was taken at 60m depth in an old silver mine near Sala, Sweden.
  • tns08: Sediment sample taken from a shallow submarine hydrothermal vent field near Taketomi Island, Japan.

To run the demo project, do the following:

# This example is not completed yet. TODO.

Example graphs

The pacmill pipeline produces a multitude of graphs and visualizations after having processed the sequence data. Below are two examples. The first is a sequence length distribution of cleaned reads. The second is a barstack of taxonomic assignments for five different samples at the phylum level.

Example reports

After running the pipeline on a set of FASTQ files, several PDF reports are auto-generated. Examples of three reports are given below. One details the results of a project containing several samples, another concerns an individual sample, and the third focuses on taxonomic assignment results and visualizations.

Project report

Sample report

Taxonomy report

Flowchart

The flowchart below details the processing steps that occur in the pacmill pipeline, in chronological order.

Extra documentation

More documentation is available at:

http://xapple.github.io/pacmill/pacmill

This documentation is simply generated from the source code with:

$ pdoc --html --output-dir docs --force pacmill
