The pacmill
python package is a bioinformatics pipeline that is developed to process microbial 16S amplicon sequencing data. It is specialized in the analysis of long reads such as those provided by PacBio sequencers.
Since pacmill
is written in python, it is compatible with all operating systems: Linux, macOS and Windows. The only prerequisite is python3
(which is often installed by default) along with the pip3
package manager.
To check if you have python3
installed, type the following on your terminal:
$ python3 -V
If you do not have python3
installed, please refer to the section obtaining python3.
To check if you have pip3
installed, type the following on your terminal:
$ pip3 -V
If you do not have pip3
installed, please refer to the section obtaining pip3.
To install the pacmill
package, simply type the following commands on your terminal:
$ pip3 install --user pacmill
Alternatively, if you want to install it for all users of the system:
$ sudo pip3 install pacmill
These commands will also automatically install all the other python modules on which pacmill
depends.
The pacmill
pipeline also depends on several shell commands being available. The following executables should be present in your $PATH
environment variable:
fastQValidator
,fastqc
,barrnap
,vsearch
,mothur
,xelatex
,fastq-dump
If any of these required external programs are missing, you will be prompted to install them and given easy instructions to do so.
The first thing to do when starting a new analysis is to fill in a metadata file that details all there is to know about the biological samples being processed.
An empty template for such a file is found under this repository at pacmill/metadata/metadata_blank.xlsx
. You can make a copy of this file for every new project.
In addition, another file named pacmill/metadata/metadata_example.xlsx
shows typical values that the fields are supposed to take along with a short documentation for each entry. A excerpt of this file is shown below:
Bellow are some examples to illustrate the various ways there are to use this package.
# This example is not completed yet. TODO.
To change the text that appears inside the header of the PDF reports generated, you can adjust these three environment variables to your liking. Credit is appreciated where credit is due, but the software has a very permissive license that lets you decide what is best.
$ export PACMILL_HEADER="From the \textbf{pacmill} project"
$ export PACMILL_SUBHEADER="Written by consultants at \url{www.sinclair.bio}""
$ export PACMILL_LINK="Hosted at \url{www.github.com/xapple/pacmill}"
In order to test and evaluate the pipeline, we have provided a demonstration project ready to be processed. This enables the user to see what type of outputs are generated by pacmill
without having to bring his own DNA sequence data. Five samples are included and are taken from the following publication:
- "Confident phylogenetic identification of uncultured prokaryotes through long read amplicon sequencing of the 16S-ITS-23S rRNA operon.''
- Joran Martijn, [many others], Thijs Ettema.
- Science for Life Laboratory, Uppsala University
- https://doi.org/10.1111/1462-2920.14636
The samples are publicly accessible on SRA and are described as follows:
mock
: Genomic DNA from 38 phylogenetically distinct and diverse bacteria and archaea.p19
: Sediment sample obtained from hot spring Radiata Pool, Ngatamariki, New Zealand.pm3
: Sediment sample taken from 1.25m below the sea floor using a gravity core at Aarhus Bay, Denmark.sala
: Black biofilm that was taken at 60m depth in an old silver mine near Sala, Sweden.tns08
: Sediment sample taken from a shallow submarine hydrothermal vent field near Taketomi Island, Japan.
To run the demo project, do the following:
# This example is not completed yet. TODO.
The pacmill
pipeline produces a multitude of graphs and visualizations after having processed the sequence data. Below are two examples. Firstly a sequence length distribution of cleaned reads. Secondly, a barstack of taxonomic assignments for five different samples at the phylum level.
After running the pipeline on a set of FASTQ files, several PDF reports are auto-generated. Examples of three reports are given below. The first concerns an individual sample while the second details the results of a project containing several samples. The third focuses on taxonomic assignment results and visualizations.
Project report |
Sample report |
Taxonomy report |
Below is presented a flowchart detailing the multiple processing steps that occur in the pacmill
pipeline in a chronological order.
More documentation is available at:
http://xapple.github.io/pacmill/pacmill
This documentation is simply generated from the source code with:
$ pdoc --html --output-dir docs --force pacmill