
Tutorial

Arkadiy-Garber edited this page Apr 13, 2020 · 20 revisions

Please send comments and inquiries to rkdgarber@gmail.com


Mandatory Arguments:

-h, --help show this help message and exit

-bin_dir directory of genomes or metagenome assemblies

-bin_ext file extension for FASTA files in the input directory (do not include the period)

Optional Arguments:

-d maximum distance between genes to be considered in a genomic 'cluster'. This number should be an integer and should reflect the maximum number of genes in between putative iron-related genes identified by the HMM database (default=10)

-ref path to a reference protein database (in FASTA format)

-out name of output directory (default=fegenie_out)

-inflation inflation factor for final gene category counts (default=1000)

-t number of threads to use for DIAMOND BLAST and HMMSEARCH (default=1, max=16)

-bams a tab-delimited file with two columns: first column has the genome or metagenome file names; second column has the corresponding BAM file (provide full path to the BAM file). Use this option if you have genomes that each have different BAM files associated with them. If you have a set of bins from a single metagenome sample and, thus, have only one BAM file, then use the '-bam' option. BAM files are only required if you would like to create a heatmap that summarizes the abundance of a certain gene that is based on read coverage, rather than gene counts. See master repository for sample file to provide to the '-bams' argument

-bam BAM file. This option is only required if you would like to create a heatmap that summarizes the abundance of a certain gene that is based on read coverage, rather than gene counts. If you have more than one BAM file corresponding to different genomes that you are providing, please use the '-bams' argument to provide a tab-delimited file that denotes which BAM file (or files) belongs with which genome.
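
As an illustration of the two-column layout described for '-bams', here is a minimal Python sketch that builds the tab-delimited text from (genome file name, BAM path) pairs. The function name `format_bams_table` and the example file names are hypothetical, and the sketch assumes one BAM file per genome; consult the sample file in the master repository for the authoritative format, especially if a genome has several BAM files.

```python
def format_bams_table(pairs):
    """Return tab-delimited text: genome file name, then full BAM path.

    `pairs` is an iterable of (genome_file, bam_path) tuples.
    Assumes exactly one BAM file per genome (an assumption of this sketch).
    """
    lines = []
    for genome_file, bam_path in pairs:
        # A tab inside either field would corrupt the two-column layout.
        if "\t" in genome_file or "\t" in bam_path:
            raise ValueError("fields must not contain tab characters")
        lines.append(f"{genome_file}\t{bam_path}")
    return "\n".join(lines) + "\n"


# Hypothetical example input; replace with your own file names and paths.
table = format_bams_table([
    ("genomeA.fa", "/full/path/to/genomeA.bam"),
    ("genomeB.fa", "/full/path/to/genomeB.bam"),
])
print(table, end="")
```

The resulting text can be saved to a file and that file passed to '-bams'.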

Input

The input to this program is simply a folder of FASTA files. The FASTA files should contain contigs from a single genome or a metagenomic assembly. There are two other inputs to this program, which are provided here. These inputs are 1) the pHMM library and 2) a text file containing calibrated bitscore cutoffs for each HMM. The workflow of this program is described in the PDF titled "workflow". It is fairly straightforward, but if you, the user, have any questions, feel free to shoot me an email (arkadiyg@usc.edu).

Sample command (simplest form):

FeGenie.py -bin_dir path/to/your/genomes/directory/ -bin_ext fa -out fegenie_output

Sample command (if providing gene-calls):

FeGenie.py -bin_dir path/to/your/genomes/directory/ -bin_ext faa -out fegenie_output --orfs

Sample command (if providing gbk files):

FeGenie.py -bin_dir path/to/your/genomes/directory/ -bin_ext gbk -out fegenie_output --gbk

Sample command (with more parameters specified):

FeGenie.py -bin_dir path/to/your/genomes/directory/ -bin_ext fa -out fegenie_output -ref path/to/protein/database/RefSeqDB.faa -t 16 -inflation 100 -d 5 --meta

Note: In the above command, we specified that we are analyzing metagenome assemblies, rather than single genomes or bins. We also constrained gene operon sizes by stating that no more than 5 genes may be in between two iron genes for them to be put into a gene "cluster". The '-inflation' argument was given so that, in the heatmap, each iron category will be displayed as a "percentage of genome". Finally, we specified a path to a reference protein database for cross-validation of putative iron genes.
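
To make the distance constraint concrete, here is a minimal Python sketch of one way such grouping could work. It is not FeGenie's actual implementation: the function name `group_into_clusters` is hypothetical, ORFs are represented simply by their ordinal position along a contig, and the boundary is assumed to be inclusive (at most `max_between` genes in between).

```python
def group_into_clusters(hit_positions, max_between=10):
    """Group iron-gene hits into clusters by gene distance.

    `hit_positions` holds the ordinal positions (1st ORF, 2nd ORF, ...)
    of iron-related hits on a single contig. Two consecutive hits share
    a cluster when at most `max_between` other genes lie between them.
    """
    clusters = []
    for pos in sorted(hit_positions):
        # Genes strictly between this hit and the previous hit.
        if clusters and pos - clusters[-1][-1] - 1 <= max_between:
            clusters[-1].append(pos)
        else:
            clusters.append([pos])
    return clusters


# Hits at ORFs 3, 5, 11, 30 and 31 with a limit of 5 intervening genes.
print(group_into_clusters([3, 5, 11, 30, 31], max_between=5))
```

With a limit of 5, hits 3, 5 and 11 fall into one group and hits 30 and 31 into another, because 18 genes separate ORF 11 from ORF 30.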

  • The '-d' option controls the distance between genes that is allowable for cluster identification. In the sample command above, that parameter is set to 5, meaning that if two iron-related genes have no more than 5 genes between them, they will be considered as part of a potential cluster.

  • The '-inflation' parameter dictates the values that are output in the heatmap. For the heatmap, the data are normalized to the number of ORFs present in each genome or assembly; this is done by dividing the number of genes identified for a particular iron-related functional category for each genome/assembly by the total number of ORFs present in that genome/assembly. The result is a very small number, which represents the proportion of each genome/assembly that is dedicated to a particular iron-related functional category. You can convert this to a larger number that is easier to read; in the above command, '-inflation 100' means that the small number will be multiplied by 100, essentially giving the "percentage" of each genome/assembly dedicated to a particular iron-related functional category. You can give this option any number you would like.

  • The '-out' parameter sets the name of the output directory that will be created by the program. All the output files will be in there. There should be two files (a CSV summary file and a CSV heatmap-compatible file) and one folder, which will contain all the ORF-calls for your genomes or assemblies.

  • The '--meta' parameter is for Prodigal. If you are running the software on metagenome assemblies, you should include this flag, in which case, Prodigal will be run with the '-p meta' option.

  • The '-bin_ext' parameter allows you to specify which files in a directory to analyze. This is useful in case you have files in your input directory that are not FASTA files. Even if you have only FASTA files in the directory, you should still provide this parameter with the extension for your genomes or assemblies. Do not include the period; so if your files are, for example, genomeA.faa, genomeB.faa, etc., the parameter should look like this: "-bin_ext faa"

  • The '-t' option allows you to set how many threads are to be used for the hmmsearch and blastp commands that are used in the script.
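
The normalization described for '-inflation' amounts to a single formula: gene count divided by total ORFs, multiplied by the inflation factor. The Python sketch below illustrates that arithmetic only. The function name `normalized_abundance` and the example numbers are hypothetical, and the program's internal code may differ.

```python
def normalized_abundance(gene_count, total_orfs, inflation=1000):
    """Heatmap-style value for one iron category in one genome/assembly.

    gene_count: genes identified for the category
    total_orfs: total ORFs predicted in the genome/assembly
    inflation:  multiplier applied to the raw proportion
                (1000 is the documented default)
    """
    if total_orfs <= 0:
        raise ValueError("total_orfs must be a positive number")
    return gene_count / total_orfs * inflation


# Hypothetical genome: 12 genes in one category out of 4000 ORFs.
# With inflation=100 the value reads as a percentage of the genome.
print(round(normalized_abundance(12, 4000, inflation=100), 4))
```

In this made-up case the raw proportion is 0.003, so an inflation of 100 gives about 0.3, that is, roughly 0.3% of the genome's ORFs.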

Output

The output of this program features three files:

  1. FeGenie-geneSummary.csv
  2. FeGenie-geneSummary-clusters.csv
  3. FeGenie-heatmap.csv

The two 'geneSummary' files are essentially the same, except that the 'geneSummary-clusters' file features a row of '#' symbols in between each identified gene or gene neighborhood. This is done to make it easier to visually inspect the output.

The "geneSummary" output files should have 9 columns. The columns represent the following values:

  1. Iron-gene category
  2. Input file-name
  3. Predicted identifier for open-reading frame (ORF)
  4. Matching HMM family
  5. Bit score for HMM match
  6. Trusted bit score cut-off for HMM family
  7. Unique gene neighborhood identifier (arbitrary number, intended to be used if parsing the output file)
  8. Number of predicted heme-binding motifs in the identified gene.
  9. Amino acid sequence
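
If you want to parse the output, the nine columns listed above can be read by position. The Python sketch below is one hedged way to do so. The names in `COLUMNS` are hypothetical labels chosen for this example (they are not the file's actual header text), and it assumes comma-separated fields, that '#' separator rows begin with '#', and that you tell it whether the file starts with a header row.

```python
import csv
from collections import Counter

# Hypothetical labels for the nine documented columns, in order.
COLUMNS = [
    "category", "input_file", "orf_id", "hmm", "bitscore",
    "cutoff", "cluster_id", "heme_motifs", "sequence",
]


def read_gene_summary(lines, has_header=False):
    """Parse geneSummary-style CSV lines into a list of dicts.

    `lines` is any iterable of text lines (an open file works).
    Blank rows and '#' separator rows are skipped; rows without
    exactly nine fields are ignored.
    """
    records = []
    for fields in csv.reader(lines):
        if not fields or fields[0].startswith("#"):
            continue
        records.append(fields)
    if has_header:
        records = records[1:]  # drop the header row
    return [dict(zip(COLUMNS, f)) for f in records if len(f) == len(COLUMNS)]


def count_by_category(rows):
    """Count identified genes per (input file, iron-gene category)."""
    return Counter((row["input_file"], row["category"]) for row in rows)
```

For example, `with open("FeGenie-geneSummary-clusters.csv") as fh: rows = read_gene_summary(fh)` would load the file, provided the assumptions above hold for your output.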