
Tutorial

Arkadiy-Garber edited this page Apr 13, 2020 · 20 revisions

Please send comments and inquiries to rkdgarber@gmail.com


Mandatory Arguments:

-h, --help show this help message and exit

-bin_dir directory of genomes or metagenome assemblies

-bin_ext file extension for FASTA files in the input directory (do not include the period)

Optional Arguments:

-contigs_source are the provided contigs from a single organism (single) or metagenomic/metatranscriptomic assemblies (meta)? (default=single)

-d maximum distance between genes to be considered in a genomic 'cluster'. This number should be an integer and should reflect the maximum number of genes in between putative iron-related genes identified by the HMM database (default=10)

-ref path to a reference protein database (in FASTA format)

-out name of output file (default=fegenie_out)

-inflation inflation factor for final gene category counts (default=1000)

-t number of threads to use for DIAMOND BLAST and HMMSEARCH (default=1, max=16)

-bam a tab-delimited file with two columns: first column has the genome or metagenome file names; second column has the corresponding BAM file

-bams number of threads to use for DIAMOND BLAST and HMMSEARCH (default=1, max=16)

Input

The input to this program is simply a folder of FASTA files. Each FASTA file should contain contigs from a single genome or a metagenomic assembly. There are two other inputs to this program, which are provided here: 1) the pHMM library and 2) a text file containing calibrated bitscore cutoffs for each HMM. The workflow of this program is described in the PDF titled "workflow". It is fairly straightforward, but if you have any questions, feel free to send me an email (arkadiyg@usc.edu).
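Before a run, it can help to confirm which files in the input folder would be picked up. The following is a minimal sketch and not part of FeGenie: `list_input_fastas` is a hypothetical helper, and it assumes a file is selected when the text after its final period equals the extension passed to -bin_ext.

```python
from pathlib import Path


def list_input_fastas(bin_dir, bin_ext):
    """Return sorted names of files in bin_dir whose extension equals bin_ext.

    Assumption: a file matches when the text after its final period is
    exactly bin_ext (given without the period, as for -bin_ext).
    """
    # Tolerate an accidental leading period in the extension.
    wanted = "." + bin_ext.lstrip(".")
    return sorted(
        p.name
        for p in Path(bin_dir).iterdir()
        if p.is_file() and p.suffix == wanted
    )
```

For example, `list_input_fastas("path/to/your/genomes/directory/", "fa")` would return the names of the files in that folder ending in `.fa`, leaving out any other files stored alongside them.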

Sample command (simplest form):

FeGenie.py -bin_dir path/to/your/genomes/directory/ -bin_ext fa -out fegenie_output

Sample command (with more parameters specified):

FeGenie.py -bin_dir path/to/your/genomes/directory/ -bin_ext fa -out fegenie_output -ref path/to/protein/database/RefSeqDB.faa -t 16 -inflation 100 -d 5 -contigs_source meta

Be sure to run FeGenie.py as ./FeGenie.py if it is not in your PATH.

Note: In the above command, we specified that we are analyzing metagenome assemblies, rather than single genomes or bins. We also constrained gene operon sizes by stating that no more than 5 genes may be in between two iron genes for them to be put into a gene "cluster". The '-inflation' argument was given so that in the heatmap, each iron category will be displayed as a "percentage of genome". Finally, we specified a path to a reference protein database for cross-validation of putative iron genes.

  • The '-d' option controls the maximum distance between genes that is allowable for cluster identification. In the sample command above, that parameter is set to 5, meaning that if two genes have no more than 5 genes between them, they will be considered as part of a potential cluster.

  • The '-inflation' parameter dictates the values that are output in the heatmap. For the heatmap, the data are normalized to the number of ORFs present in each genome or assembly; this is done by dividing the number of genes identified for a particular iron-related functional category for each genome/assembly by the total number of ORFs present in that genome/assembly. The end result is a very small number, which represents the proportion of each genome/assembly that is dedicated to a particular iron-related functional category. You can convert this to a larger number that is easier to look at; in the above command, '-inflation 100' means that the small number will be multiplied by 100, essentially giving the "percentage" of each genome/assembly dedicated to a particular iron-related functional category. You can give this option any number you would like.

  • The '-out' parameter sets the name of the output directory that will be created by the program. All the output files will be in there. There should be two files (a CSV summary file and a CSV heatmap-compatible file) and one folder, which will contain all the ORF-calls for your genomes or assemblies.

  • The '-contigs_source' parameter is for Prodigal. If you are running the software on metagenome assemblies, you should set it to 'meta', in which case Prodigal will be run with the '-p meta' option.

  • The '-bin_ext' parameter allows you to specify which files in a directory to analyze. This is useful in case you have files in your input directory that are not FASTA files. Even if you have only FASTA files in the directory, you should still provide this parameter with the extension for your genomes or assemblies. Do not include the period; so if your files are, for example, genomeA.faa, genomeB.faa, etc., the parameter should look like this: "-bin_ext faa"

  • The '-t' option allows you to set how many threads are to be used for the hmmsearch and blastp commands that are used in the script.
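The arithmetic behind '-inflation' and the gene-distance rule behind '-d' can be sketched as follows. This illustrates the descriptions above and is not FeGenie's own code: the function names are hypothetical, ORFs are assumed to be numbered consecutively along a contig, and the '-d' threshold is assumed to be inclusive (at most d genes in between).

```python
def inflated_proportion(category_gene_count, total_orfs, inflation=1000):
    """Genes in one iron category divided by total ORFs, times the inflation factor."""
    if total_orfs <= 0:
        raise ValueError("total_orfs must be positive")
    return category_gene_count / total_orfs * inflation


def group_by_distance(orf_positions, d=10):
    """Group ORF positions on one contig into candidate clusters.

    Two consecutive hits join the same group when at most d genes lie
    between them (assumed inclusive threshold).
    """
    groups = []
    for pos in sorted(orf_positions):
        # pos - previous - 1 is the number of genes strictly in between.
        if groups and pos - groups[-1][-1] - 1 <= d:
            groups[-1].append(pos)
        else:
            groups.append([pos])
    return groups
```

Under these assumptions, 12 genes of one category in a genome with 4,000 ORFs and '-inflation 100' give a value of about 0.3, that is, roughly 0.3% of the genome. Hits at ORF positions 3, 5, 12 and 40 form the groups [3, 5], [12] and [40] with '-d 5', while the default of 10 merges the first three into one group.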

Output

The output of this program features three files:

  1. FeGenie-geneSummary.csv
  2. FeGenie-geneSummary-clusters.csv
  3. FeGenie-heatmap.csv

The two 'geneSummary' files are essentially the same, except that the 'geneSummary-clusters' file features a row of '#' symbols in between each identified gene or gene neighborhood. This is done to make it easier to visually inspect the output.

The "geneSummary" output files should have 9 columns. The columns represent the following values:

  1. Iron-gene category
  2. Input file-name
  3. Predicted identifier for open-reading frame (ORF)
  4. Matching HMM family
  5. Bit score for HMM match
  6. Trusted bit score cut-off for HMM family
  7. Unique gene neighborhood identifier (arbitrary number, intended to be used if parsing the output file)
  8. Number of predicted heme-binding motifs in the identified gene.
  9. Amino acid sequence
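For downstream parsing, the nine columns listed above can be read with Python's standard csv module. The sketch below makes several assumptions that should be checked against a real output file: the field names are hypothetical labels chosen here, rows that do not have exactly nine fields are skipped, and rows whose bit score is not numeric (for example a header line or a '#' separator row) are skipped as well.

```python
import csv
from collections import defaultdict

# Hypothetical labels for the nine columns, in the order listed above.
FIELDS = [
    "category", "input_file", "orf_id", "hmm", "bitscore",
    "bitscore_cutoff", "neighborhood_id", "heme_motifs", "sequence",
]


def read_gene_summary(handle):
    """Yield one dict per gene row from an open geneSummary CSV handle."""
    for row in csv.reader(handle):
        if len(row) != len(FIELDS):
            continue  # blank or malformed line
        record = dict(zip(FIELDS, row))
        try:
            record["bitscore"] = float(record["bitscore"])
        except ValueError:
            continue  # header line or '#' separator row
        yield record


def genes_by_neighborhood(records):
    """Group ORF identifiers by their gene neighborhood identifier."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec["neighborhood_id"]].append(rec["orf_id"])
    return dict(groups)
```

A typical use would be to open one of the 'geneSummary' files with `open(path, newline="")`, pass the handle to `read_gene_summary`, and then pass the resulting records to `genes_by_neighborhood` to see which ORFs share a neighborhood identifier.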