
Tutorial

Arkadiy-Garber edited this page Jan 20, 2019 · 20 revisions

Developed by Arkadiy I. Garber, Kenneth H. Nealson, and Nancy Merino; University of Southern California, Earth Sciences. Please send comments and inquiries to arkadiyg@usc.edu.


Mandatory Arguments:

-h, --help show this help message and exit

-DB HMM database; directory titled 'HMM-lib', can be found in the FerrJinn-master folder

-bin_dir directory of genomes or metagenome assemblies

-bin_ext file extension for FASTA files in the input directory (do not include the period)

Optional Arguments:

-contigs_source are the provided contigs from a single organism (single) or metagenomic/metatranscriptomic assemblies (meta)? (default=single)

-d maximum distance between genes to be considered in a genomic 'cluster'. This number should be an integer and should reflect the maximum number of genes in between putative iron-related genes identified by the HMM database (default=10)

-ref path to a reference protein database (in FASTA format)

-out name of output directory (default=fegenie_out)

-inflation inflation factor for final gene category counts (default=1000)

-t number of threads to use for DIAMOND BLAST and HMMSEARCH (default=1, max=16)
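
To make the '-d' option more concrete, here is a minimal sketch of how hits on one contig could be grouped by the number of genes between them. This is not FeGenie's own code: the function name `group_into_clusters`, the use of ORF indices as positions, and the choice to treat the cutoff as inclusive ("at most d genes in between") are all assumptions made for illustration.

```python
def group_into_clusters(hit_positions, max_genes_between=10):
    """Group ORF indices of HMM hits into putative clusters.

    hit_positions: ORF indices along one contig where a hit was found
        (a hypothetical input format, chosen for this example).
    max_genes_between: assumed reading of '-d'; two consecutive hits share
        a cluster when at most this many genes lie between them.
    """
    clusters = []
    for pos in sorted(hit_positions):
        # Genes strictly between this hit and the previous hit in the cluster.
        if clusters and pos - clusters[-1][-1] - 1 <= max_genes_between:
            clusters[-1].append(pos)
        else:
            clusters.append([pos])
    return clusters


hits = [3, 5, 14, 40, 41]  # made-up ORF indices
print(group_into_clusters(hits, max_genes_between=10))
# prints [[3, 5, 14], [40, 41]]
```

In this made-up example, 25 genes separate ORF 14 from ORF 40, so a second group starts there. A smaller '-d' value would split the hits into more, tighter groups.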

Input

The input to this program is simply a folder of FASTA files. The FASTA files should contain contigs from a single genome or a metagenomic assembly. There are two other inputs to this program, which are provided here: 1) the pHMM library and 2) a text file containing calibrated bitscore cutoffs for each HMM. The workflow of this program is described in the PDF titled "workflow". It is fairly straightforward, but if you, the user, have any questions, feel free to shoot me an email (arkadiyg@usc.edu).
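
Before a run, it can help to check which files in the input folder carry the extension you plan to pass to '-bin_ext'. The sketch below is one way to do that. The helper `list_fasta_files` is hypothetical, and matching on the text after the final period is an assumption about how the extension is compared, not a description of FeGenie's internals.

```python
import tempfile
from pathlib import Path


def list_fasta_files(bin_dir, bin_ext):
    """Return names of files in bin_dir ending in '.' + bin_ext.

    bin_ext is given without the period, as for '-bin_ext'.
    """
    suffix = "." + bin_ext
    return sorted(
        p.name
        for p in Path(bin_dir).iterdir()
        if p.is_file() and p.name.endswith(suffix)
    )


# Demonstration on a throwaway folder with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("genomeA.fa", "genomeB.fa", "notes.txt"):
        Path(tmp, name).touch()
    print(list_fasta_files(tmp, "fa"))
    # prints ['genomeA.fa', 'genomeB.fa']
```

If the list comes back empty, the extension given probably does not match the files in the folder (for example, "fa" versus "fasta").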

Sample command (simplest form):

FeGenie.py -bin_dir path/to/your/genomes/directory/ -bin_ext fa -out fegenie_output
  • The '-d' option controls the maximum distance between genes that is allowed for cluster identification. The sample command above does not set it, so the default of 10 applies, meaning that if two genes have fewer than 10 genes between them, they will be considered part of a potential cluster.

  • The '-inflation' parameter dictates the values that are output in the heatmap. For the heatmap, the data are normalized to the number of ORFs present in each genome or assembly; this is done by dividing the number of genes identified for a particular iron-related functional category for each genome/assembly by the total number of ORFs present in that genome/assembly. The end result is a very small number, which represents the proportion of each genome/assembly that is dedicated to a particular iron-related functional category. You can convert this to a larger number that is easier to read: for example, '-inflation 100' multiplies that small number by 100, essentially giving the "percentage" of each genome/assembly dedicated to a particular iron-related functional category. The sample command above does not set this option, so the default of 1000 applies. You can give this option any number you would like.

  • The '-out' parameter sets the name of the output directory that will be created by the program. All the output files will be in there. There should be two files (a CSV summary file and a CSV heatmap-compatible file) and one folder, which will contain all the ORF-calls for your genomes or assemblies.

  • The '-contigs_source' parameter is mainly for Prodigal. If you are running the program on genomes, set this parameter to "single", which is the default. If you are running this program on metagenome assemblies, set it to "meta", in which case Prodigal will be run with the '-p meta' option.

  • The '-bin_ext' parameter allows you to specify which files in a directory to analyze. This is useful in case you have files in your input directory that are not FASTA files. Even if you have only FASTA files in the directory, you should still provide this parameter with the extension of your genomes or assemblies. Do not include the period; so if your files are, for example, genomeA.faa, genomeB.faa, etc., the parameter should look like this: "-bin_ext faa"

  • The '-t' option allows you to set how many threads are to be used for the hmmsearch and blastp commands that are used in the script.
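
The normalization described for '-inflation' comes down to a single line of arithmetic: divide the gene count for a category by the total number of ORFs, then multiply by the inflation factor. The sketch below works through it with made-up numbers. The function `normalized_counts` and the category names are hypothetical and are not taken from FeGenie's source or output.

```python
def normalized_counts(category_counts, total_orfs, inflation=1000):
    """Scale per-category gene counts by genome size and an inflation factor.

    category_counts: dict mapping a category name to the number of genes
        identified for it (names here are illustrative only).
    total_orfs: total number of ORFs in the genome or assembly.
    inflation: multiplier applied to the proportion (default 1000, as for
        '-inflation').
    """
    if total_orfs <= 0:
        raise ValueError("total_orfs must be positive")
    return {
        category: count / total_orfs * inflation
        for category, count in category_counts.items()
    }


# Made-up example: a genome with 4000 ORFs.
counts = {"category_A": 4, "category_B": 12}
print(normalized_counts(counts, total_orfs=4000, inflation=100))
```

With these numbers, category_A accounts for 4 of 4000 ORFs, a proportion of 0.001, which an inflation factor of 100 turns into roughly 0.1 (that is, about 0.1% of the genome).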
