Tutorial
usage: FeGenie.py [-h] [-DB DB] [-bin_dir BIN_DIR] [-bin_ext BIN_EXT] [-contigs_source CONTIGS_SOURCE] [-bit BIT] [-d D] [-pfam PFAM] [-nr NR] [-out OUT] [-inflation INFLATION] [-t T]
Developed by Arkadiy I. Garber, Kenneth H. Nealson, and Nancy Merino; University of Southern California, Earth Sciences. Please send comments and inquiries to arkadiyg@usc.edu
optional arguments: -h, --help show this help message and exit
-DB HMM database; directory titled 'HMM-lib', can be found in the FerrJinn-master folder
-bin_dir directory of bins
-bin_ext extension for bins (do not include the period)
-contigs_source are the provided contigs from a single organism (single) or metagenomic/metatranscriptomic assemblies (meta)? (default=single)
-bit tsv file with bitscore cut-offs for all HMMs
-d maximum distance between genes to be considered in a genomic 'cluster'. This number should be an integer and should reflect the maximum number of genes in between putative iron-related genes identified by the HMM database (default=10)
-pfam location of Pfam HMM (optional, if you want the identified candidate iron genes to be compared against Pfam)
-nr path to NCBI's nr database (optional, if you want the identified candidate iron genes compared against NCBI)
-out name of output file; please provide full path (default=fegenie_out)
-inflation inflation factor for final gene category counts (default=1000)
-t number of threads to use for DIAMOND BLAST and HMMSEARCH (default=1, max=16)
The input to this program is simply a folder of FASTA files. Each FASTA file should contain contigs from a single genome or a metagenomic assembly. The program needs two other inputs, both of which are provided here: 1) the pHMM library and 2) a text file containing calibrated bitscore cutoffs for each HMM. The workflow of this program is described in the PDF titled "workflow". It is fairly straightforward, but if you have any questions, feel free to send me an email (arkadiyg@usc.edu).
python3 FeGenie.v.5.py -DB HMM-lib/ -bin_dir your/genomes/or/assemblies/ -bin_ext fa -contigs_source single -bit HMM-bitcutoffs.txt -d 10 -out fegenie-out -inflation 100 -t 4
- The '-d' option controls the maximum distance between genes that is allowed for cluster identification. In the sample command above, that parameter is set to 10, meaning that if two genes have fewer than 10 genes between them, they will be considered part of a potential cluster.
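The distance rule described above can be sketched in a few lines of Python. This is only an illustration of the criterion, not FeGenie's actual implementation; the function name `group_into_clusters` and the representation of hits as ORF positions along a contig are assumptions made for the example.

```python
def group_into_clusters(hit_positions, max_distance=10):
    """Group ORF positions of HMM hits into putative clusters.

    Hypothetical helper: two consecutive hits fall in the same cluster
    when fewer than `max_distance` genes lie between them.
    """
    clusters = []
    for pos in sorted(hit_positions):
        # Genes between two hits = difference in position minus one.
        if clusters and pos - clusters[-1][-1] - 1 < max_distance:
            clusters[-1].append(pos)
        else:
            clusters.append([pos])
    return clusters


# Hits at ORFs 3, 5 and 14 are close enough to cluster; ORF 40 stands alone.
print(group_into_clusters([3, 5, 14, 40], max_distance=10))
```

With these made-up positions, the first three hits form one putative cluster and the last hit forms its own group.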
- The '-inflation' parameter scales the values that are output for the heatmap. For the heatmap, the data are normalized to the number of ORFs present in each genome or assembly: the number of genes identified for a particular iron-related functional category in each genome/assembly is divided by the total number of ORFs present in that genome/assembly. The result is a very small number, which represents the proportion of each genome/assembly that is dedicated to that iron-related functional category. You can convert this to a larger number that is easier to read; in the above command, '-inflation 100' means that the small number will be multiplied by 100, essentially giving the "percentage" of each genome/assembly dedicated to a particular iron-related functional category. You can give this option any number you would like.
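As a worked example of the normalization described above, the following Python sketch applies the same arithmetic to made-up numbers. The function name `normalized_count` and the example counts are assumptions for illustration only; they are not taken from FeGenie's code or output.

```python
def normalized_count(category_hits, total_orfs, inflation=1000):
    """Proportion of ORFs in one functional category, scaled by `inflation`.

    Hypothetical helper mirroring the description: hits / total ORFs * inflation.
    """
    if total_orfs <= 0:
        raise ValueError("total_orfs must be a positive integer")
    return category_hits / total_orfs * inflation


# Illustrative numbers: 12 genes in one category, in a genome with 4000 ORFs.
value = normalized_count(12, 4000, inflation=100)
print(round(value, 3))
```

Here 12 of 4000 ORFs is a proportion of 0.003, so with '-inflation 100' the heatmap value would be roughly 0.3, that is, about 0.3 percent of the genome.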
- The '-out' parameter sets the name of the output directory that the program creates. All the output files are written there. There should be two files (a CSV summary file and a CSV heatmap-compatible file) and one folder, which contains all the ORF calls for your genomes or assemblies.
- The '-contigs_source' parameter is mainly for Prodigal. If you are running the program on genomes, set this parameter to "single", which is the default for this program. If you are running this program on metagenome assemblies, set it to "meta", in which case Prodigal will be run with the '-p meta' option.
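The following Python sketch shows one way the '-contigs_source' value could be translated into a Prodigal command line. It is an assumption about how such a call might be assembled, not a copy of what FeGenie does internally; the function name `prodigal_command` and the file names are hypothetical. The `-i` (input FASTA), `-a` (protein translations) and `-p` (procedure) options are standard Prodigal flags.

```python
def prodigal_command(fasta_path, proteins_path, contigs_source="single"):
    """Build (but do not run) a Prodigal command as a list of arguments.

    Hypothetical helper: '-p meta' is added only for metagenomic input;
    for "single", Prodigal's own default procedure is used.
    """
    if contigs_source not in ("single", "meta"):
        raise ValueError("contigs_source must be 'single' or 'meta'")
    command = ["prodigal", "-i", fasta_path, "-a", proteins_path]
    if contigs_source == "meta":
        command += ["-p", "meta"]
    return command


# Placeholder file names, for illustration only.
print(" ".join(prodigal_command("assembly.fa", "assembly-proteins.faa", "meta")))
```

Returning the command as a list makes it easy to pass to `subprocess.run` without shell quoting problems, should you want to run Prodigal yourself.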
- The '-bin_ext' parameter allows you to specify which files in a directory to analyze. This is useful in case you have files in your input directory that are not FASTA files. Even if you have only FASTA files in the directory, you should still provide this parameter with the extension of your genomes or assemblies. Do not include the period; if your files are, for example, genomeA.faa, genomeB.faa, etc., the parameter should look like this: "-bin_ext faa"
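To illustrate how an extension filter of this kind behaves, here is a short Python sketch. It is not FeGenie's own code; the function name `select_bins` is hypothetical, and the example builds a temporary directory with empty placeholder files so that it can be run anywhere.

```python
import tempfile
from pathlib import Path


def select_bins(bin_dir, bin_ext):
    """Return the names of files in `bin_dir` ending in '.<bin_ext>'.

    Hypothetical helper; the extension is given without the period.
    """
    suffix = "." + bin_ext
    return sorted(
        p.name for p in Path(bin_dir).iterdir()
        if p.is_file() and p.name.endswith(suffix)
    )


# Demonstration with empty placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("genomeA.faa", "genomeB.faa", "notes.txt"):
        Path(tmp, name).touch()
    selected = select_bins(tmp, "faa")

print(selected)
```

In this demonstration only the two `.faa` files are selected, and `notes.txt` is ignored.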
- The '-t' option sets how many threads are used for the hmmsearch and blastp commands that the script runs (default=1, max=16).