Configure and run MOSCA

MOSCA is customized through a configuration file written in JSON. This file specifies the parameters of the pipeline, such as the input data and the analysis settings.

Obtaining the Configuration File 🌐

To obtain the configuration file, you can visit MOSGUITO and follow these steps:

Go to "Configurations" -> "General configurations" and select the general parameters of the analysis. Default values are already in place and might not need to be changed.
The following three tabs, "UniProt columns", "UniProt databases" and "KEGG metabolic maps", select the columns and databases for which to obtain information from UniProt, and the metabolic maps on which to represent genomic potential and gene expression.
In the "Experiments" tab, set the information concerning your datasets:
- "Files" - names of the files on the system where MOSCA will be run. They can include folder paths and can contain spaces. For short reads in paired-end format, input the two filenames separated by a comma (",");
- "Sample" - files with the same "Sample" value are considered to belong to similar communities, and will be assembled and binned together. This can improve the quality of assembly/binning, but is not advised if the communities are too different;
- "Data type" - "dna", "mrna" or "protein", for metagenomics, metatranscriptomics or metaproteomics, respectively. The corresponding inputs are Whole Genome Shotgun short reads, RNA-Seq short reads, or LC-MS raw data (spectra);
- "Condition" - determines replicates for differential expression analysis: datasets with the same condition are considered replicates of each other. It only needs to be set for "mrna" and "protein" datasets;
- "Name" - the name under which results concerning the dataset are output. If left blank, MOSCA will determine a name automatically.
You can then save the file to your local machine by going back to "General configurations" and clicking "Download JSON".
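The "Experiments" fields described above can be pictured as one record per dataset. The following is a minimal Python sketch, assuming each dataset is stored as an object keyed by the five field names; the file paths, sample and condition names, and the helper `check_experiment` are hypothetical, and the exact layout that MOSGUITO writes may differ.

```python
import json

# Allowed values for "Data type", as listed in the "Experiments" tab.
DATA_TYPES = {"dna", "mrna", "protein"}

def check_experiment(row):
    """Return a list of problems found in one dataset record (hypothetical helper)."""
    problems = []
    if row["Data type"] not in DATA_TYPES:
        problems.append(f"unknown data type: {row['Data type']!r}")
    elif row["Data type"] in {"mrna", "protein"} and not row["Condition"]:
        # "Condition" only needs to be set for "mrna" and "protein" datasets.
        problems.append("condition not set for an mrna/protein dataset")
    return problems

# Hypothetical datasets: one paired-end MG dataset and two MT replicates.
# Paired-end filenames are separated by a comma; the shared "Sample" value
# means the datasets are treated as coming from similar communities.
experiments = [
    {"Files": "reads/mg_R1.fastq,reads/mg_R2.fastq", "Sample": "reactor1",
     "Data type": "dna", "Condition": "", "Name": "mg"},
    {"Files": "reads/mt_a_R1.fastq,reads/mt_a_R2.fastq", "Sample": "reactor1",
     "Data type": "mrna", "Condition": "control", "Name": "mt_a"},
    {"Files": "reads/mt_b_R1.fastq,reads/mt_b_R2.fastq", "Sample": "reactor1",
     "Data type": "mrna", "Condition": "control", "Name": "mt_b"},
]

for row in experiments:
    print(row["Name"], check_experiment(row))

# Show how one record would look when serialized as JSON.
print(json.dumps(experiments[0], indent=2))
```

The two "mrna" records share the condition "control", so under the rule above they would be treated as replicates of each other.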

Available Parameters 🔍

These are all the parameters available for configuration in MOSCA. Not all of them need to be configured (for example, proteomics parameters have no effect when no proteomics data is input), but MOSCA checks for every one of them, so they must all be kept in the file.

Parameter	Value	Description
version	string	Minimum version of MOSCA required to use the config file. This value is filled in automatically by MOSGUITO.
output	string	Name of folder where MOSCA's results will be stored (if it doesn't exist, it will be created).
resources_directory	string	Name of folder to store databases and other resources for MOSCA.
experiments	Dictionary	With the fields "Files", "Sample", "Data type", "Condition" and "Name", as per the file in the repo.
threads	int	Maximum number of threads for MOSCA to use.
minimum_read_length	int	Minimum length of reads to keep.
minimum_read_average_quality	int	Minimum average quality of reads to keep.
max_memory	int	Maximum memory that MOSCA can use (in GB).
do_assembly	boolean	`true` if MOSCA should do assembly, `false` otherwise.
assembler	string	Name of assembler to use for iterative co-assembly of MG data. Possible values: `metaspades`, `megahit`
error_model	string	Name of the error model for gene calling with FragGeneScan. Possible values are `sanger_5`, `sanger_10`, `454_10`, `454_30`, `illumina_5` and `illumina_10` when Sanger, pyrosequencing or Illumina reads are the input to gene calling. The term before the `_` denotes the technology used, and the term after it denotes the percentage of reads that are expected to be wrongly called. Set to `complete` if assembly was performed.
markerset	string	Name of markerset to use for completeness/contamination estimation with CheckM over the contigs obtained with MaxBin2. `40` if Archaea are significantly present in the community, `107` if the study only concerns Bacteria.
do_binning	boolean	`true` if binning is to be performed, `false` otherwise.
do_iterative_binning	boolean	`true` if iterative binning is to be performed, `false` otherwise.
split_gene_calling	int	Number of parts into which the result of gene calling should be split. Used to save disk space while DIAMOND is running; only needed if the disk runs out of space.
upimapi_database	string	Name of the FASTA or DMND (DIAMOND-formatted database) file to use as input for annotation with DIAMOND. If the UniProt download parameter is `true`, MOSCA will use the downloaded UniProt database instead.
upimapi_max_target_seqs	int	Number of matches to report for each protein from annotation with DIAMOND.
upimapi_taxids	string	Comma-separated list of Tax IDs used to build the "taxids" database of UPIMAPI. It should contain the Tax IDs of all taxa present in the datasets.
upimapi_check_db	boolean	`true` if UniProtKB (SwissProt + TrEMBL) is to be downloaded, `false` if it is already downloaded. It is downloaded to the folder indicated by the "resources_directory" parameter.
uniprot_columns	list	Columns for which to obtain information through UniProt's ID mapping service. Options are available here.
download_cdd_resources	boolean	`true` if reCOGnizer should download resources, `false` otherwise.
recognizer_databases	list	Databases to obtain information from with reCOGnizer.
normalization_method	string	`TMM` (Trimmed Mean of the M-values), `RLE` (Relative Log Expression) or `VSN` (Variance Stabilizing Normalization) method to use for normalization.
minimum_differential_expression	float	DESeq2's differential comparison significance is calculated for a specific ratio. This sets the value for which the null hypothesis will be calculated.
metaproteomics_add_reference_proteomes	boolean	`true` if reference proteomes of identified taxa should be added to the MP database, `false` otherwise.
reference_proteomes_taxa_level	string	Taxonomic level at which to download reference proteomes for the MP database. One of `SUPERKINGDOM`, `PHYLUM`, `CLASS`, `ORDER`, `FAMILY`, `GENUS`, `SPECIES`.
use_crap	boolean	`true` if cRAP should be added to MP database, `false` otherwise.
proteomics_contaminants_database	string	To set a different contaminants database instead of cRAP.
protease	string	If `Trypsin`, MOSCA considers porcine trypsin as the protease and adds its sequence to the MP database.
protease_file	string	If not using Trypsin as protease, specify the path to the FASTA file of the protease.
false_discovery_rate	float	Local FDR threshold to consider when selecting PSMs in protein inference.
keggcharter_taxa_level	string	The taxonomic level to represent with KEGGCharter. Choose one of `SPECIES`, `GENUS`, `FAMILY`, `ORDER`, `CLASS`, `PHYLUM` or `SUPERKINGDOM`.
keggcharter_number_of_taxa	int	Number of the most abundant taxa to represent with KEGGCharter; ideally set under 11.
keggcharter_maps	list	IDs of the metabolic maps on which to chart information. Maps will be generated with genomic potential represented if MG data is available, and with gene/protein expression represented if MT/MP data is available.
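
Several of the string parameters in the table accept only a fixed set of values. The following is a minimal Python sketch of how those values could be checked before running MOSCA. The helper `find_problems` and the example values are hypothetical, the check covers only the enumerated parameters, and it is not MOSCA's own validation.

```python
# Allowed values for the enumerated string parameters, copied from the table.
TAXA_LEVELS = {"SUPERKINGDOM", "PHYLUM", "CLASS", "ORDER",
               "FAMILY", "GENUS", "SPECIES"}
ALLOWED_VALUES = {
    "assembler": {"metaspades", "megahit"},
    "error_model": {"sanger_5", "sanger_10", "454_10", "454_30",
                    "illumina_5", "illumina_10", "complete"},
    "markerset": {"40", "107"},
    "normalization_method": {"TMM", "RLE", "VSN"},
    "reference_proteomes_taxa_level": TAXA_LEVELS,
    "keggcharter_taxa_level": TAXA_LEVELS,
}

def find_problems(config):
    """Report missing or invalid enumerated parameters (hypothetical helper)."""
    problems = []
    for key, allowed in ALLOWED_VALUES.items():
        if key not in config:
            problems.append(f"missing parameter: {key}")
        elif config[key] not in allowed:
            problems.append(f"invalid value for {key}: {config[key]!r}")
    return problems

# Hypothetical excerpt of a configuration, as it would look after json.load().
config = {
    "assembler": "metaspades",
    "error_model": "illumina_10",
    "markerset": "40",
    "normalization_method": "TMM",
    "reference_proteomes_taxa_level": "SPECIES",
    "keggcharter_taxa_level": "GENUS",
}

print(find_problems(config))  # prints [] for this excerpt
```

A real configuration file would be loaded with `json.load` and would also contain the remaining parameters from the table, which this sketch does not inspect.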
