Configure and run MOSCA

João Sequeira edited this page Jun 14, 2024 · 1 revision

MOSCA is configured through a file written in JSON format. This configuration file specifies the input data and the analysis parameters for the pipeline.

Obtaining the Configuration File 🌐

To obtain the configuration file, you can visit MOSGUITO and follow these steps:

  1. Go to "Configurations" -> "General configurations" and select the general parameters of analysis. Default values are already in place, and might not need to be changed.

  2. The following three tabs - "UniProt columns", "UniProt databases" and "KEGG metabolic maps" - concern the selection of lists of columns and databases for which to obtain information from UniProt, and metabolic maps for which to represent genomic potential and gene expression.

  3. Set, in the "Experiments" tab, the information concerning your datasets.

    • "Files" - names of the files in the system where MOSCA will be run. Can link to folders, and can contain spaces. For paired-end short reads, input both filenames separated by a comma (",");
    • "Sample" - files with the same "Sample" value are considered to belong to similar communities, and will be assembled and binned together. This can improve the quality of assembly/binning, but is not advised if the communities are too different;
    • "Data type" - "dna", "mrna" or "protein", for metagenomics, metatranscriptomics or metaproteomics respectively. The corresponding inputs are Whole Genome Shotgun short reads, RNA-Seq short reads, or LC-MS raw data (spectra);
    • "Condition" - determines replicates for differential expression analysis: datasets with the same condition are considered replicates of each other. It only needs to be set for "mrna" and "protein" datasets;
    • "Name" - the name under which results for the dataset are output. If left blank, MOSCA will determine a name automatically.
  4. You can then save the file to your local machine by going back to "General configurations" and clicking "Download JSON".
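
The rules for the "Experiments" tab can be sketched in Python as a quick pre-flight check. This is only an illustration: it assumes each row of the tab is represented as a dictionary keyed by the column names, which may not match the exact layout of the JSON written by MOSGUITO. The helper names (`split_files`, `check_experiment`, `group_files`) and the example file names are hypothetical and are not part of MOSCA.

```python
# Sketch only: rows are ASSUMED to be dictionaries keyed by the tab's column names.
DATA_TYPES = ("dna", "mrna", "protein")
NEEDS_CONDITION = ("mrna", "protein")  # "Condition" only matters for these types


def split_files(files_field):
    """Split a "Files" value on commas (paired-end reads give two names)."""
    return [name for name in files_field.split(",") if name]


def check_experiment(row):
    """Return a list of problems found in one row of the experiments table."""
    problems = []
    if not split_files(row.get("Files", "")):
        problems.append("Files is empty")
    data_type = row.get("Data type", "")
    if data_type not in DATA_TYPES:
        problems.append(f"unknown Data type: {data_type!r}")
    if data_type in NEEDS_CONDITION and not row.get("Condition"):
        problems.append(f"Condition is required for {data_type} datasets")
    return problems


def group_files(rows, field):
    """Group file names by a column, e.g. "Sample" (assembled and binned together)
    or "Condition" (replicates of each other)."""
    groups = {}
    for row in rows:
        groups.setdefault(row[field], []).extend(split_files(row["Files"]))
    return groups


# Hypothetical example rows: one metagenome and two metatranscriptome replicates.
EXAMPLE_ROWS = [
    {"Files": "mg_R1.fastq,mg_R2.fastq", "Sample": "S1",
     "Data type": "dna", "Condition": "", "Name": ""},
    {"Files": "mt_a_R1.fastq,mt_a_R2.fastq", "Sample": "S1",
     "Data type": "mrna", "Condition": "c1", "Name": "MT_a"},
    {"Files": "mt_b_R1.fastq,mt_b_R2.fastq", "Sample": "S1",
     "Data type": "mrna", "Condition": "c1", "Name": "MT_b"},
]
```

Calling `check_experiment` on each row before running MOSCA would flag, for instance, an "mrna" dataset with no "Condition", while `group_files(EXAMPLE_ROWS, "Sample")` shows which files would be treated as one community.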

Available Parameters 🔍

These are all the parameters available for configuration in MOSCA. You may not need to configure all of them (for example, proteomics parameters have no effect when no proteomics data is provided), but MOSCA checks that every parameter is present, so none of them should be removed from the file.

| Parameter | Type | Description |
| --- | --- | --- |
| version | string | Minimum version of MOSCA required to use the config file. This information is filled in automatically by MOSGUITO. |
| output | string | Name of the folder where MOSCA's results will be stored (if it doesn't exist, it will be created). |
| resources_directory | string | Name of the folder in which to store databases and other resources for MOSCA. |
| experiments | dictionary | With the fields "Files", "Sample", "Data type", "Condition" and "Name", as per the file in the repo. |
| threads | int | Maximum number of threads for MOSCA to use. |
| minimum_read_length | int | Minimum length of reads to keep. |
| minimum_read_average_quality | int | Minimum average quality of reads to keep. |
| max_memory | int | Maximum memory that MOSCA can use (in GB). |
| do_assembly | boolean | true if MOSCA should do assembly, false otherwise. |
| assembler | string | Name of the assembler to use for iterative co-assembly of MG data. Possible values: metaspades, megahit. |
| error_model | string | Name of the error model for gene calling with FragGeneScan. Possible values are sanger_5, sanger_10, 454_10, 454_30, illumina_5 and illumina_10, for Sanger, pyrosequencing or Illumina reads used as input to gene calling. The term before the _ denotes the sequencing technology, and the term after it indicates the expected sequencing error rate (in FragGeneScan's documentation, 5, 10 and 30 correspond to roughly 0.5%, 1% and 3%). Set to complete if assembly was performed. |
| markerset | string | Name of the marker set to use for completeness/contamination estimation with CheckM over the contigs obtained with MaxBin2. 40 if Archaea are significantly present in the community, 107 if the study only concerns Bacteria. |
| do_binning | boolean | true if binning is to be performed, false otherwise. |
| do_iterative_binning | boolean | true if iterative binning is to be performed, false otherwise. |
| split_gene_calling | int | Number of parts into which the result of gene calling should be split. Used to save disk space while DIAMOND is running; only needed if the disk runs out of space. |
| upimapi_database | string | Name of the FASTA or DMND (DIAMOND formatted database) file to use as input for annotation with DIAMOND. If the download UniProt parameter is "true", MOSCA will use the downloaded UniProt database instead. |
| upimapi_max_target_seqs | int | Number of matches to report for each protein from annotation with DIAMOND. |
| upimapi_taxids | string | Comma-separated list of Tax IDs used to build the "taxids" database of UPIMAPI. It should contain all the Tax IDs of the taxonomies present in the datasets. |
| upimapi_check_db | boolean | true if UniProtKB (SwissProt + TrEMBL) is to be downloaded, false if it is already downloaded. It is downloaded to the folder indicated by the "resources_directory" parameter. |
| uniprot_columns | list | Columns for which to obtain information through UniProt's ID mapping service. Options are available here. |
| download_cdd_resources | boolean | true if reCOGnizer should download resources, false otherwise. |
| recognizer_databases | list | Databases from which to obtain information with reCOGnizer. |
| normalization_method | string | Method to use for normalization: TMM (Trimmed Mean of the M-values), RLE (Relative Log Expression) or VSN (Variance Stabilizing Normalization). |
| minimum_differential_expression | float | DESeq2 calculates the significance of a differential comparison against a specific ratio. This parameter sets the ratio used for the null hypothesis. |
| metaproteomics_add_reference_proteomes | boolean | true if reference proteomes of identified taxa should be added to the MP database, false otherwise. |
| reference_proteomes_taxa_level | string | Taxonomic level at which to download reference proteomes for the MP database. One of SUPERKINGDOM, PHYLUM, CLASS, ORDER, FAMILY, GENUS, SPECIES. |
| use_crap | boolean | true if cRAP should be added to the MP database, false otherwise. |
| proteomics_contaminants_database | string | Contaminants database to use instead of cRAP. |
| protease | string | If set to Trypsin, MOSCA considers pig's trypsin as the protease and adds its sequence to the MP database. |
| protease_file | string | If not using Trypsin as the protease, the path to the FASTA file of the protease. |
| false_discovery_rate | float | Local FDR to consider when selecting PSMs in protein inference. |
| keggcharter_taxa_level | string | The taxonomic level to represent with KEGGCharter. One of SPECIES, GENUS, FAMILY, ORDER, CLASS, PHYLUM or SUPERKINGDOM. |
| keggcharter_number_of_taxa | int | Number of the most abundant taxa to represent with KEGGCharter, ideally below 11. |
| keggcharter_maps | list | IDs of the metabolic maps on which to chart information. Maps are generated with genomic potential represented if MG data is available, and with gene/protein expression represented if MT/MP data is available. |
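
Because MOSCA expects every parameter to be present, it can help to check a configuration before starting a run. The sketch below is not part of MOSCA or MOSGUITO: it only compares a loaded JSON file against the parameter names and enumerated values listed in the table above. The function names are hypothetical, `protease_file` is assumed to be the underscore spelling of that parameter, and the markerset values are assumed to be stored as strings, following the type given in the table.

```python
import json

# Parameter names taken from the table above ("protease_file" spelling is assumed).
EXPECTED_PARAMETERS = [
    "version", "output", "resources_directory", "experiments", "threads",
    "minimum_read_length", "minimum_read_average_quality", "max_memory",
    "do_assembly", "assembler", "error_model", "markerset", "do_binning",
    "do_iterative_binning", "split_gene_calling", "upimapi_database",
    "upimapi_max_target_seqs", "upimapi_taxids", "upimapi_check_db",
    "uniprot_columns", "download_cdd_resources", "recognizer_databases",
    "normalization_method", "minimum_differential_expression",
    "metaproteomics_add_reference_proteomes", "reference_proteomes_taxa_level",
    "use_crap", "proteomics_contaminants_database", "protease", "protease_file",
    "false_discovery_rate", "keggcharter_taxa_level",
    "keggcharter_number_of_taxa", "keggcharter_maps",
]

TAXA_LEVELS = ("SUPERKINGDOM", "PHYLUM", "CLASS", "ORDER", "FAMILY", "GENUS", "SPECIES")

# Only parameters whose possible values are enumerated in the table are checked.
ALLOWED_VALUES = {
    "assembler": ("metaspades", "megahit"),
    "error_model": ("sanger_5", "sanger_10", "454_10", "454_30",
                    "illumina_5", "illumina_10", "complete"),
    "markerset": ("40", "107"),  # assumed to be stored as strings
    "normalization_method": ("TMM", "RLE", "VSN"),
    "reference_proteomes_taxa_level": TAXA_LEVELS,
    "keggcharter_taxa_level": TAXA_LEVELS,
}


def check_config(config):
    """Return a list of problems: missing parameters and unexpected values."""
    problems = [f"missing parameter: {name}"
                for name in EXPECTED_PARAMETERS if name not in config]
    for name, allowed in ALLOWED_VALUES.items():
        if name in config and config[name] not in allowed:
            problems.append(f"{name}: unexpected value {config[name]!r}")
    return problems


def check_config_file(path):
    """Load a JSON configuration file and run check_config on its contents."""
    with open(path, encoding="utf-8") as handle:
        return check_config(json.load(handle))
```

An empty list from `check_config_file("config.json")` (the file name is just an example) means that every listed parameter is present and the enumerated ones hold one of the documented values. It does not guarantee that MOSCA will accept the file.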