Configure and run MOSCA
MOSCA is configured through a file written in JSON format. This file specifies the input data and the analysis parameters of the pipeline.
To obtain the configuration file, you can visit MOSGUITO and follow these steps:
- Go to "Configurations" -> "General configurations" and select the general parameters of analysis. Default values are already in place, and might not need to be changed.
- In the next three tabs ("UniProt columns", "UniProt databases" and "KEGG metabolic maps"), select the columns and databases for which to obtain information from UniProt, and the metabolic maps on which to represent genomic potential and gene expression.
- Set, in the "Experiments" tab, the information concerning your datasets:
  - "Files" - names of the files on the system where MOSCA will be run. Names can include folders and can contain spaces. For short reads in paired-end format, enter both filenames separated by a comma (",");
  - "Sample" - files with the same "Sample" value are considered to belong to similar communities, and will be assembled and binned together. This can improve the quality of assembly/binning, but is not advised if the communities are too different;
  - "Data type" - "dna" (metagenomics, Whole Genome Shotgun short reads), "mrna" (metatranscriptomics, RNA-Seq short reads) or "protein" (metaproteomics, LC-MS raw spectra);
  - "Condition" - determines replicates for differential expression analysis: datasets with the same condition are considered replicates of each other. It only needs to be set for "mrna" and "protein" datasets;
  - "Name" - the name under which results for the dataset are output. If left blank, MOSCA will determine a name automatically.
- Save the file to your local machine by going back to "General configurations" and clicking "Download JSON".
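
The "Experiments" fields described above can also be sketched in code. The following Python snippet is a minimal illustration, not MOSGUITO's implementation: the helper `make_experiment`, the file names and the list-of-dictionaries layout are assumptions made for the example, and the JSON downloaded from MOSGUITO remains the reference for the actual structure.

```python
import json

# Values accepted in the "Data type" field, as listed above.
DATA_TYPES = ("dna", "mrna", "protein")


def make_experiment(files, sample, data_type, condition="", name=""):
    """Build one "Experiments" row as a dictionary (assumed layout)."""
    if data_type not in DATA_TYPES:
        raise ValueError(f"unknown data type: {data_type!r}")
    # Paired-end reads go into a single "Files" value, separated by a comma.
    if not isinstance(files, str):
        files = ",".join(files)
    return {
        "Files": files,
        "Sample": sample,
        "Data type": data_type,
        "Condition": condition,
        "Name": name,
    }


# Hypothetical file names, for illustration only.
experiments = [
    # Metagenomics dataset: no condition needed.
    make_experiment(["mg_R1.fastq", "mg_R2.fastq"], "reactor", "dna"),
    # Two metatranscriptomics replicates: same sample, same condition.
    make_experiment(["mt1_R1.fastq", "mt1_R2.fastq"], "reactor", "mrna", "control"),
    make_experiment(["mt2_R1.fastq", "mt2_R2.fastq"], "reactor", "mrna", "control"),
]

print(json.dumps(experiments, indent=2))
```

Here the three datasets share the same "Sample" value, so they would be assembled and binned together, and the two "mrna" datasets share a "Condition", so they would be treated as replicates. "Name" is left blank so that MOSCA determines it.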
The table below lists all the parameters available for configuration in MOSCA. Not all of them need to be configured (proteomics parameters have no effect when no proteomics data is input), but MOSCA checks for every one of them, so they must all be kept in the file.
Parameter | Type | Description |
---|---|---|
version | string | Minimum version of MOSCA to use the config file. This information is inputted automatically by MOSGUITO. |
output | string | Name of folder where MOSCA's results will be stored (if it doesn't exist, it will be created). |
resources_directory | string | Name of folder to store databases and other resources for MOSCA. |
experiments | Dictionary | With the fields "Files", "Sample", "Data type", "Condition" and "Name", as per the file in the repo. |
threads | int | Maximum number of threads for MOSCA to use. |
minimum_read_length | int | Minimum length of reads to keep. |
minimum_read_average_quality | int | Minimum average quality of reads to keep. |
max_memory | int | Maximum memory that MOSCA can use (in GB). |
do_assembly | boolean | true if MOSCA should do assembly, false otherwise. |
assembler | string | Name of assembler to use for iterative co-assembly of MG data. Possible values: metaspades, megahit. |
error_model | string | Name of the error model for gene calling with FragGeneScan. If Sanger, pyrosequencing (454) or Illumina reads are the input to gene calling, possible values are: sanger_5, sanger_10, 454_10, 454_30, illumina_5, illumina_10. The term before the _ denotes the sequencing technology, and the term after denotes the expected sequencing error rate in tenths of a percent (sanger_5 corresponds to about 0.5%). Set to complete if assembly was performed. |
markerset | string | Name of markerset to use for completeness/contamination estimation with CheckM over the contigs obtained with MaxBin2. 40 if Archaea are significantly present in the community, 107 if the study only concerns Bacteria. |
do_binning | boolean | true if binning is to be performed, false otherwise. |
do_iterative_binning | boolean | true if iterative binning is to be performed, false otherwise. |
split_gene_calling | int | Number of parts into which the result of gene calling is split. This saves disk space while DIAMOND is running, and is only needed if the disk runs out of space. |
upimapi_database | string | Name of the FASTA or DMND (DIAMOND formatted database) file to use as input for annotation with DIAMOND. If the parameter for downloading UniProt is true, MOSCA will use the downloaded UniProt database instead. |
upimapi_max_target_seqs | int | Number of matches to report for each protein from annotation with DIAMOND. |
upimapi_taxids | string | Comma-separated list of Tax IDs used to build the "taxids" database of UPIMAPI. It should contain the Tax IDs of all taxa present in the datasets. |
upimapi_check_db | boolean | true if UniProtKB (SwissProt + TrEMBL) is to be downloaded, false if it is already downloaded. It is downloaded to the folder indicated by the "resources_directory" parameter. |
uniprot_columns | list | Columns for which to obtain information through UniProt's ID mapping service. Options are available here. |
download_cdd_resources | boolean | true if reCOGnizer should download resources, false otherwise. |
recognizer_databases | list | Databases to obtain information from with reCOGnizer. |
normalization_method | string | Method to use for normalization: TMM (Trimmed Mean of the M-values), RLE (Relative Log Expression) or VSN (Variance Stabilizing Normalization). |
minimum_differential_expression | float | DESeq2 tests the significance of differential expression against a specific ratio. This sets the ratio used in the null hypothesis. |
metaproteomics_add_reference_proteomes | boolean | true if reference proteomes of identified taxa should be added to the MP database, false otherwise. |
reference_proteomes_taxa_level | string | Taxonomic level at which reference proteomes are downloaded for the MP database. One of SUPERKINGDOM, PHYLUM, CLASS, ORDER, FAMILY, GENUS, SPECIES. |
use_crap | boolean | true if cRAP should be added to the MP database, false otherwise. |
proteomics_contaminants_database | string | Contaminants database to use instead of cRAP. |
protease | string | If Trypsin, porcine trypsin is considered as the protease, and its sequence is added to the MP database. |
protease file | string | If not using Trypsin as protease, specify the path to the FASTA file of the protease. |
false_discovery_rate | float | Local FDR threshold to apply when selecting PSMs for protein inference. |
keggcharter_taxa_level | string | The taxonomic level to represent with KEGGCharter. One of SPECIES, GENUS, FAMILY, ORDER, CLASS, PHYLUM or SUPERKINGDOM. |
keggcharter_number_of_taxa | int | Number of the most abundant taxa to represent with KEGGCharter, ideally fewer than 11. |
keggcharter_maps | list | IDs of the metabolic maps on which to chart information. Maps are generated with genomic potential represented if MG data is available, and with gene/protein expression if MT/MP data is available. |
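
Because every parameter must be kept in the file, it can be useful to check a configuration before launching a run. The following Python sketch is a hypothetical pre-flight check, not MOSCA's own validation: the function names are made up for the example, the parameter names are copied from the table above, and only the value sets that the table states explicitly are checked.

```python
import json

# Parameter names copied from the table above ("protease file" is spelled
# exactly as it appears there).
REQUIRED_PARAMETERS = [
    "version", "output", "resources_directory", "experiments", "threads",
    "minimum_read_length", "minimum_read_average_quality", "max_memory",
    "do_assembly", "assembler", "error_model", "markerset", "do_binning",
    "do_iterative_binning", "split_gene_calling", "upimapi_database",
    "upimapi_max_target_seqs", "upimapi_taxids", "upimapi_check_db",
    "uniprot_columns", "download_cdd_resources", "recognizer_databases",
    "normalization_method", "minimum_differential_expression",
    "metaproteomics_add_reference_proteomes",
    "reference_proteomes_taxa_level", "use_crap",
    "proteomics_contaminants_database", "protease", "protease file",
    "false_discovery_rate", "keggcharter_taxa_level",
    "keggcharter_number_of_taxa", "keggcharter_maps",
]

TAXA_LEVELS = ("SUPERKINGDOM", "PHYLUM", "CLASS", "ORDER", "FAMILY",
               "GENUS", "SPECIES")

# Closed value sets stated in the table; other parameters are not checked.
ALLOWED_VALUES = {
    "assembler": ("metaspades", "megahit"),
    "error_model": ("sanger_5", "sanger_10", "454_10", "454_30",
                    "illumina_5", "illumina_10", "complete"),
    "normalization_method": ("TMM", "RLE", "VSN"),
    "reference_proteomes_taxa_level": TAXA_LEVELS,
    "keggcharter_taxa_level": TAXA_LEVELS,
}


def missing_parameters(config):
    """Return the names of table parameters absent from the config."""
    return [name for name in REQUIRED_PARAMETERS if name not in config]


def invalid_values(config):
    """Return the names of parameters whose value is outside its stated set."""
    return [name for name, allowed in ALLOWED_VALUES.items()
            if name in config and config[name] not in allowed]


def check_config_file(path):
    """Load a JSON config and report missing parameters and invalid values."""
    with open(path) as handle:
        config = json.load(handle)
    return missing_parameters(config), invalid_values(config)
```

Calling `check_config_file` on the path of the downloaded JSON returns two lists. Both are empty when every parameter of the table is present and the checked values are among those listed.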