Wei-ju Wu edited this page Sep 10, 2013 · 15 revisions

cMonkey uses a number of input files to support its built-in scoring functions.

Note: cMonkey-Python uses a cache directory for data files that are downloaded from the web. You can also use this mechanism to supply your own data, as long as it is in the formats specified below.

The cache path defaults to cache. You can specify a different location with the --cachedir option on the command line.

Gene expressions (mandatory)

The most basic and mandatory input is gene expression data. This should be specified as a tab-delimited text file, with the following specifications for a matrix with n genes and m conditions:

  • first row: 1 dummy label + m condition titles
  • n rows, where each row starts with a gene name followed by m expression ratio values

The expression file can either be uncompressed or in gzip format (in this case, it should have a .gz suffix).

Example:

  X<TAB>Cond 1<TAB>Cond 2
  Gene1<TAB>1.213<TAB>-1.412
  ...
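The layout described above can be read with a few lines of Python. This is only a minimal sketch, not cMonkey's own loader: the function name read_expression_matrix is hypothetical, and the sketch assumes every value parses as a float (missing-value markers are not handled).

```python
import csv
import gzip


def read_expression_matrix(path):
    """Read a tab-delimited expression file (hypothetical helper).

    Files whose name ends in .gz are opened with gzip, all others as plain text.
    Returns (conditions, genes, values), where values[i][j] is the ratio
    of gene i under condition j.
    """
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        header = next(reader)
        conditions = header[1:]  # the first column holds the dummy label
        genes, values = [], []
        for row in reader:
            if not row:
                continue  # tolerate blank lines
            if len(row) != len(conditions) + 1:
                raise ValueError("row for %r has %d values, expected %d"
                                 % (row[0], len(row) - 1, len(conditions)))
            genes.append(row[0])
            values.append([float(v) for v in row[1:]])
    return conditions, genes, values
```

Rejecting rows with the wrong number of columns early tends to surface formatting problems before a long run starts.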

RSAT information

The RSAT database is central to cMonkey's automatic retrieval of organism information. It is used for several purposes:

  • organism classification (eukaryote/prokaryote, NCBI taxonomy mapping)
  • gene synonyms
  • genomic information

The user can specify the RSAT organism explicitly via the --rsat_organism parameter on the command line. Otherwise, cMonkey attempts to derive the name from the mandatory KEGG organism code.

Example: TODO

RSAT organism file

The contents of this file are only searched for the word "Eukaryota". If the word occurs, the organism is marked as a eukaryote; otherwise, it is assumed to be a prokaryote.
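The check described above amounts to a substring test. A minimal sketch, assuming the file contents have already been read into a string and that the match is case-sensitive (the function name is_eukaryote is hypothetical, not taken from the cMonkey code):

```python
def is_eukaryote(organism_text):
    """Return True if the organism file text mentions "Eukaryota".

    Anything else is treated as a prokaryote, mirroring the rule above.
    The case-sensitive match is an assumption of this sketch.
    """
    return "Eukaryota" in organism_text
```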

Example: TODO

RSAT names file

cMonkey uses this file to determine the NCBI taxonomy id of the organism. A user-defined names file should have the name rsatnames_<rsat organism name>. Only one line is needed, in the format

<NCBI code><TAB><RSAT organism name>
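A names file in this format could be looked up roughly as follows. This is a sketch under stated assumptions: the helper names names_filename and find_taxonomy_id are hypothetical, and the organism name and code in the usage note are made-up placeholders rather than real RSAT or NCBI values.

```python
def names_filename(rsat_organism):
    """Build the expected file name: rsatnames_<rsat organism name>."""
    return "rsatnames_" + rsat_organism


def find_taxonomy_id(names_text, rsat_organism):
    """Return the NCBI code for rsat_organism, or None if it is not listed.

    Each line is expected to be <NCBI code><TAB><RSAT organism name>.
    """
    for line in names_text.splitlines():
        fields = line.split("\t")
        if len(fields) >= 2 and fields[1].strip() == rsat_organism:
            return fields[0].strip()
    return None
```

For instance, with the placeholder content "12345\tSome_organism", find_taxonomy_id(content, "Some_organism") yields the string "12345".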

Example: TODO

RSAT feature name files

cMonkey uses these files to create the synonym table for alternative gene names. Each file should have the name <RSAT organism name>_feature_names, and each line has the format

<Accession ID><TAB><name><TAB><primary|alternate>
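One way to turn such lines into a synonym table is to map every name to the primary name of its accession. The sketch below is an illustration only, not cMonkey's implementation. It assumes that the three fields are all tab-separated and that lines starting with "--" are comments to be skipped. The function name build_synonym_table is hypothetical.

```python
def build_synonym_table(lines):
    """Map each gene name to the primary name of its accession.

    Expects lines of the form <Accession ID><TAB><name><TAB><primary|alternate>.
    Blank lines, "--" comment lines and lines with fewer than three fields
    are skipped (assumptions of this sketch).
    """
    records = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line or line.startswith("--"):
            continue
        fields = line.split("\t")
        if len(fields) < 3:
            continue
        records.append((fields[0], fields[1], fields[2]))

    # First pass: remember the primary name of every accession.
    primary = {acc: name for acc, name, kind in records if kind == "primary"}

    # Second pass: point every name at that primary name, falling back
    # to the accession itself when no primary name was given.
    table = {}
    for acc, name, _kind in records:
        table[name] = primary.get(acc, acc)
    return table
```

Collecting the primary names before filling the table means the result does not depend on whether primary or alternate lines come first in the file.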

Example: TODO

RSAT feature files

RSAT sequence files

STRING protein-protein interactions

Microbes Online operon files
