Wei-ju Wu edited this page Sep 10, 2013 · 15 revisions

cMonkey uses a number of input files to support its built-in scoring functions.

Note: cMonkey-Python uses a cache directory for data files that are downloaded from the web. This mechanism can also be used to supply your own data, as long as it is in the format specified below.

By default, the cache path is cache. You can specify a different location with the --cachedir option on the command line.

Gene expressions (mandatory)

The most basic and mandatory input is gene expression data. This should be specified as a tab-delimited text file, with the following specifications for a matrix with n genes and m conditions:

  • first row: 1 dummy label + m condition titles
  • n rows, each starting with a gene name followed by m expression ratio values

The expression file can either be uncompressed or in gzip format (in this case, it should have a .gz suffix).

Example:

  X<TAB>Cond 1<TAB>Cond 2
  Gene1<TAB>1.213<TAB>-1.412
  ...
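The layout above can be read with a few lines of Python. This is a minimal sketch, not cMonkey's own loader: the function name read_expression_matrix is hypothetical, and it assumes every value parses as a float (missing-value markers such as NA are not handled).

```python
import csv
import gzip


def read_expression_matrix(path):
    """Read a tab-delimited expression file (hypothetical helper, not cMonkey's API).

    Returns (conditions, genes, values), where values[i][j] is the
    expression ratio of gene i under condition j.
    """
    # A .gz suffix is taken to mean gzip compression, as described above.
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", newline="") as infile:
        reader = csv.reader(infile, delimiter="\t")
        header = next(reader)
        conditions = header[1:]  # column 0 of the first row is the dummy label
        genes, values = [], []
        for row in reader:
            if not row:
                continue  # tolerate blank lines
            if len(row) != len(conditions) + 1:
                raise ValueError("row for %r has %d values, expected %d"
                                 % (row[0], len(row) - 1, len(conditions)))
            genes.append(row[0])
            values.append([float(v) for v in row[1:]])
    return conditions, genes, values
```

Calling read_expression_matrix("ratios.tsv.gz") on a file with n genes and m conditions would return m condition titles, n gene names, and an n x m list of floats.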

RSAT information

The RSAT database is central to cMonkey's automatic retrieval of organism information. It is used for several purposes:

  • organism classification (eukaryote/prokaryote, NCBI taxonomy mapping)
  • gene synonyms
  • genomic information

In general, RSAT data files can contain any number of comments that start with "--".

The user can specify the RSAT organism explicitly via the --rsat_organism parameter on the command line. Otherwise, cMonkey will attempt to derive the name from the mandatory KEGG organism code.

RSAT organism file

The contents of this file are only used to look for the word "Eukaryota". If it occurs, the organism is marked as a eukaryote; otherwise, it is assumed to be a prokaryote.

Example (file name "rsat_Halobacterium_sp"):

64091<TAB>Archaea; Euryarchaeota; Halobacteria; Halobacteriales; Halobacteriaceae; Halobacterium
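The check described above amounts to a substring test. The following sketch illustrates it on the example line; the function name is_eukaryote is hypothetical and is not taken from the cMonkey source.

```python
def is_eukaryote(organism_text):
    """Return True if the RSAT organism file text contains the word 'Eukaryota'."""
    return "Eukaryota" in organism_text


# The example line from the Halobacterium organism file.
halo = ("64091\tArchaea; Euryarchaeota; Halobacteria; Halobacteriales; "
        "Halobacteriaceae; Halobacterium")
print(is_eukaryote(halo))  # False, so the organism is treated as a prokaryote
```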

RSAT names file

cMonkey uses this file to determine the NCBI taxonomy id of the organism. A user-defined names file should have the name rsatnames_<rsat organism name>. Only one line is needed, in the format

<NCBI code><TAB><RSAT organism name><TAB>primary

Example (file name "rsatnames_Halobacterium_sp"):

64091<TAB>Halobacterium sp. NRC-1<TAB>primary
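As an illustration, the taxonomy id can be taken from the first column of the first data line. This is a sketch under assumptions: read_taxonomy_id is a hypothetical name, the id is returned as a string, and lines starting with "--" are skipped as RSAT comments.

```python
def read_taxonomy_id(lines, comment_prefix="--"):
    """Return (NCBI taxonomy id, organism name) from the first data line."""
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip() or line.startswith(comment_prefix):
            continue  # skip blank lines and RSAT comments
        fields = line.split("\t")
        if len(fields) < 2:
            raise ValueError("expected at least 2 tab-separated fields: %r" % line)
        return fields[0], fields[1]
    raise ValueError("no data line found")


names_lines = ["-- a comment\n", "64091\tHalobacterium sp. NRC-1\tprimary\n"]
taxonomy_id, organism = read_taxonomy_id(names_lines)
```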

RSAT feature name files

These files are used by cMonkey to create the synonym table for alternative gene names. Each file should have the name <RSAT organism name>_feature_names, and each line has the format

<Accession ID><TAB><name><TAB><primary|alternate>

Example (file name "Halobacterium_sp_feature_names"):

  NP_045946.1<TAB>VNG7001<TAB>primary
  NP_045946.1<TAB>1446803<TAB>alternate
  NP_045946.1<TAB>10803548<TAB>alternate
  NP_045946.1<TAB>NP_045946.1<TAB>alternate
  ...
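One way to turn such a file into a synonym table is sketched below. It assumes that every alternate name should resolve to the primary name of the same accession ID; how cMonkey builds its table internally may differ, and build_synonyms is a hypothetical name.

```python
def build_synonyms(lines, comment_prefix="--"):
    """Map each name (primary or alternate) to the primary name of its accession ID."""
    primary = {}     # accession ID -> primary name
    alternates = []  # (accession ID, alternate name) pairs
    for line in lines:
        line = line.strip()
        if not line or line.startswith(comment_prefix):
            continue  # skip blank lines and RSAT comments
        fields = line.split("\t")
        if len(fields) < 3:
            continue  # ignore malformed lines in this sketch
        accession, name, kind = fields[:3]
        if kind == "primary":
            primary[accession] = name
        else:
            alternates.append((accession, name))
    # Primary names map to themselves; alternates map to their primary name.
    synonyms = {name: name for name in primary.values()}
    for accession, name in alternates:
        if accession in primary:
            synonyms[name] = primary[accession]
    return synonyms


feature_name_lines = [
    "NP_045946.1\tVNG7001\tprimary",
    "NP_045946.1\t1446803\talternate",
    "NP_045946.1\t10803548\talternate",
    "NP_045946.1\tNP_045946.1\talternate",
]
synonyms = build_synonyms(feature_name_lines)
```

With the example lines, looking up "1446803" or "NP_045946.1" in synonyms yields "VNG7001".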

RSAT feature files

Each line in a feature file contains a gene location for an organism. The important fields are the id, name, contig, strand, start and end position. For each contig listed in this file, there must exist a corresponding contig file.

Example (file name "Halobacterium_sp_features"):

-- id<TAB>type<TAB>name<TAB>contig<TAB>start<TAB>end<TAB>strand<TAB>description<TAB>chrom_pos<TAB>organism<TAB>GeneID
NP_045946.1<TAB>CDS<TAB>VNG7001<TAB>NC_001869.1<TAB>363<TAB>812<TAB>R<TAB>hypothetical protein<TAB>complement(363..812)<TAB>Halobacterium sp. NRC-1<TAB>1446803
NP_045947.1<TAB>CDS<TAB>VNG7002<TAB>NC_001869.1<TAB>834<TAB>1172<TAB>R<TAB>hypothetical protein<TAB>complement(834..1172)<TAB>Halobacterium sp. NRC-1<TAB>1446804
...
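The sketch below shows how the important fields could be extracted, and how to list the contigs for which a contig file is needed. The column positions are taken from the header comment in the example; the names Feature, read_features and contigs_needed are hypothetical.

```python
from collections import namedtuple

Feature = namedtuple("Feature", "id name contig start end strand")


def read_features(lines, comment_prefix="--"):
    """Parse RSAT feature lines into Feature tuples, skipping '--' comments."""
    features = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip() or line.startswith(comment_prefix):
            continue
        fields = line.split("\t")
        # Column order per the example header:
        # id, type, name, contig, start, end, strand, description, ...
        features.append(Feature(fields[0], fields[2], fields[3],
                                int(fields[4]), int(fields[5]), fields[6]))
    return features


def contigs_needed(features):
    """Return the sorted contig names that require a matching contig file."""
    return sorted({f.contig for f in features})


feature_lines = [
    "-- id\ttype\tname\tcontig\tstart\tend\tstrand\tdescription\tchrom_pos\torganism\tGeneID",
    "NP_045946.1\tCDS\tVNG7001\tNC_001869.1\t363\t812\tR\thypothetical protein"
    "\tcomplement(363..812)\tHalobacterium sp. NRC-1\t1446803",
    "NP_045947.1\tCDS\tVNG7002\tNC_001869.1\t834\t1172\tR\thypothetical protein"
    "\tcomplement(834..1172)\tHalobacterium sp. NRC-1\t1446804",
]
features = read_features(feature_lines)
```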

RSAT contig files

These files contain the raw genomic sequence in lower case for a specific contig/chromosome that is referenced in the RSAT features file.

Example (file name "Halobacterium_sp_NC002607.1"):

ttgacccactgaatcacgtctgaccgcgcgtacgcggtcacttgcggtgccgttttctttgttaccgacgaccgaccagcgacagccaccgcgcgctcactgccaccaaaagagtcatatcacagccgaccagtttctggaacgttcccgatactggaacggtcctaatgcagtatcccaccctccttccatcgacgccagtcgaatcacgccgccagccaccgtccgccagccggccagaataccgatgactcggcggtctcgtgtcggtgccggcctcgcagccattgtactggccctggccgcagtgtcggctgccgctcc
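To show how a contig file relates to the feature file, the following sketch cuts a gene's sequence out of a contig. It rests on assumptions that the text above does not state: start and end are treated as 1-based inclusive positions (as the complement(363..812) notation suggests), a strand value of "R" is taken to mean the reverse strand, and any other value is treated as the forward strand. The function name gene_sequence is hypothetical.

```python
# Complement table for lower-case sequence, as used in RSAT contig files.
_COMPLEMENT = str.maketrans("acgtn", "tgcan")


def gene_sequence(contig_seq, start, end, strand):
    """Return the sequence for positions start..end (assumed 1-based, inclusive).

    For strand "R" (assumed to mean reverse), the reverse complement is returned.
    """
    seq = contig_seq[start - 1:end]
    if strand == "R":
        seq = seq.translate(_COMPLEMENT)[::-1]
    return seq


contig = "ttgacccactgaatcacgtc"  # first 20 bases of the example above
forward = gene_sequence(contig, 1, 3, "D")
reverse = gene_sequence(contig, 1, 3, "R")
```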

STRING protein-protein interactions

Microbes Online operon files
