Input file formats
cMonkey uses a number of input files to support its built-in scoring functions.
Note: cMonkey-Python uses a cache directory for data files that are downloaded from the web. You can use this mechanism to supply your own data, as long as it is in the format specified below.
The default path is cache; you can specify a different location by providing the --cachedir option on the command line.
The most basic and mandatory input is gene expression data. This should be specified as a tab-delimited text file, with the following specifications for a matrix with n genes and m conditions:
- first row: 1 dummy label + m condition titles
- n rows, each starting with a gene name followed by m expression ratio values
The expression file can either be uncompressed or in gzip format (in this case, it should have a .gz suffix).
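As an illustration of the layout described above, the following is a minimal sketch of a reader for such a matrix. It is not cMonkey's own loader: the function names, the sample gene and condition names, and the strict handling of row lengths are assumptions made for this example, and missing values are not handled.

```python
import csv
import gzip
import io


def open_maybe_gzip(path):
    """Open a text file, reading through gzip when the name ends in .gz."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt", newline="")
    return open(path, "r", newline="")


def read_expression_matrix(handle):
    """Parse a tab-delimited matrix: one header row, then one row per gene."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    conditions = header[1:]  # column 0 of the header is the dummy label
    genes, values = [], []
    for row in reader:
        if not row:
            continue  # skip blank lines
        if len(row) != len(conditions) + 1:
            raise ValueError(
                f"gene {row[0]!r}: expected {len(conditions)} values, "
                f"got {len(row) - 1}"
            )
        genes.append(row[0])
        values.append([float(v) for v in row[1:]])
    return conditions, genes, values


# Hypothetical 2-gene, 2-condition matrix used only for demonstration.
sample = "GENE\tcond1\tcond2\ngeneA\t0.12\t-1.5\ngeneB\t1.0\t0.25\n"
conditions, genes, values = read_expression_matrix(io.StringIO(sample))
```

For a file on disk, the same parser can be used as `read_expression_matrix(open_maybe_gzip(path))`, where `path` is your own expression file.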
The RSAT database is central to cMonkey's automatic retrieval of organism information. It is used for several purposes:
- organism classification (eukaryote/prokaryote, NCBI taxonomy mapping)
- gene synonyms
- genomic information
The user can specify the RSAT organism explicitly via the --rsat_organism parameter on the command line; otherwise, cMonkey will attempt to derive the name from the mandatory KEGG organism code.
The contents of this file are only used to look for the occurrence of the word "Eukaryota". If the word is found, the organism is marked as a eukaryote; otherwise, it is assumed to be a prokaryote.
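The check described here amounts to a substring search. A minimal sketch, assuming a case-sensitive match and using a hypothetical function name (cMonkey's actual code may differ):

```python
def is_eukaryote(file_contents):
    """Return True if the text mentions "Eukaryota", False otherwise."""
    return "Eukaryota" in file_contents


# Hypothetical file contents used only for demonstration.
eukaryote_flag = is_eukaryote("cellular organisms; Eukaryota; Fungi")
prokaryote_flag = is_eukaryote("cellular organisms; Archaea")
```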
cMonkey uses this file to determine the NCBI taxonomy id of the organism. A user-defined names file should have the name rsatnames_<rsat organism name>.
Only one line is needed in the format
<NCBI code><TAB><RSAT organism name>
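To make the naming rule and the one-line format concrete, here is a small sketch that builds the file name and parses the line. The function names are hypothetical, and the organism name and code in the example are placeholders, not real RSAT or NCBI values.

```python
def rsatnames_filename(rsat_organism):
    """Build the names-file name for a given RSAT organism name."""
    return "rsatnames_" + rsat_organism


def parse_rsatnames_line(line):
    """Split '<NCBI code><TAB><RSAT organism name>' into its two fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 2:
        raise ValueError(f"expected 2 tab-separated fields, got {len(fields)}")
    ncbi_code, rsat_organism = fields
    return ncbi_code, rsat_organism


# Placeholder values used only for demonstration.
filename = rsatnames_filename("Example_organism")
ncbi_code, organism = parse_rsatnames_line("12345\tExample_organism\n")
```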
This file is used by cMonkey to create the synonym table for alternative gene names. It should have the name <RSAT organism name>_feature_names, and each line has the format
<Accession ID><TAB><name><TAB><primary|alternate>
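One way such lines could be turned into a synonym table is sketched below. This is an assumption-laden illustration, not cMonkey's implementation: it assumes all three fields are tab-separated, it assumes the table maps every name to the primary name of the same accession ID, and the function name and sample entries are hypothetical.

```python
from collections import defaultdict


def build_synonym_table(lines):
    """Map each gene name to the primary name sharing its accession ID."""
    primary_by_id = {}
    names_by_id = defaultdict(list)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3:
            continue  # skip lines that do not match the assumed 3-column layout
        accession, name, kind = fields
        names_by_id[accession].append(name)
        if kind == "primary":
            primary_by_id[accession] = name

    synonyms = {}
    for accession, names in names_by_id.items():
        # Fall back to the accession ID when no primary name was listed.
        target = primary_by_id.get(accession, accession)
        for name in names:
            synonyms[name] = target
    return synonyms


# Hypothetical entries used only for demonstration.
sample = [
    "ACC1\tgeneA\tprimary\n",
    "ACC1\tgA\talternate\n",
    "ACC2\tgeneB\tprimary\n",
]
table = build_synonym_table(sample)
```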