Tutorial
CellBase comes with a command line interface (CLI) written in Java; you need at least Java 7 to run it. After installation you should have a cellbase/cellbase-build/installation-dir/ directory with the following structure:
/tmp/cellbase/cellbase-build/installation-dir/
├── bin
│ ├── cosmic
│ ├── ensembl-scripts
│ ├── genome-fetcher
│ ├── obsolete
│ └── protein
├── example
├── libs
└── mongodb-scripts
To run the CLI, execute:
cd /tmp/cellbase/cellbase-build/installation-dir
java -jar libs/cellbase-build-3.1.0.jar --help
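Since the CLI needs at least Java 7, it can help to check the installed version before launching the jar. The following is a minimal sketch and not part of CellBase: `java_major` and `check_java` are hypothetical helper names, and the parsing assumes the usual `java -version` output, in which the version appears in double quotes (for example `1.7.0_80` for Java 7 or `11.0.2` for Java 11).

```shell
# Hypothetical helper (not shipped with CellBase): print the major Java
# version for a version string such as "1.7.0_80" or "11.0.2".
java_major() {
    _v="$1"
    case "$_v" in
        1.*) _v="${_v#1.}" ;;      # legacy scheme: "1.7.0_80" means Java 7
    esac
    _v="${_v%%[!0-9]*}"            # keep only the leading digits
    printf '%s\n' "$_v"
}

# Hypothetical helper: succeed only if the java on PATH is version 7 or later.
# Assumes `java -version` prints a line containing: version "<string>"
check_java() {
    _raw="$(java -version 2>&1 | sed -n 's/.*version "\([^"]*\)".*/\1/p' | head -n 1)"
    [ -n "$_raw" ] && [ "$(java_major "$_raw")" -ge 7 ]
}
```

After sourcing these functions, `check_java && java -jar libs/cellbase-build-3.1.0.jar --help` starts the CLI only when the version check passes.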
A Python script is provided to download all the data that may populate the CellBase database. The script is located at:
/tmp/cellbase/cellbase-build/installation-dir/bin/genome-fetcher/genome-fetcher.py
The script may be run by moving into /tmp/cellbase/cellbase-build/installation-dir/bin/genome-fetcher/ and launching it:
cd /tmp/cellbase/cellbase-build/installation-dir/bin/genome-fetcher/
./genome-fetcher.py --help
For example, to download the data sources for the human gene, genome sequence and variation collections, execute:
./genome-fetcher.py -s "Homo sapiens" --sequence 1 --gene 1 \
    --variation 1 -o /tmp
This will download the data files into the /tmp/homo_sapiens/ folder, with the following directory structure:
/tmp/homo_sapiens/
├── gene
│ ├── description.txt
│ ├── homo_sapiens.gtf.gz
│ ├── homo_sapiens.gtf.gz.log
│ └── xrefs.txt
├── sequence
│ ├── genome_info.json
│ ├── Homo_sapiens.GRCh38.fa.gz
│ └── Homo_sapiens.GRCh38.fa.gz.log
└── variation
  ├── attrib.txt.gz
  ├── attrib.txt.gz.log
  ├── attrib_type.txt.gz
  ├── attrib_type.txt.gz.log
  ├── motif_feature_variation.txt.gz
  ├── motif_feature_variation.txt.gz.log
  ├── phenotype_feature_attrib.txt.gz
  ├── phenotype_feature_attrib.txt.gz.log
  ├── phenotype_feature.txt.gz
  ├── phenotype_feature.txt.gz.log
  ├── phenotype.txt.gz
  ├── phenotype.txt.gz.log
  ├── seq_region.txt.gz
  ├── seq_region.txt.gz.log
  ├── source.txt.gz
  ├── source.txt.gz.log
  ├── structural_variation_feature.txt.gz
  ├── structural_variation_feature.txt.gz.log
  ├── study.txt.gz
  ├── study.txt.gz.log
  ├── transcript_variation.txt.gz
  ├── transcript_variation.txt.gz.log
  ├── variation_feature.txt.gz
  ├── variation_feature.txt.gz.log
  ├── variation_synonym.txt.gz
  ├── variation_synonym.txt.gz.log
  ├── variation.txt.gz
  └── variation.txt.gz.log
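The downloaded files are large gzip archives, so it may be worth confirming they are intact before building anything from them. The sketch below is an illustration and not something genome-fetcher.py provides: `check_gz_dir` is a hypothetical helper that runs `gzip -t` on every `.gz` file under a directory and reports the ones that fail.

```shell
# Hypothetical helper (not part of CellBase): test every .gz file under a
# directory with `gzip -t`. Lists failing files on stderr and returns 1 if
# any archive is corrupt, 2 if the directory does not exist, 0 otherwise.
check_gz_dir() {
    [ -d "$1" ] || return 2
    # find prints only the archives for which `gzip -t` exits non-zero
    _bad="$(find "$1" -type f -name '*.gz' ! -exec gzip -t {} \; -print 2>/dev/null)"
    if [ -n "$_bad" ]; then
        printf 'Corrupt archives:\n%s\n' "$_bad" >&2
        return 1
    fi
    return 0
}
```

With the layout above, this could be called as `check_gz_dir /tmp/homo_sapiens`; a non-zero exit status suggests that one or more downloads should be repeated.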
Once the data has been downloaded, we can build the data models for MongoDB by running the CLI. For example, to build the genome sequence collection, execute:
cd /tmp/cellbase/cellbase-build/installation-dir/
java -jar libs/cellbase-build-3.1.0.jar --build genome-sequence \
    --fasta-file /tmp/homo_sapiens/sequence/Homo_sapiens.GRCh38.fa.gz -o /tmp/
To build the gene collection:
java -jar libs/cellbase-build-3.1.0.jar --build gene \
    --indir /tmp/homo_sapiens/gene \
    --fasta-file /tmp/homo_sapiens/sequence/Homo_sapiens.GRCh38.fa.gz -o /tmp/
To build the variation collections:
java -jar libs/cellbase-build-3.1.0.jar --build variation \
    --indir /tmp/homo_sapiens/variation -o /tmp/
JSON files will be created in /tmp after each of these commands, e.g.:
/tmp/genome_sequence.json
MongoDB 2.6 or later is required to load the JSON files created in the previous step. Databases and collections can be loaded easily with the mongoimport command, e.g.:
mongoimport --file /tmp/genome_sequence.json -d hsapiens_cb_v3 \
    -c genome_sequence
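When several JSON files have been built, the same mongoimport call can be repeated in a loop. The following is a sketch under stated assumptions, not an official CellBase loader: `import_json_dir` is a hypothetical helper, and it assumes each collection is named after its file (as `genome_sequence.json` maps to `genome_sequence` above), which may not hold for every file the build produces. The `MONGOIMPORT` variable is also an invention of this sketch, used to swap in another command for a dry run.

```shell
# Hypothetical helper (not part of CellBase): import every *.json file in a
# directory into one database, using the same mongoimport flags as above.
# ASSUMPTION: the collection name equals the file name without ".json".
import_json_dir() {
    _dir="$1"
    _db="$2"
    _cmd="${MONGOIMPORT:-mongoimport}"    # e.g. MONGOIMPORT=echo for a dry run
    for _f in "$_dir"/*.json; do
        [ -e "$_f" ] || continue           # no match: the glob is left as-is
        _coll="$(basename "$_f" .json)"
        "$_cmd" --file "$_f" -d "$_db" -c "$_coll" || return 1
    done
}
```

For the example above this would be called as `import_json_dir /tmp hsapiens_cb_v3`. Running it first with `MONGOIMPORT=echo` prints the commands without touching the database. Because /tmp may contain unrelated JSON files, it is probably safer to point the build output at a dedicated directory before importing in bulk.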