Tutorial

javild edited this page Apr 13, 2016 · 17 revisions

Deprecated



Preliminaries

CellBase comes with a command line interface (CLI) written in Java; you will need at least Java 7 to run it. After the installation you should have a cellbase/cellbase-build/installation-dir/ directory with the following structure:

/tmp/cellbase/cellbase-build/installation-dir/
├── bin
│   ├── cosmic
│   ├── ensembl-scripts
│   ├── genome-fetcher
│   ├── obsolete
│   └── protein
├── example
├── libs
└── mongodb-scripts

To run the CLI, execute:

cd /tmp/cellbase/cellbase-build/installation-dir 
java -jar libs/cellbase-build-3.1.0.jar --help
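
Because the CLI needs at least Java 7, it can help to check the installed version before launching it. The following is a minimal sketch, not part of CellBase: the function names java_major, current_java_version and require_java_7 are hypothetical, and the parsing assumes the usual quoted version string printed by `java -version`.

```shell
# Extract the major Java version from a version string such as
# "1.7.0_80" (legacy scheme) or "11.0.2" (newer scheme).
java_major() {
  local v="$1"
  # Legacy strings start with "1."; drop that prefix so the next field is the major version.
  v="${v#1.}"
  # Keep only the text before the first dot, underscore, plus or dash.
  echo "${v%%[._+-]*}"
}

# Read the version string of the java binary found on PATH.
# `java -version` prints to stderr; the first line quotes the version.
current_java_version() {
  java -version 2>&1 | head -n 1 | cut -d'"' -f2
}

# Return non-zero (with a message on stderr) if Java is missing or older than 7.
require_java_7() {
  command -v java >/dev/null 2>&1 || { echo "java not found on PATH" >&2; return 1; }
  local major
  major="$(java_major "$(current_java_version)")"
  [ "$major" -ge 7 ] 2>/dev/null || { echo "Java 7 or later is required (found: $major)" >&2; return 1; }
}
```

After sourcing these functions, `require_java_7 && java -jar libs/cellbase-build-3.1.0.jar --help` would only start the CLI when the check passes.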

Download data sources

A Python script is provided to download all the data that can populate the CellBase database. This script is located at:

/tmp/cellbase/cellbase-build/installation-dir/bin/genome-fetcher/genome-fetcher.py

To run the script, change into /tmp/cellbase/cellbase-build/installation-dir/bin/genome-fetcher/ and launch it:

cd /tmp/cellbase/cellbase-build/installation-dir/bin/genome-fetcher/
./genome-fetcher.py --help

For example, to download the data sources for the human gene, genome sequence and variation collections, execute:

./genome-fetcher.py -s "Homo sapiens" --sequence 1 --gene 1 \
  --variation 1 -o /tmp

This will download the data files into the /tmp/homo_sapiens/ folder, with the following directory structure:

/tmp/homo_sapiens/
├── gene
│   ├── description.txt
│   ├── homo_sapiens.gtf.gz
│   ├── homo_sapiens.gtf.gz.log
│   └── xrefs.txt
├── sequence
│   ├── genome_info.json
│   ├── Homo_sapiens.GRCh38.fa.gz
│   └── Homo_sapiens.GRCh38.fa.gz.log
└── variation
    ├── attrib.txt.gz
    ├── attrib.txt.gz.log
    ├── attrib_type.txt.gz
    ├── attrib_type.txt.gz.log
    ├── motif_feature_variation.txt.gz
    ├── motif_feature_variation.txt.gz.log
    ├── phenotype_feature_attrib.txt.gz
    ├── phenotype_feature_attrib.txt.gz.log
    ├── phenotype_feature.txt.gz
    ├── phenotype_feature.txt.gz.log
    ├── phenotype.txt.gz
    ├── phenotype.txt.gz.log
    ├── seq_region.txt.gz
    ├── seq_region.txt.gz.log
    ├── source.txt.gz
    ├── source.txt.gz.log
    ├── structural_variation_feature.txt.gz
    ├── structural_variation_feature.txt.gz.log
    ├── study.txt.gz
    ├── study.txt.gz.log
    ├── transcript_variation.txt.gz
    ├── transcript_variation.txt.gz.log
    ├── variation_feature.txt.gz
    ├── variation_feature.txt.gz.log
    ├── variation_synonym.txt.gz
    ├── variation_synonym.txt.gz.log
    ├── variation.txt.gz
    └── variation.txt.gz.log
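
Before building, it may be worth confirming that the downloaded archives are complete. The sketch below does not rely on the fetcher's own .log files (their content is not described here); instead it tests every .gz file with the standard `gzip -t` integrity check. The function name check_gz_dir is hypothetical.

```shell
# Test the integrity of every .gz file below the given directory.
# Returns non-zero if gzip reports at least one corrupt or truncated archive;
# gzip itself prints the name of each failing file on stderr.
check_gz_dir() {
  local dir="$1"
  # With "-exec ... {} +", find exits non-zero when any gzip invocation fails.
  find "$dir" -type f -name '*.gz' -exec gzip -t {} +
}
```

For the download shown above, `check_gz_dir /tmp/homo_sapiens` would exit with status 0 only if all archives pass the check.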

Building CellBase

Once the data has been downloaded, we can build the data models for MongoDB by running the CLI. For example, to build the genome sequence collection, execute:

cd /tmp/cellbase/cellbase-build/installation-dir/
java -jar libs/cellbase-build-3.1.0.jar --build genome-sequence \
  --fasta-file /tmp/homo_sapiens/sequence/Homo_sapiens.GRCh38.fa.gz -o /tmp/

To build the gene collection:

java -jar libs/cellbase-build-3.1.0.jar --build gene \
  --indir /tmp/homo_sapiens/gene \
  --fasta-file /tmp/homo_sapiens/sequence/Homo_sapiens.GRCh38.fa.gz -o /tmp/

To build the variation collections:

java -jar libs/cellbase-build-3.1.0.jar --build variation \
  --indir /tmp/homo_sapiens/variation -o /tmp/

Each of these commands writes JSON files to /tmp, e.g.:

/tmp/genome_sequence.json
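
The three build commands can also be chained so that a failure stops the run early. This is only a sketch using the options shown in this tutorial; the variable names DATA_DIR, OUT_DIR, JAR and RUN and the function build_all are illustrative, not CellBase settings.

```shell
# Illustrative defaults taken from the paths used in this tutorial.
DATA_DIR="${DATA_DIR:-/tmp/homo_sapiens}"
OUT_DIR="${OUT_DIR:-/tmp/}"
JAR="${JAR:-libs/cellbase-build-3.1.0.jar}"
RUN="${RUN:-}"   # set RUN=echo to print the commands instead of executing them

# Build genome sequence, gene and variation in order; "&&" stops at the first failure.
# Call this from the installation directory so the relative JAR path resolves.
build_all() {
  local fasta="$DATA_DIR/sequence/Homo_sapiens.GRCh38.fa.gz"
  $RUN java -jar "$JAR" --build genome-sequence --fasta-file "$fasta" -o "$OUT_DIR" &&
  $RUN java -jar "$JAR" --build gene --indir "$DATA_DIR/gene" --fasta-file "$fasta" -o "$OUT_DIR" &&
  $RUN java -jar "$JAR" --build variation --indir "$DATA_DIR/variation" -o "$OUT_DIR"
}
```

Running `RUN=echo build_all` first is a cheap way to review the exact command lines before starting the real builds.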

Installing the database

MongoDB 2.6 or later is required to load the JSON files created in the previous step. MongoDB databases and collections can be loaded with the mongoimport command, e.g.:

mongoimport --file /tmp/genome_sequence.json -d hsapiens_cb_v3 \
  -c genome_sequence
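
When several JSON files have to be loaded, the same mongoimport call can be wrapped in a loop. The sketch below assumes that each file goes into a collection named after the file without its .json extension, as in the genome_sequence example; that naming rule is an assumption and should be checked for the other collections. The names collection_for, import_json and RUN are hypothetical.

```shell
RUN="${RUN:-}"   # set RUN=echo to print the commands instead of executing them

# Derive a collection name from a JSON file path (assumed rule: file name minus ".json").
collection_for() {
  basename "$1" .json
}

# Import each given JSON file into the named database, stopping at the first failure.
import_json() {
  local db="$1" f
  shift
  for f in "$@"; do
    $RUN mongoimport --file "$f" -d "$db" -c "$(collection_for "$f")" || return 1
  done
}
```

For example, `import_json hsapiens_cb_v3 /tmp/genome_sequence.json` issues the same command as the one shown above.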