Data Description

Genome assembly

genome.fa

The human genome assembly hg19 (GRCh37) from GenBank, chromosome 22 only.

RNA-seq reads

ENCSR000COQ[12]_[12].fastq.gz

The RNA-seq data come from the human GM12878 cell line, from whole-cell, cytosolic and nuclear extractions (see table below).

The libraries are stranded PE76 Illumina GAIIx RNA-Seq from rRNA-depleted Poly-A+ long RNA (> 200 nucleotides in size).

Only reads mapped to the 22q11 locus of the human genome (chr22:16000000-18000000) are used.
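The locus restriction can be expressed as a simple coordinate check. The sketch below is illustrative only: the source does not say how the reads were actually filtered, and the names `Region`, `parse_region` and `in_region` are hypothetical. It assumes 1-based, inclusive coordinates, as in samtools-style region strings.

```python
from typing import NamedTuple


class Region(NamedTuple):
    chrom: str
    start: int  # assumed 1-based, inclusive
    end: int    # assumed 1-based, inclusive


def parse_region(text: str) -> Region:
    """Parse a 'chrom:start-end' string such as 'chr22:16000000-18000000'."""
    chrom, _, span = text.partition(":")
    start, _, end = span.partition("-")
    return Region(chrom, int(start), int(end))


def in_region(region: Region, chrom: str, pos: int) -> bool:
    """Return True if a 1-based mapping position lies inside the region."""
    return chrom == region.chrom and region.start <= pos <= region.end


locus = parse_region("chr22:16000000-18000000")
print(in_region(locus, "chr22", 17_500_000))  # inside the locus
print(in_region(locus, "chr22", 18_000_001))  # one base past the end
```

A read would be kept when its mapping position passes `in_region`; in practice an indexed BAM query does the same job far more efficiently.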

ENCODE ID     Cellular fraction   Replicate ID   File names
ENCSR000COQ   Whole Cell          1              ENCSR000COQ1_1.fastq.gz, ENCSR000COQ1_2.fastq.gz
ENCSR000COQ   Whole Cell          2              ENCSR000COQ2_1.fastq.gz, ENCSR000COQ2_2.fastq.gz
ENCSR000CPO   Nuclear             1              ENCSR000CPO1_1.fastq.gz, ENCSR000CPO1_2.fastq.gz
ENCSR000CPO   Nuclear             2              ENCSR000CPO2_1.fastq.gz, ENCSR000CPO2_2.fastq.gz
ENCSR000COR   Cytosolic           1              ENCSR000COR1_1.fastq.gz, ENCSR000COR1_2.fastq.gz
ENCSR000COR   Cytosolic           2              ENCSR000COR2_1.fastq.gz, ENCSR000COR2_2.fastq.gz

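
The file names follow the pattern ENCODE ID, replicate ID, underscore, mate number. As a minimal sketch, assuming that convention holds for every read file, the paired-end files can be grouped by sample and replicate as follows. The regular expression and the `group_pairs` helper are hypothetical and not part of the dataset or any pipeline.

```python
import re
from collections import defaultdict

# Assumed pattern: <ENCODE ID><replicate>_<mate>.fastq.gz
FASTQ_NAME = re.compile(r"^(ENCSR\d{3}[A-Z]{3})([12])_([12])\.fastq\.gz$")


def group_pairs(file_names):
    """Map (ENCODE ID, replicate) to a [mate 1, mate 2] list of file names."""
    pairs = defaultdict(dict)
    for name in file_names:
        match = FASTQ_NAME.match(name)
        if match is None:
            continue  # not a read file following the convention
        encode_id, replicate, mate = match.groups()
        pairs[(encode_id, int(replicate))][int(mate)] = name
    # Keep only complete pairs, with the mates in order.
    return {
        key: [mates[1], mates[2]]
        for key, mates in sorted(pairs.items())
        if set(mates) == {1, 2}
    }


names = [
    "ENCSR000COQ1_1.fastq.gz", "ENCSR000COQ1_2.fastq.gz",
    "ENCSR000COQ2_1.fastq.gz", "ENCSR000COQ2_2.fastq.gz",
    "genome.fa",
]
for (encode_id, replicate), mates in group_pairs(names).items():
    print(encode_id, replicate, mates)
```

Incomplete pairs are dropped rather than reported, which keeps the sketch short; a real workflow would probably want to flag them.
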
"Known" variants

known_variants.vcf.gz

Known variants come from high-confidence variant calls for GM12878 from the Illumina Platinum Genomes project. These calls were obtained by taking into account pedigree information and the concordance of calls across different methods.

Only the subset from chromosome 22 is used.
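
Restricting a VCF to one chromosome amounts to keeping the header lines and the records whose first column matches. The sketch below shows the idea with the standard library only; it is not how known_variants.vcf.gz was produced. The function names are hypothetical, and the chromosome label `chr22` is an assumption (some hg19 files use `22` instead).

```python
import gzip


def keep_chromosome(lines, chrom="chr22"):
    """Yield VCF header lines plus the records whose CHROM column equals chrom."""
    for line in lines:
        if line.startswith("#") or line.split("\t", 1)[0] == chrom:
            yield line


def subset_vcf(src_path, dst_path, chrom="chr22"):
    """Write a chromosome-restricted copy of a gzip-compressed VCF."""
    with gzip.open(src_path, "rt") as src, gzip.open(dst_path, "wt") as dst:
        dst.writelines(keep_chromosome(src, chrom))
```

Note that `gzip` writes ordinary gzip rather than BGZF, so the output would need recompressing with bgzip before it could be indexed with tabix.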

Blacklisted regions

blacklist.bed

Blacklisted regions are regions of the genome with anomalous coverage. We use the regions for the hg19 assembly, taken from the ENCODE project portal. These regions were identified with DNase and ChIP-seq samples from ~60 human tissues/cell types, and show a very high ratio of multi-mapping to unique-mapping reads and high variance in mappability.
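
To illustrate how such a file might be applied, the sketch below checks whether a 1-based position (as used in VCF) falls inside a BED interval, which is 0-based and half-open. The helper names are hypothetical, the linear scan is chosen for clarity rather than speed, and the interval in the example is made up, not taken from blacklist.bed.

```python
def read_bed(lines):
    """Parse BED lines into {chrom: sorted list of (start, end)} (0-based, half-open)."""
    regions = {}
    for line in lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue  # skip blank lines, comments and UCSC header lines
        chrom, start, end = line.split()[:3]
        regions.setdefault(chrom, []).append((int(start), int(end)))
    for intervals in regions.values():
        intervals.sort()
    return regions


def is_blacklisted(regions, chrom, pos):
    """Return True if a 1-based position falls inside any interval for chrom."""
    zero_based = pos - 1  # convert to the BED coordinate system
    return any(start <= zero_based < end for start, end in regions.get(chrom, []))


# Made-up interval for illustration only.
example = read_bed(["chr22\t100\t200\n"])
print(is_blacklisted(example, "chr22", 101))  # first base covered by the interval
print(is_blacklisted(example, "chr22", 201))  # first base after the interval
```

Variants or reads for which `is_blacklisted` returns True would be excluded from downstream analysis.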