Data Description

Genome assembly

genome.fa

The human genome assembly hg19 (GRCh37) from GenBank, chromosome 22 only.

RNA-seq reads

ENCSR000COQ[12]_[12].fastq.gz

The RNA-seq data come from the human GM12878 cell line, from whole-cell, cytosolic and nuclear extractions (see table below).

The libraries are stranded PE76 Illumina GAIIx RNA-Seq from rRNA-depleted Poly-A+ long RNA (> 200 nucleotides in size).

Only reads mapped to the 22q11 locus of the human genome (chr22:16000000-18000000) are used.
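The locus restriction above amounts to a simple coordinate check. The following is a minimal sketch, assuming the region string uses 1-based inclusive coordinates (the samtools-style convention); the helper names `parse_region` and `in_region` are hypothetical and not part of the dataset.

```python
def parse_region(region):
    """Split a 'chrom:start-end' string into (chrom, start, end)."""
    chrom, span = region.split(":")
    start, end = span.split("-")
    return chrom, int(start), int(end)


def in_region(chrom, pos, region):
    """Return True if a 1-based position lies inside the region (inclusive)."""
    r_chrom, r_start, r_end = parse_region(region)
    return chrom == r_chrom and r_start <= pos <= r_end


# Region taken from the text; the coordinate convention is an assumption.
LOCUS = "chr22:16000000-18000000"

print(in_region("chr22", 17000000, LOCUS))  # inside the locus
print(in_region("chr22", 18000001, LOCUS))  # one base past the end
```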

ENCODE ID	Cellular fraction	Replicate ID	File names
ENCSR000COQ	Whole Cell	1	ENCSR000COQ1_1.fastq.gz ENCSR000COQ1_2.fastq.gz
ENCSR000COQ	Whole Cell	2	ENCSR000COQ2_1.fastq.gz ENCSR000COQ2_2.fastq.gz
ENCSR000CPO	Nuclear	1	ENCSR000CPO1_1.fastq.gz ENCSR000CPO1_2.fastq.gz
ENCSR000CPO	Nuclear	2	ENCSR000CPO2_1.fastq.gz ENCSR000CPO2_2.fastq.gz
ENCSR000COR	Cytosolic	1	ENCSR000COR1_1.fastq.gz ENCSR000COR1_2.fastq.gz
ENCSR000COR	Cytosolic	2	ENCSR000COR2_1.fastq.gz ENCSR000COR2_2.fastq.gz
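
The file names in the table follow the pattern `<ENCODE ID><replicate>_<mate>.fastq.gz`. A minimal sketch of grouping such names into read pairs per replicate is shown here; the function `pair_fastq` and the regular expression are illustrative assumptions, not part of the dataset.

```python
import re
from collections import defaultdict

# Assumed pattern: ENCODE ID, replicate digit, '_', mate digit (1 or 2).
NAME_RE = re.compile(r"^(ENCSR\d{3}[A-Z]{3})([12])_([12])\.fastq\.gz$")


def pair_fastq(names):
    """Group FASTQ names by (ENCODE ID, replicate) and order mates 1, 2."""
    groups = defaultdict(dict)
    for name in names:
        m = NAME_RE.match(name)
        if m is None:
            raise ValueError(f"unexpected file name: {name}")
        encode_id, replicate, mate = m.group(1), int(m.group(2)), int(m.group(3))
        groups[(encode_id, replicate)][mate] = name
    pairs = {}
    for key, mates in groups.items():
        # A paired-end replicate needs exactly one file per mate.
        if set(mates) != {1, 2}:
            raise ValueError(f"incomplete pair for {key}")
        pairs[key] = (mates[1], mates[2])
    return pairs


files = ["ENCSR000COQ1_2.fastq.gz", "ENCSR000COQ1_1.fastq.gz"]
print(pair_fastq(files))
```

A replicate for which only one mate is listed raises `ValueError`, which makes a missing or mistyped file name visible early.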

"Known" variants

known_variants.vcf.gz

Known variants come from high-confidence variant calls for GM12878 from the Illumina Platinum Genomes project. These variant calls were obtained by taking into account pedigree information and the concordance of calls across different methods.

We’re using the subset from chromosome 22 only.

Blacklisted regions

blacklist.bed

Blacklisted regions are regions of the genome with anomalous coverage. We use the regions defined for the hg19 assembly, taken from the ENCODE project portal. These regions were identified with DNase and ChIP-seq samples over ~60 human tissues/cell types, and had a very high ratio of multi-mapping to unique-mapping reads and high variance in mappability.
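
A blacklist of this kind is usually applied by discarding positions that overlap any listed interval. The following is a minimal sketch of that overlap test, assuming `blacklist.bed` follows the standard BED convention (0-based, half-open intervals) and that positions to test are 1-based, as in VCF. The helper names and the toy interval are hypothetical.

```python
def parse_bed(lines):
    """Parse BED lines into a dict: chrom -> list of (start, end) tuples."""
    regions = {}
    for line in lines:
        line = line.strip()
        # Skip blank lines and optional BED header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split("\t")[:3]
        regions.setdefault(chrom, []).append((int(start), int(end)))
    return regions


def is_blacklisted(chrom, pos, regions):
    """Check a 1-based position against 0-based, half-open BED intervals."""
    # 1-based pos p is 0-based p - 1, so start <= p - 1 < end,
    # which is equivalent to start < p <= end.
    return any(start < pos <= end for start, end in regions.get(chrom, []))


# Toy interval for illustration only; not taken from blacklist.bed.
toy_bed = ["chr22\t100\t200\texample_region"]
regions = parse_bed(toy_bed)
print(is_blacklisted("chr22", 101, regions))
print(is_blacklisted("chr22", 201, regions))
```

The linear scan is enough for a short list of regions; an interval tree or sorted bisect would be a better fit for a genome-wide blacklist.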
