-
Notifications
You must be signed in to change notification settings - Fork 5
/
QC.Rmd
351 lines (269 loc) · 12.5 KB
/
QC.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
---
output:
html_document:
keep_md: yes
number_sections: yes
toc: yes
---
```{r echo=FALSE}
knitr::opts_chunk$set(cache = TRUE)
knitr::opts_knit$set(verbose = TRUE)
# knitr::opts_chunk$set(dev=c('svg', 'png'))
options(width = 100)
ggplot2::theme_set(ggplot2::theme_bw())
```
Sample C1 CAGE libraries comparing PrimeScript to SuperScript III
=================================================================
This is a preview of the [C1 CAGE][] technology. We made two C1 runs differing
by the RT enzyme used: PrimeScript or SuperScript III, and sequenced them in
multiplex on a MiSeq v2 kit.
[C1 CAGE]: https://www.fluidigm.com/c1openapp/scripthub/script/2015-07/c1-cage-1436761405138-3
__Warning: this is unpublished results to show what C1 CAGE libraries look like.
A proper enzyme benchmark needs replicates, but this is out of the scope of
this document.__
To quickly assess the performance of C1 CAGE (standard condition: SuperScript III),
follow the links for the [promoter rate](#annotation-of-transcript-counts-per-run)
and [gene discovery curves](#rarefaction-hanabi-plot).
Metadata
--------
```{r library_information}
LIBRARY <- "150519_M00528_0125_000000000-ACUAB"
RunA <- "1772-066-262"
RunB <- "1772-066-263"
ctrls <- list( RunA=list(posi="C01", nega="D01")
, RunB=list(posi="C03", nega="H09"))
```
`LIBRARY` indicates the MiSeq run ID (`r LIBRARY`), `RunA` and `B` indicate
the C1 array serial numbers (`r RunA` and `r RunB`), and are used as C1 run
IDs. The position of the positive controls (wells `r ctrls$RunA$posi` and
`r ctrls$RunB$posi`) and negative controls (wells `r ctrls$RunA$nega` and
`r ctrls$RunB$nega`) in the 96-well plate is also indicated.
Annotation and gene symbols
---------------------------
This [knitr](http://yihui.name/knitr/) file runs a few shell commands before
starting the analysis in `R`, to:
- convert the single-base-resolution ("_level1_"") [OSC Table](http://sourceforge.net/projects/osctf/)
to BED format,
- intersect this data to a BED file containing annotations (_promoters_,
_exons_, ...), that was created by [parsing GENCODE 23](https://gist.github.com/charles-plessy/9dbc8bc98fb773bf71b6),
- intersect with gene symbols using the same method.
```{r annotate_bed, engine="bash"}
function osc2bed {
zcat $1 |
grep -v \# |
sed 1d |
awk '{OFS="\t"}{print $2, $3, $4, "l1", "1000", $5}'
}
function bed2annot {
bedtools intersect -a $1 -b gencode.v23.annotation.bed -s -loj |
awk '{OFS="\t"}{print $1":"$2"-"$3$6,$10}' |
bedtools groupby -g 1 -c 2 -o collapse
}
function bed2symbols {
bedtools intersect -a $1 -b gencode.v23.annotation.genes.bed -s -loj |
awk '{OFS="\t"}{print $1":"$2"-"$3$6,$10}' |
bedtools groupby -g 1 -c 2 -o distinct
}
osc2bed output/mylevel1file.l1.osc.gz | tee level1.bed | bed2annot - > level1.annot
bed2symbols level1.bed > level1.genes
```
`R` packages needed in this analysis
------------------------------------
We use `oscR` to load expression data in a `data.table`, `smallCAGEqc` for a
lot of ad-hoc functions needed in our projects, `magrittr` for the `%>%` pipe
operator, `ggplot2` for figures and `vegan` for subsample analysis.
[oscR](https://github.com/charles-plessy/oscR) and
[smallCAGEqc](https://github.com/charles-plessy/smallCAGEqc) are available
from GitHub, see the instructions on their home pages. This Rmarkdown file
is derived from the `nanoCAGE` template distributed in the smallCAGEqc package.
```{r load_R_libraries, message=F}
library(oscR)
library(smallCAGEqc)
stopifnot(
packageVersion("oscR") >= "0.2.0"
, packageVersion("smallCAGEqc") > "0.11.2"
)
library(data.table)
library(magrittr)
library(reshape)
library(ggplot2)
library(vegan)
```
Per-sample metadata table in `R`
--------------------------------
Here, we store the sample metadata in a table called `libs`, where rows
represent single cell libraries (hence the name), and columns are a source
of metadata like error codes, number of extracted reads, gene count, etc.
The `libs` table is constructed from a processing summary file where each
line is a tabulation-separated triple giving metadata for one sample. For
instance, the line below indicates that cell A01 from run A had 5,855 counts:
transcript_count RunA_A01 5855
```{r load_logs}
setwd("output")
libs <- loadLogs('logs') %>% llPostProcess('nano-fluidigm')
setwd("..")
libs %<>% within(tagdust <- extracted - cleaned - rdna - spikes)
```
### Cell picture QC
Cell pictures were taken on a Cellomics platform and curated as described in our
[Cell-Cycle-on-C1](https://github.com/Population-Transcriptomics/Cell-Cycle-on-C1/blob/master/fluorescence/Fluorescence-measured-in-ImageJ.md)
project. Results (first pass, no cross-checking) are stored in this directory
in the files [1772-066-262.imageQC.txt](./1772-066-262.imageQC.txt) and
[1772-066-263.imageQC.txt](./1772-066-263.imageQC.txt). The metadata are loaded
in `R` and merged in the `libs` table. The cells were stained with live (channel 2) /
dead (channel 3) dyes.
```{r load_fluo_data}
read.fluo <- function(RUN) read.delim( paste0(RUN, ".imageQC.txt")
, row.names="cell_id"
, stringsAsFactors = FALSE)
fluo <- rbind(read.fluo(RunA), read.fluo(RunB))
libs <- cbind(libs, subset(fluo,,c("mean_ch2", "bg_mean_ch2", "mean_ch3", "bg_mean_ch3", "Error", "Comment")))
libs$Error <- factor( libs$Error
, levels=c("0-No Error", "1-No cell", "2-Debris", "3-OutOfFocus", "4-MultipleCells", "5-Control", "6-Dead"))
```
A hardcoded threshold of 2.5 is used to identify dead cells. The histogram
below is to check if this value makes sense in this dataset. Cells above
the threshold are flagged _dead_.
```{r dead-cells, fig.height=2.5}
hist( libs$mean_ch3 - libs$bg_mean_ch3
, br = 100
, main = "Current threshold for identifying dead cells"
, xlab = "mean_ch3 - bg_mean_ch3" )
deadThresh <- 2.5
abline(v=deadThresh, col="red")
libs[libs$mean_ch3 - libs$bg_mean_ch3 > deadThresh, "Error"] <- "6-Dead"
```
### Positive and negative controls
Some samples with errors were replaced by the positive and negative controls.
The following commands update the `libs` table to reflect this.
```{r flag-controls}
libs[libs$Well %in% ctrls[["RunA"]] & libs$Run == RunA, "Error"] <- "5-Control"
libs[libs$Well %in% ctrls[["RunB"]] & libs$Run == RunB, "Error"] <- "5-Control"
libs[libs$Well == ctrls$RunA$posi & libs$Run == RunA, "Comment"] <- "Positive control"
libs[libs$Well == ctrls$RunB$posi & libs$Run == RunB, "Comment"] <- "Positive control"
libs[libs$Well == ctrls$RunA$nega & libs$Run == RunA, "Comment"] <- "Negative control"
libs[libs$Well == ctrls$RunB$nega & libs$Run == RunB, "Comment"] <- "Negative control"
```
### cDNA concentration
Excel templates provided by Fluidigm were used for fluorometric measurement of cDNA
yield with the PicoGreen dye, and are parsed with the following commands. A histogram
of the cDNA yields is plotted. The distribution is quite even because in these libraries
we use a large quantity of spikes.
```{r cDNA_concentration, fig.height=2.5, message=FALSE, warning=FALSE}
read.pg <- function(RUN)
paste0(RUN, ".picogreen.xlsx") %>%
fldgmPicoGreen("PN 100-6160") %>%
extract(,"concentration")
libs$Concentration <- c(read.pg(RunA), read.pg(RunB))
fldgmConcentrationPlot(libs)
```
Transcript counts data
----------------------
### Data load and quantification
The main output of C1 CAGE single-nucleotide resolution molecule counts
at transcription start sites, here in a _level1_ OSC Table, that is
read as a _data.table_ by the `fread.osc` command of the `oscR` package.
After loading, let's count the number of detected TSS per sample, and
store the result in the `libs` table.
```{r load_osc_data}
l1 <- fread.osc("output/mylevel1file.l1.osc.gz", dropIdCoords=TRUE)
setnames(l1, colnames(l1) %>% sub('raw.', '', .) %>% sub('.None', '', .) %>% sub('RunA',RunA,.) %>% sub('RunB',RunB,.) %>% sub("_R1", "", .))
libs$l1 <- colSums(l1 > 0)
```
### Annotation
Using the files created earlier by the shell commands and the `hierarchAnnot` command,
annotate the TSS, and add a summary to the `libs` command.
```{r annotate_l1}
annot.l1 <- read.table("level1.annot", head=F, col.names=c('id', 'feature'), row.names=1)
annot.l1 <- hierarchAnnot(annot.l1)
libs <- cbind(libs, t(rowsum(l1, annot.l1[,'class']))[rownames(libs),])
libs$group <- libs$Error
```
#### Annotation statistics for each run, grouped by error codes from the image QC
Roughly 10 % only of the reads align to ribosomal RNA. A large number of reads are
PCR duplicates (here in the category _mapped_) in the sense that they have the same
transcription start site and unique molecular identifier. These PCR duplicates are
not lost as they are used for paired-end assembly of _CAGEscan fragments_ where each
independent tagmentation event reconstitutes the original cDNA sequence.
```{r plotAnnotAll, warning=FALSE, fig.height=20}
plotAnnot(libs, 'step', "All cells", libs$samplename)
```
```{r plotAnnotPS, warning=FALSE}
plotAnnot(libs[libs$Run==RunA,], 'step', RunA)
```
```{r plotAnnotSSIII, warning=FALSE}
plotAnnot(libs[libs$Run==RunB,], 'step', RunB)
```
```{r plotAnnotQcAll, warning=FALSE, fig.height=20}
plotAnnot(libs, 'qc', "All cells", libs$samplename)
```
```{r plotAnnotQcPS, warning=FALSE}
plotAnnot(libs[libs$Run==RunA,], 'qc', RunA)
```
```{r plotAnnotQcSSIII, warning=FALSE}
plotAnnot(libs[libs$Run==RunB,], 'qc', RunB)
```
#### Annotation of transcript counts, per run
In proportion of unique molecule counts, the number of counts for promoter
regions is larger than 50 %.
```{r plotAnnotCountsRuns, warning=FALSE}
plotAnnot(libs[libs$Error == "0-No Error",], "counts", 'Transcript counts', libs[libs$Error == "0-No Error", "Run"])
```
Gene counts and expression levels
---------------------------------
### Gene symbols
Gene expression level is calculated by intersecting the TSS with known genes.
The number of detected genes per cell is counted and summarised in the `libs` table.
```{r genesymbols_l1}
genesymbols <- read.table("level1.genes", col.names=c("cluster","symbol"), stringsAsFactors=FALSE)
rownames(genesymbols) <- genesymbols$cluster
genes <- rowsum(l1, genesymbols$symbol)
libs$genes <- countSymbols(genes)
```
### Correlation between runs
Here, we observe a strong difference between the runs.
```{r heatmapCluster}
NMF::aheatmap(cor(genes[-1 ,libs$Error == "0-No Error"]), annCol=list(Run=libs[libs$Error == "0-No Error", "Run"]))
```
### Gene count by error code and run
Obviously, we detect more genes if there were multiple cells...
```{r geneCount, fig.height=2.5}
dotsize <- 50
ggplot(libs, aes(x=Error, y=genes)) +
stat_summary(fun.y=mean, fun.ymin=mean, fun.ymax=mean, geom="crossbar", color="gray") +
geom_dotplot(aes(fill=Error), binaxis='y', binwidth=1, dotsize=dotsize, stackdir='center') +
coord_flip() + facet_wrap(~Run)
```
```{r geneBoxplot}
t.test(data=subset(libs, Error == "0-No Error"), genes ~ Run)
```
The number of genes detected in `r RunA` is significantly lower than in `r RunB`.
But could it be explained by sampling depth ? Or would the trend be
reversed if we would sequence more ? We answer this question by plotting
how many genes are discovered as the number of reads per sample increases.
### Rarefaction (hanabi plot)
Since we count unique transcripts with the _unique molecular identifiers_ in C1
CAGE's template-switching oligonucleotides, here we study the increase of gene
discovery per transcript count, instead of per sequence read.
```{r hanabi}
hanabiPlot( hanabi( genes[,libs$Error=="0-No Error",]
, from=0)
, ylab='number of genes detected'
, xlab='number of unique molecule counts'
, main='Gene discovery'
, GROUP=libs[libs$Error=="0-No Error", "Run"])
```
The plots suggest that deeper (HiSeq) sequencing of the `r RunA` library
(PrimeScript) would not increase much the number of detected genes, which already
saturates. However, the libraries made with SuperScript III (`r RunB`) clearly
have the potential to deliver more information by deep sequencing.
Conclusion
----------
We hope that this document will give you a rough idea on how to pre-process and
quality-control C1 CAGE data in R.
Using Rstudio, you can run the `R` code piece by piece (but you will have to run
the shell commands by hand), to better study how the data is processed in this
example.
We use a comparison of two reverse-transcriptases as
an supporting example, but a final conclusion on this particular question
requires analysis of replicates, which is out of the scope here.