-
Notifications
You must be signed in to change notification settings - Fork 1
/
4_integration_clustering.qmd
229 lines (158 loc) · 6.89 KB
/
4_integration_clustering.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
---
title: "Integration and clustering"
engine: knitr
---
## Material
{{< downloadthis assets/pdf/03_analysis_ST.pdf dname="03_analysis_ST" label="Download the presentation" icon="filetype-pdf" >}}
## Dimensionality reduction
We load the required packages:
```{r}
#| output: false
library(Seurat)
library(ggplot2)
library(clustree)
library(patchwork)
library(dplyr)
```
And we load the list created after normalization and scaling, followed by a merge:
```{r}
seu_list <- readRDS("output/seu_part3.rds")
seu <- merge(seu_list[[1]], seu_list[[2]])
```
In order to perform dimensionality reduction, we first need to select variable features of both slices. To get a good representation of both slices, we take the intersect (i.e. genes that are variable in both slices):
```{r}
#| output: false
VariableFeatures(seu) <- intersect(VariableFeatures(seu_list$Anterior),
VariableFeatures(seu_list$Posterior))
```
::: {.callout-important}
## Exercise
How many variable features do we have? Why did we select fewer genes than the default (check `?VariableFeatures`)?
:::
::: {.callout-tip collapse="true"}
## Answer
Just by typing the object name we already see the number of variable features:
```{r}
seu
```
So, we have `r Seurat::VariableFeatures(seu) |> length()` variable features. Because we selected the features that are variable in both slices, it is fewer genes than the originally selected default.
:::
Now that we have selected the most variable features, we can generate a PCA based on the normalized and scaled data of those:
```{r}
seu <- RunPCA(seu, assay = "SCT", npcs = 50, verbose = FALSE)
DimPlot(seu, reduction = "pca", group.by = "orig.ident") +
scale_color_viridis_d(option = "cividis")
```
Based on the PCA, we can create a UMAP to get a representation of all 50 dimensions in a two dimensional space:
```{r}
#| warning: FALSE
#| message: FALSE
seu <- RunUMAP(seu, reduction = "pca", dims = 1:50)
DimPlot(seu, reduction = "umap", group.by = "orig.ident") +
scale_color_viridis_d(option = "cividis")
```
::: {.callout-important}
## Exercise
The two slices come from one brain, the posterior and anterior sides. Do you expect spots from similar cells/tissue in both slices? Is that represented in the UMAP?
:::
::: {.callout-tip collapse="true"}
## Answer
Yes, it is likely that there are similar spots in both slides, but the spots of both slices hardly overlap in the UMAP. Therefore, it makes sense to do an integration.
:::
## Integration
To integrate the two slices, we first need to select integration features. These are genes that are variable in both slices. We then prepare the data for integration, find the integration anchors (i.e. spots that are within each others neigbourhoods), and integrate the data:
::: {.callout-note}
You can safely ignore the warning: `Warning: Different cells and/or features from existing assay SCT`. See [this issue](https://github.com/satijalab/seurat/issues/7145).
:::
```{r}
#| output: false
features <- SelectIntegrationFeatures(seu_list)
seu_list <- PrepSCTIntegration(seu_list, anchor.features = features)
anchors <- FindIntegrationAnchors(
seu_list,
normalization.method = "SCT",
anchor.features = features
)
seu <- IntegrateData(anchors, normalization.method = "SCT")
```
::: {.callout-important}
## Exercise
How is the integrated data stored in the seurat object?
Hint: type `seu` to get an idea.
:::
::: {.callout-tip collapse="true"}
## Answer
```{r}
seu
```
We see that our object now has four assays, `integrated` (active), `Spatial`, `RNA` and `SCT`. The integrated data is stored in the `integrated` assay. We use this assay only for dimensionality reduction and clustering. When we go to marker gene identification, we use `SCT` again.
:::
Because we re-do the dimensionality reduction, we also again extract the variable features, run the PCA and the UMAP:
```{r}
#| warning: false
seu <- FindVariableFeatures(seu)
seu <- RunPCA(seu, npcs = 50, verbose = FALSE)
seu <- RunUMAP(seu, reduction = "pca", dims = 1:50)
DimPlot(seu, reduction = "umap") +
scale_color_viridis_d(option = "cividis")
```
## Identifying clusters
Seurat implements a graph-based clustering approach. Distances between the spots are calculated based on previously identified PCs. Briefly, Seurat identifies clusters of spots by a shared nearest neighbor (SNN) modularity optimization based clustering algorithm. First, it identifies k-nearest neighbors (KNN) and constructs the SNN graph. Then it optimizes the modularity function to determine clusters. For a full description of the algorithms, see Waltman and van Eck (2013) The European Physical Journal B.
The FindClusters function implements the procedure, and contains a resolution parameter that sets the granularity of the downstream clustering, with increased values leading to a greater number of clusters.
```{r}
resolution_vector <- seq(0.1,1,0.1)
seu <- FindNeighbors(seu, reduction = "pca", dims = 1:50)
seu <- FindClusters(object = seu,
resolution = resolution_vector,
verbose=FALSE)
```
Some new columns appeared in the metadata data frame after the clustering, each representing the cluster ID per spot for a given resolution:
```{r}
colnames(seu@meta.data)
```
To get an overview of the clustering over the different resolutions, we can use `clustree` to get an idea:
```{r}
clustree(seu, prefix = "integrated_snn_res.")
```
::: {.callout-important}
## Exercise
Which resolution would you choose for the clusters? If you have made up your mind, set the cluster column to the default identity with `SetIdent`.
:::
::: {.callout-tip collapse="true"}
## Answer
There is not 'true' clustering, but based on the `clustree` plot, it seems that after a resolution of 0.5, the clustering stays relatively stable:
```{r}
#| warning: false
res <- "integrated_snn_res.0.4"
seu <- SetIdent(seu, value = res)
```
:::
::: {.callout-warning}
The script below assumes that you have set object `res` to the column of your selected resolution, e.g.:
```{r}
#| eval: false
# this is not (necessarily) the correct answer to the previous question!
res <- "integrated_snn_res.0.8"
```
:::
Now that we have selected a resolution, we can color both the UMAP and the slices accordingly. First we defnie some appropriate colors, then we plot the UMAP with `DimPlot` and the slices with `SpatialPlot`.
```{r}
#| warning: false
#| message: false
# define a color palette based on the number of clusters
nclust <- seu[[res]] |> unique() |> nrow()
cluster_cols <- DiscretePalette(nclust, palette = "polychrome")
DimPlot(seu,
group.by = res,
shuffle = TRUE,
cols = cluster_cols)
SpatialPlot(seu, pt.size.factor = 2) +
plot_layout(guides='collect') &
theme(legend.position = "none") &
scale_fill_manual(values = cluster_cols)
```
After integration and clustering, we can save the output as an rds files:
```{r}
saveRDS(seu,
paste0("output/seu_part4.rds"))
```