This page explains some advanced usage options for schist
.
schist
is based on Nested Stochastic Block Models (NSBM), a generative process based on the notion of group of nodes. Here, we use NSBM to cluster cells in scRNA-seq experiments.
Cell to cell relations are commonly represented as a neighborhood graph (typically KNN or SNN), cell groups are identified as graph communities, this step is usually performed maximising graph modularity. In schist
, instead, partitioning is performed maximising the Bayesian posterior probability P(b|A), that is the likehood of generating a network A with a partition b and it is obtained according to
Where P(A|θ,b) is the probability of obtaining the network A given the partition b and additional parameters θ; P(θ,b) is the probability of occurrence of the partition b having observed the netwok A; P(A) is the “model evidence” and it is the same for all possible partitions. Refer to the excellent graph-tool
documentation for more details. Note that maximising this quantity is equivalent to minimizing the entropy
The nested model introduces a hierarchy of priors used to infer the optimal recursive grouping of single cell groups. If you are familiar with Leiden or Louvain methods to find cell groups, you may think at this multilevel approach as a multiresolution one, except that it is not. Here, not only the cell groups at each hierarchy level are found maximising the equation above, but the hierarchy itself (hence the groups of groups) is part of the model.
Since there may be more than one fit with similar probability, schist
uses the graph-tool
routines to apply a Markov chain Monte Carlo sampling of the posterior distribution aiming to converge to the best model.
One of the main limitations of schist
is that it requires significantly more time than any other state of the art approach to identify cell groups. This cost comes with the benefit that it is possible to choose between different parameters according to the likelihood of a partition set to be found over a network.
When using the following invocation
nested_model(adata, collect_marginals=True, equilibrate=True)
an additional step (with fixed number of iterations) is added to execution. During this step, schist
collects the probability of having a certain number of groups for each level of the hierarchy. These are stored into adata.uns['nsbm']['group_marginals']
:
level = 2
S = adata.uns['nsbm']['group_marginals'][level].sum()
p = adata.uns['nsbm']['group_marginals'][level] / S
ng = range(1, len(p) + 1)
bar(ng, p)
xticks(ng)
xlabel('Number of groups')
ylabel('Probability')
title(f'Group marginals for level {level}')
Prior to version 0.3.4, this option would have collected posterior probabilities of cells to belong to a certain group. Since this quantity is interesting and it would have required a long time to compute, we'd rather calculate variation of global entropy by moving all cells to all identified groups and transform that into a probaility, which is now stored into adata.uns['nsbm']['cell_affinity']
. Here, a dictionary keyed with NSBM levels counts the times a cell has been successfully moved to a group. These probabilities can be efficiently used as covariates when looking for marker genes, this approach will weight the belief that a cell belongs to a group. We have prepared a notebook showing an example.
schist
provides an interface to graph-tool
to infer Nested Stochastic Block Models from single cell data in scanpy
. Once models are built, data are partitioned in multiple groups, linked together in hierarchical way. In order to represent a hierarchy, schist
implements a simple plot function that represents data using alluvial plots:
adata = schist.io.read('adata')
schist.pl.alluvial(adata)
This function will plot all levels in hierarchy by default. As many level are uninformative, they can be excluded from the plot
adata = schist.io.read('adata')
schist.pl.alluvial(adata, level_end=5)
Leaf levels can be also excluded
adata = schist.io.read('adata')
schist.pl.alluvial(adata, level_end=5, level_start=2)
graph-tools
has built-in functionalities to plot graphs. Some of these have been implemented into schist
using a syntax compatibile with scanpy
's functions. Note that Fruchterman-Reingold spring-block layout is already implemented into scanpy
, and it gives the same output.
adata = schist.io.read('adata')
schist.tl.draw_graph(adata, layout='fr')
sc.pl.draw_graph(adata, layout='fr', color='nsbm_level_2', legend_loc='on data')
However, schist
allows to seed the plot using the graph tree.
adata = schist.io.read('adata')
schist.tl.draw_graph(adata, layout='fr', use_tree=True)
sc.pl.draw_graph(adata, layout='fr', color='nsbm_level_2')
Default layout is SFDP spring-block layout
adata = schist.io.read('adata')
schist.tl.draw_graph(adata)
sc.pl.draw_graph(adata, layout='sfdp', color='nsbm_level_2', legend_loc='on data')
With tree information
adata = schist.io.read('adata')
schist.tl.draw_graph(adata, use_tree=True)
sc.pl.draw_graph(adata, layout='sfdp', color='nsbm_level_2')