Please cite using the zenodo DOI: 10.5281/zenodo.12726713
A package to perform single cell analysis using various tools. The package centers around using Seurat
objects and has functions for preprocessing, plotting, running differential expression and gene ontology, and naming popultions.
This package works with Seurat
version 3 objects.
If you use this package for your own analysis, please acknowledge this repository. You can also consider citing my eLife paper. The package I wrote for that paper was updated to work with Seurat
version 3 objects here.
Wells, K. L., Miller, C. N., Gschwind, A. R., Wei, W., Phipps, J. D., Anderson, M. S., & Steinmetz, L. M. (2020). Combined transient ablation and single cell rna sequencing reveals the development of medullary thymic epithelial cells. ELife, 9, 1–80. https://doi.org/10.7554/eLife.60188
Install clustifyr
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("clustifyr")
Install devtools
install.packages("devtools")
Install scAnalysisR
library(devtools)
install_github("CUAnschutzBDC/scAnalysisR")
This package is meant as a wrapper for many functions associated with single cell Analysis. It performs functions from loading in the data, processing the data, running differential expression, running gene ontology, and plotting.
There are a few steps that you will need to run outside of this package, for example filtering data based on quality, adding mitochondrial percents, and demultiplexing hashtags. If you want any functionality added here, feel free to submit an issue.
Data that has been processed with cellranger
can be read in using the create_seurat_object
function. This has a few important arguments
sample
- The name of the sample. This should match the sample name that you provided to cellranger as it is used to identify the path of the cellranger output. This name will also be added as theorig.ident
in the seurat object.count_path
- If you rancellranger
, the path to the output directory you gave cellranger. Otherwise, this path can be more specific. See below for details on the pathADT
- Were ADTs included in this run? If so, set this toTRUE
, if not set this toFALSE
. If ADTs were included you must set this toTRUE
because the input data from cellranger looks different if multiple assays were included.hashtag
- Were HTOs included? If so, set this toTRUE
.min_features
- What minimum features should be passed toCreateSeuratObject
. Default is 200min_cells
- What minimum cells should be passed toCreateSeuratObject
. Default is 3.hashtag_ident
- What are the identities of the hashtags that you passed in your antibody file to cellranger? This should be a list of hashtags iec("HTO1", "HTO2", "HTO3")
. This is used to pull out the hashtags and put them into a separate assay calledHTO
.tenx_structure
- What cellranger was run? Options arecount
,multi
,multi7
, ornone
. This tells how to find the input data.
Path to data
The path to the data is defined by:
sample
count_path
tenx_structure
If you ran cellranger count
on your data and gave it results
as your output directory and sample1
as your sample name, the files we need are in results/sample1/outs/filtered_feature_bc_matrix
. To tell this funtion how to find this data, you just need:
so <- create_seurat_object(sample = "sample1", count_path = "results", tenx_structure = "count")
If you ran cellranger multi
on your data and gave it results
as your output directory and sample1
as your sample name, the files we need are in results/sample1/outs/count/filtered_feature_bc_matrix
. To tell this funtion how to find this data, you just need:
so <- create_seurat_object(sample = "sample1", count_path = "results", tenx_structure = "multi")
If you ran cellranger multi
with cellranger 7 or above on your data and gave it results
as your output directory and sample1
as your sample name, the files we need are in results/sample1/outs/per_sample_outs/sample1/count/filtered_feature_bc_matrix
. To tell this funtion how to find this data, you just need:
so <- create_seurat_object(sample = "sample1", count_path = "results", tenx_structure = "multi7")
The goal of this function is to very easily find your files with minimal input from you. If there is a problem finding the files, the error message will print the path of the directory where it is looking for the files so you can more easily debug. In the case that your output files are not in the above configuration, you can provide the path to the directory that contains the matrix.mtx
, features.tsv
, and barcodes.tsv
files and have tenx_structure = "none"
so <- create_seurat_object(sample = "sample1", count_path = "full/path/to/filtered_feature_bc_matrix",
tenx_structure = "none")
HTOs and ADTs
If you included any antibody capture elements in your cellranger run, you will need to set either HTO
or ADT
or both to TRUE
. This is because the data will be read in as a list and the function will fail if it does not know to look for this.
- If
ADT = TRUE
andHTO = FALSE
, your Seurat object will be returned with anADT
assay with all of the features from your antibody file as the features. Your seurat object will also have anRNA
assay that contains the gene expression data. - If
ADT = TRUE
andHTO = TRUE
, your Seurat object will be returned with anADT
assay with all of the features from your antibody file except for those listed inhashtag_idents
as the features. It will also have aHTO
assay will all of the features from thehashtag_idents
as your features. If you don't sethashtag_idents
, it will search for any features that havehashtag
in the name. Your seurat object will also have anRNA
assay that contains the gene expression data. - If
ADT = FALSE
andHTO = TRUE
, your Seurat object will be returned withHTO
assay will all of the features from thehashtag_idents
as your features. If you don't sethashtag_idents
, it will search for any features that havehashtag
in the name. Your seurat object will also have anRNA
assay that contains the gene expression data.
To run PCA, use the function PCA_dimRed
.
so <- PCA_dimRed(so, assay = "RNA", reduction_name = NULL, vars_to_regress = NULL
This has only a handful of arguments
sample_object
- The seurat object loaded in bycreate_seurat_object
assay
- The assay to runPCA
on. This can be any assay in your object, for exampleRNA
,ADT
, orSCT
. If you are running on theADT
assay, if you have 30 or fewer ADTs you likely should just runUMAP
directly on the ADT expression data and not perform PCA first.reduction_name
- What the name of the reduction should be. These will be the names of the dimensional redution saved in the seurat object, so you can run this function on all assays without overriding the existing dimensional reductions. Many are provided but can be overridden by providing a avlue toreduction_name
. The default values are:- If
assay = "RNA"
:reduction_name = pca
- If
assay = "ADT"
:reduction_name = apca
- If
assay = "SCT"
:reduction_name = sctpca
- If
assay = integrated
:reduction_name = pca
- If
vars_to_regress
- Only used if running on theADT
assay. These are used when scaling the ADT data before running PCA.
The return of this is a seurat object with a new dimensional reduction named based on the reduction_name
provided
Helpful plots
To help evaluating the PCA, a plotting function is provided that returns a handful of qc plots.
qc_plots <- plot_PCA(so, HTO = FALSE, ADT = FALSE,
assay = "RNA", jackstraw = TRUE,
reduction = NULL, data_type = "RNA",
ffpe = TRUE)
This function takes your seurat object and returns helpful qc plots as a named list
- The top genes assocaited with PC1 and PC2
- PC plot colored by the sample (orig.ident)
- PC plot colored by the percent mitochondrial reads
- PC plot colored by the number of RNA features
- PC plot colored by the number of RNA counts
- PC plot colored by the number of ADT counts (if
ADT = TRUE
) - PC plot colored by the number of ADT feature (if
ADT = TRUE
) - The HTO classification (if
HTO = TRUE
) - A Jackstraw plot (if
jackstraw = TRUE
) to show how many PCs to use for downstream analysis. - An Elbow plot to show how many PCs to use for downstream analysis
This function will assume your pca reduction is named based on the default values above. If your pca reduction is not based on the reductions above, you can supply your own to reduction
.
The ffpe
is only used if data_type = spatial
. It is ignored if data_type = "RNA"
Note this function requires that you have identified the percent mitochondiral reads and saved them as a column "percent.mt" in your seurat object metadata. If you have HTOs, this also requires that you have a column callsed "HTO_classification" that identifies the different hashtagged samples.
The arguments of this function are:
sample_object
A seurat objectHTO
OPTIONAL if HTOs were included in the seurat object. Default is FALSEADT
OPTIONAL If ADTs were included in the experiment. Default is FALSEassay
OPTIONAL What assay to use. This is just used to locate the PCA reduction (the reduction name provided assumes that no reduction name was supplied to PCA_dimRed, if you did provide a different reduction name, you must use "reduction" below). Default is "RNA".jackstraw
OPTIONAL if a jackstraw plot should be made. This can help in determining number of PCs to use, but it can be very slow. If TRUE, a jackstraw plot will only be made if the assay is also "RNA". Default is TRUE.reduction
OPTIONAL the name of the PCA reduction. Not required if you used the default reduction names in PCA_dimRed. Default is NULLdata_type
OPTIONAL If the data is "RNA" or "spatial". Default is "RNA"ffpe
OPTIONAL If the data type is "spatial" is it frozen (FALSE) or ffpe (TRUE), This is important because ffpe samples are from a probe based approach and don't include mitochondiral genes.
One function exists to perform UMAP dimensonality reduction and clustering. This is because it is best to keep the number of PCs consistent when running both of these functions.
To generate a UMAP and clusters
umap_res <- group_cells(so, sample_name = NULL, save_dir = NULL,
nPCs = 10, resolution = 0.8, assay = "RNA",
HTO = FALSE, reduction = NULL)
so <- umap_res$object
umap_plots <- umap_res$plots
This function both generates a UMAP and clustering and generates helpful plots to visualize the clustering and the samples. Because it does both, this function returns a list that includes the plots and your object. Be sure to follow the code above to make sure that you are able to get both results.
In this function, setting a save_dir will also save the output plots as pdfs. Leaving as NULL will not save any plots.
- ggrastr
- ggrastr
Example scripts using this package are available at https://github.com/CUAnschutzBDC/snakemake_pipelines/tree/main/scRNA_seq/src/scripts/indiviual_analysis. Follow the scripts in order for a complete analysis.
For examples of integrating data, scripts are available https://github.com/CUAnschutzBDC/snakemake_pipelines/tree/main/scRNA_seq/src/scripts/integrated_analysis