---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "80%",
  out.height = "80%",
  fig.align = "center",
  dpi = 250
)
```
# TMSig: Tools for Analyzing Molecular Signatures
<!-- badges: start -->
![R package version](https://img.shields.io/github/r-package/v/EMSL-Computing/TMSig?label=R%20package)
<!-- badges: end -->
The `TMSig` **R** package contains tools to prepare, analyze, and visualize
named lists of mathematical sets, with an emphasis on molecular signatures (such
as gene or kinase sets). It includes fast, memory-efficient functions to
construct sparse incidence and similarity matrices and filter, cluster, invert,
and decompose sets. Additionally, bubble heatmaps can be created to visualize
the results of any differential or molecular signatures analysis.
We define a molecular signature as any collection of genes, proteins, post-translational modifications (PTMs), metabolites, lipids, or other molecules with an associated biological interpretation. Most molecular signatures databases are gene-centric, such as the Molecular Signatures Database (MSigDB; Liberzon et al., 2011, 2015), though there are others like the Metabolomics Workbench Reference List of Metabolite Names (RefMet) database (Fahy & Subramaniam, 2020).
## Installation
To install this package, start R (>= 4.4.0) and enter:
```r
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("TMSig")
```
You can install the development version of `TMSig` like so:
``` r
if (!require("devtools", quietly = TRUE))
    install.packages("devtools")

# Install package and build vignettes
devtools::install_github("EMSL-Computing/TMSig", build_vignettes = TRUE)
```
## Overview
Below is an overview of some of the core functions.
- `readGMT`: create a named list of sets from a GMT file.
- `sparseIncidence`: compute a sparse incidence matrix with unique sets as rows and unique elements as columns. A value of 1 indicates that a particular element is a member of that set, while a value of 0 indicates that it is not.
- `similarity`: compute the matrix of pairwise Jaccard, overlap, or Ōtsuka similarity coefficients for all pairs of sets $A$ and $B$, where
    - Jaccard(A, B) = $\frac{|A \cap B|}{|A \cup B|}$
    - Overlap(A, B) = $\frac{|A \cap B|}{\min(|A|, |B|)}$
    - Ōtsuka(A, B) = $\frac{|A \cap B|}{\sqrt{|A| \times |B|}}$
- `filterSets`: restrict sets to only those elements in a pre-determined background, if provided, and only keep those that pass minimum and maximum size thresholds.
- `clusterSets`: hierarchical clustering of highly similar sets. Used to reduce redundancy prior to analysis.
- `decomposeSets`: decompose all pairs of sufficiently overlapping sets, $A$ and $B$, into 3 disjoint parts.
    1. The elements unique to $A$: $A \setminus B$ ("A minus B")
    2. The elements unique to $B$: $B \setminus A$ ("B minus A")
    3. The elements common to both $A$ and $B$: $A \cap B$ ("A and B")
- `invertSets`: swap positions of sets and elements so that elements become set names and set names become elements.
- `cameraPR.matrix`: a fast matrix method for `limma::cameraPR` for testing molecular signatures in one or more contrasts. Pre-Ranked Correlation Adjusted MEan RAnk gene set testing (CAMERA-PR) accounts for inter-gene correlation to control the type I error rate (Wu & Smyth, 2012).
- `enrichmap`: visualize molecular signature analysis results, such as those from `cameraPR.matrix`, as a bubble heatmap with signatures as rows and contrasts as columns.
```{r bubble-heatmap, echo=FALSE, fig.alt="Example bubble heatmap."}
knitr::include_graphics("figures/bubble_heatmap.png")
```
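The decomposition and inversion operations listed above can be illustrated with plain set operations. The following is a minimal base R sketch on toy data, not the `TMSig` implementation; the object names (`A`, `B`, `parts`, `sets`, `inverted`) are made up for illustration.

```r
# Base R sketch (not the TMSig implementation): decompose two overlapping
# sets into three disjoint parts, then invert a named list of sets.
A <- letters[1:5] # "a" "b" "c" "d" "e"
B <- letters[3:6] # "c" "d" "e" "f"

parts <- list(
  "A minus B" = setdiff(A, B),
  "B minus A" = setdiff(B, A),
  "A and B"   = intersect(A, B)
)
parts

# Inversion: elements become names and set names become elements
sets <- list(Set1 = c("a", "b"), Set2 = c("b", "c"))
long <- stack(sets) # columns: values (elements), ind (set names)
inverted <- split(as.character(long$ind), long$values)
inverted
```

The three parts are pairwise disjoint and their union is $A \cup B$, so no element is lost or counted twice.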
## Examples
A few core functions are demonstrated below. Please refer to
`vignette(topic = "TMSig", package = "TMSig")` for further examples of how to
use this package.
```{r}
library(TMSig)
# Named list of sets
x <- list("Set1" = letters[1:5],
          "Set2" = letters[1:4], # subset of Set1
          "Set3" = letters[1:4], # aliased with Set2
          "Set4" = letters[1:3], # subset of Set1-Set3
          "Set5" = c("a", "a", NA), # duplicates and NA
          "Set6" = c("x", "y", "z"), # distinct elements
          "Set7" = letters[3:6]) # overlaps with Set1-Set4
x
```
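The list above deliberately contains duplicates and an `NA`. As a rough illustration of the kind of cleanup and size filtering described for `filterSets` in the Overview, here is a base R sketch. The helper `filter_sets_sketch` and its arguments are hypothetical and are not part of `TMSig`; the package's own handling of duplicates and missing values may differ.

```r
# Hypothetical helper, not part of TMSig: clean and size-filter sets
filter_sets_sketch <- function(x, background = NULL,
                               min_size = 1L, max_size = Inf) {
  # Drop missing values and duplicated elements within each set
  x <- lapply(x, function(s) unique(s[!is.na(s)]))
  # Optionally restrict every set to the background
  if (!is.null(background)) {
    x <- lapply(x, intersect, background)
  }
  # Keep only sets whose size falls within the thresholds
  sizes <- lengths(x)
  x[sizes >= min_size & sizes <= max_size]
}

toy <- list(Set1 = letters[1:5],
            Set5 = c("a", "a", NA),
            Set6 = c("x", "y", "z"))
filter_sets_sketch(toy, background = letters[1:10], min_size = 2L)
```

In this toy call only `Set1` survives: `Set5` shrinks to a single element, and `Set6` has no elements in the background.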
```{r}
(imat <- sparseIncidence(x)) # incidence matrix
tcrossprod(imat) # pairwise intersection and set sizes
crossprod(imat) # occurrence of each element and pair of elements
```
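To see why `tcrossprod` of an incidence matrix yields intersection and set sizes, it can help to build a small dense version by hand. This is only a base R sketch on a toy list; `sparseIncidence` returns a sparse matrix, and the names `sets`, `elements`, and `imat_dense` are illustrative.

```r
# Dense base R sketch of an incidence matrix: sets as rows, elements as
# columns, 1 if the element belongs to the set and 0 otherwise.
sets <- list(Set1 = c("a", "b", "c"),
             Set2 = c("b", "c", "d"),
             Set3 = c("x", "y"))

elements <- sort(unique(unlist(sets)))
imat_dense <- t(vapply(sets, function(s) as.numeric(elements %in% s),
                       numeric(length(elements))))
colnames(imat_dense) <- elements

# The product of row i and row j counts shared elements, so the diagonal
# holds set sizes and the off-diagonal entries hold intersection sizes.
tcrossprod(imat_dense)
```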
```{r}
## Calculate matrices of pairwise Jaccard, overlap, and Ōtsuka similarity coefficients
similarity(x) # Jaccard (default)
similarity(x, type = "overlap") # overlap
similarity(x, type = "otsuka") # Ōtsuka
```
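For a single pair of sets, the three coefficients defined in the Overview can be checked by hand. The base R sketch below uses two toy sets with the same elements as `Set1` and `Set7`; the variable names are illustrative.

```r
# Worked example of the three similarity coefficients for one pair of sets
A <- letters[1:5] # "a" "b" "c" "d" "e"
B <- letters[3:6] # "c" "d" "e" "f"

n_common <- length(intersect(A, B)) # |A and B| = 3

jaccard <- n_common / length(union(A, B))        # 3 / 6
overlap <- n_common / min(length(A), length(B))  # 3 / 4
otsuka  <- n_common / sqrt(length(A) * length(B)) # 3 / sqrt(20)

c(jaccard = jaccard, overlap = overlap, otsuka = otsuka)
```

The overlap coefficient equals 1 whenever one set is a subset of the other, whereas the Jaccard coefficient equals 1 only when the two sets are identical.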
```{r}
## Cluster sets based on their similarity
# Cluster aliased sets
clusterSets(x, cutoff = 1)
# Cluster subsets
clusterSets(x, cutoff = 1, type = "overlap")
```
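One plausible way to cluster sets by similarity is hierarchical clustering on the dissimilarity `1 - similarity`, cutting the tree at height `1 - cutoff`. The base R sketch below assumes complete linkage and a dense Jaccard matrix; this is an illustrative assumption, not necessarily how `clusterSets` works internally.

```r
# Base R sketch (assumed approach, not the TMSig implementation)
sets <- list(Set1 = letters[1:5],
             Set2 = letters[1:4],
             Set3 = letters[1:4],
             Set6 = c("x", "y", "z"))

# Pairwise Jaccard similarity from a dense incidence matrix
elements <- unique(unlist(sets))
inc <- t(vapply(sets, function(s) as.numeric(elements %in% s),
                numeric(length(elements))))
inter <- tcrossprod(inc) # intersection sizes
sizes <- diag(inter)     # set sizes
jaccard <- inter / (outer(sizes, sizes, "+") - inter)

# Complete-linkage clustering on 1 - similarity, cut at 1 - cutoff
cutoff <- 1
tree <- hclust(as.dist(1 - jaccard), method = "complete")
clusters <- cutree(tree, h = 1 - cutoff)
clusters
```

With `cutoff = 1`, only sets at distance 0 (identical sets) share a cluster, so `Set2` and `Set3` are grouped while `Set1` and `Set6` remain on their own.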
## Issues
If you encounter a problem with TMSig, please [create a new issue](https://github.com/EMSL-Computing/TMSig/issues) that includes:
1. A clear statement of the problem in the title
2. A (small) reproducible example
3. Additional detailed explanation, as needed
4. Output of `sessionInfo()`
## Pull Requests
- Verify that `devtools::check(document = TRUE)` runs without errors, warnings, or notes before [submitting a pull request](https://github.com/EMSL-Computing/TMSig/pulls).
- All contributed code should, ideally, adhere to the tidyverse style guide: https://style.tidyverse.org/index.html. This makes the code easier for others to understand, debug, and modify. When in doubt, refer to the existing codebase.
## References
Fahy, E., & Subramaniam, S. (2020). RefMet: A reference nomenclature for metabolomics. _Nature Methods, 17_(12), 1173–1174. [doi:10.1038/s41592-020-01009-y](https://doi.org/10.1038/s41592-020-01009-y)
Liberzon, A., Subramanian, A., Pinchback, R., Thorvaldsdóttir, H., Tamayo, P., & Mesirov, J. P. (2011). Molecular signatures database (MSigDB) 3.0. _Bioinformatics, 27_(12), 1739–1740. [doi:10.1093/bioinformatics/btr260](https://doi.org/10.1093/bioinformatics/btr260)
Liberzon, A., Birger, C., Thorvaldsdóttir, H., Ghandi, M., Mesirov, J. P., & Tamayo, P. (2015). The Molecular Signatures Database (MSigDB) hallmark gene set collection. _Cell Systems, 1_(6), 417–425. [doi:10.1016/j.cels.2015.12.004](https://doi.org/10.1016/j.cels.2015.12.004)
Wu, D., & Smyth, G. K. (2012). Camera: A competitive gene set test accounting for inter-gene correlation. _Nucleic Acids Research, 40_(17), e133. [doi:10.1093/nar/gks461](https://doi.org/10.1093/nar/gks461)