-
Notifications
You must be signed in to change notification settings - Fork 12
/
5-lca.Rmd
145 lines (103 loc) · 4.53 KB
/
5-lca.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
---
title: "Latent class analysis"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## Highlights
* Latent class analysis is designed for creating clusters out of binary or categorical variables.
* It is a **model-based clustering** method based on maximum likelihood estimation.
* It provides **soft-clustering**: each observation has a probability distribution over cluster membership.
* It uses a **greedy algorithm** with performance that varies with each analysis. Therefore we often will re-run an analysis 10 or 20 times and select the iteration with the best model fit.
* It originates from the broader class of **finite mixture models**.
* Latent dirichlet allocation (LDA) is a Bayesian form of LCA.
## Load data
```{r load_data}
# From 1-clean-data.Rmd
data = rio::import("data/clean-data-imputed.RData")
names(data)
str(data)
(task = list(
factors = names(data)[sapply(data, is.factor)]))
```
## Basic LCA
```{r run_lca}
vars = task$factors
# Specify the formula based on the variables we want to use for clustering.
(f = as.formula(paste0("cbind(", paste(vars, collapse = ", "), ") ~ 1")))
model = poLCA(f, data, nclass = 2, maxiter = 100, nrep = 10, tol = 1e-7)
```
## Compare number of classes
```{r compare_classes}
library(knitr)
library(kableExtra)
run_lca = function(data, formula, nclass = 2, nrep = 10,
maxiter = 5000, tolerance = 1e-7, verbose = FALSE) {
# TODO: capture output if verbose = FALSE
lca = try(poLCA(formula, data, nclass = nclass, maxiter = maxiter, nrep = reps,
tol = tolerance))
if ("try-error" %in% class(lca)) {
cat("Ran into error.\n")
results = list(nclass = nclass, aic = NA, bic = NA, converged = NA, object = NULL)
} else {
converged = lca$numiter < lca$maxiter
# TODO: return object.
results = list(nclass = nclass, aic = lca$aic, bic = lca$bic, converged = converged,
object = lca)
}
return(results)
}
reps = 5L
results = data.frame(matrix(NA, nrow = 0, ncol = 5))
colnames = c("nclass", "aic", "bic", "converged", "description")
colnames(results) = colnames
set.seed(1)
objects = list()
outcome = NULL
# We could use nicer names for the variables here.
var_labels = vars
# Try 2 to 6 classes (clusters)
for (classes in 2:6) {
cat("Running LCA for", classes, "classes.\n")
####################
# Run the latent class analysis.
result = run_lca(data, f, nclass = classes, nrep = reps)
# Extract the return object and store separately.
object = result$object
# Remove the object from the list so that we can store everything else in a dataframe.
result$object = NULL
# Could put something in here.
result$description = ""
# Use a string list name so that we don't need a null [[1]] element.
objects[[as.character(classes)]] = object
# This will be a data.frame of summary statistics, 1 row per LCA.
results = rbind(results, result, stringsAsFactors = FALSE)
# Correct colnames again.
colnames(results) = colnames
#####
# Generate covariate table.
# Currently restricting to complete cases. Could use imputed data instead possibly.
temp_data = na.omit(data[, c(outcome, vars)])
# This function is defined in R/lca-covariate-table.R
table = lca_covariate_table(object, var_labels = var_labels,
outcome = NULL,
format = "html", # could be latex instead.
data = temp_data[, vars])
################
print(kable(table, booktabs = TRUE, digits = c(rep(1, classes), 3, 3),
col.names = linebreak(colnames(table), align = "c"),
escape = FALSE) %>%
kable_styling(latex_options = c("scale_down"#,
# "striped"
)) %>%
row_spec(nrow(table) - 1L, hline_after = TRUE))
# TODO: generate Excel export also, ideally as a separate tab within a single file.
}
results
```
## Resources
* Collins and Lanza. (2009). [Latent class and latent transition analysis](https://smile.amazon.com/Latent-Class-Transition-Analysis-Applications/dp/0470228393/)
* [Latent profile analysis in R](https://cran.r-project.org/web/packages/tidyLPA/vignettes/Introduction_to_tidyLPA.html)
## References
Figueroa, S. C., Kennedy, C. J., Wesseling, C., Wiemels, J. M., Morimoto, L., & Mora, A. M. (2020). Early immune stimulation and childhood acute lymphoblastic leukemia in Costa Rica: A comparison of statistical approaches. Environmental Research, 182, 109023.