syntheval makes it simple to evaluate the utility and disclosure risks of synthetic data. The package is designed to work with any data.frame objects or postsynth objects from tidysynthesis.
This package is still under active development ahead of our first major version, 0.1.0, which will bring API changes and new functionality. You can keep track of our work in the project's GitHub issues.
For detailed documentation, see our documentation website.
Note: library(tidysynthesis) is currently under private development but will be made public in Q1 of 2025.
install.packages("remotes")
remotes::install_github("UrbanInstitute/syntheval")
library(tidyverse)
library(syntheval)
The following examples demonstrate utility and disclosure risk metrics
using synthetic data based on the Palmer
Penguins dataset.
library(syntheval) contains three built-in data sets:
- penguins_conf: Pre-processed penguins data that were passed into the synthesizer.
- penguins_postsynth: A postsynth object synthesized from penguins using library(tidysynthesis).
- penguins_syn_df: A data frame pulled from penguins_postsynth. This is used to demonstrate how library(syntheval) works with output from a synthesizer other than library(tidysynthesis).
Functions like util_proportions()
and util_moments()
have different
behaviors for postsynth
objects and data frames. By default, they only
show synthesized variables for postsynth
objects and show all common
variables for data frames. The common_vars
and synth_vars
arguments
can change this behavior.
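For example, a sketch (assuming synth_vars is a logical flag, as the defaults above suggest; check the function documentation) that reports all common variables for a postsynth object rather than only the synthesized ones:
# hypothetical: show all common variables for a postsynth object
# (assumes synth_vars is a logical flag)
util_proportions(
  postsynth = penguins_postsynth,
  data = penguins_conf,
  synth_vars = FALSE
)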
util_proportions()
compares the proportions of classes from
categorical variables in the original data and synthetic data.
util_proportions(
postsynth = penguins_postsynth,
data = penguins_conf
)
# A tibble: 2 × 5
variable class synthetic original difference
<chr> <fct> <dbl> <dbl> <dbl>
1 sex female 0.529 0.495 0.0330
2 sex male 0.471 0.505 -0.0330
All common variables are shown when using a data frame.
util_proportions(
postsynth = penguins_syn_df,
data = penguins_conf
)
# A tibble: 8 × 5
variable class synthetic original difference
<chr> <fct> <dbl> <dbl> <dbl>
1 island Biscoe 0.465 0.489 -0.0240
2 island Dream 0.414 0.369 0.0450
3 island Torgersen 0.120 0.141 -0.0210
4 sex female 0.529 0.495 0.0330
5 sex male 0.471 0.505 -0.0330
6 species Adelie 0.459 0.438 0.0210
7 species Chinstrap 0.234 0.204 0.0300
8 species Gentoo 0.306 0.357 -0.0511
util_moments()
compares the counts, means, standard deviations,
skewnesses, and kurtoses of the original data and synthetic data.
util_moments(
postsynth = penguins_postsynth,
data = penguins_conf
)
# A tibble: 20 × 6
variable statistic original synthetic difference proportion_difference
<fct> <fct> <dbl> <dbl> <dbl> <dbl>
1 bill_length_mm count 3.33e+2 333 0 0
2 bill_length_mm mean 4.40e+1 43.5 -0.502 -0.0114
3 bill_length_mm sd 5.47e+0 5.54 0.0723 0.0132
4 bill_length_mm skewness 4.51e-2 0.0646 0.0195 0.432
5 bill_length_mm kurtosis -8.88e-1 -0.948 -0.0598 0.0674
6 bill_depth_mm count 3.33e+2 333 0 0
7 bill_depth_mm mean 1.72e+1 17.3 0.122 0.00712
8 bill_depth_mm sd 1.97e+0 1.89 -0.0762 -0.0387
9 bill_depth_mm skewness -1.49e-1 -0.278 -0.129 0.867
10 bill_depth_mm kurtosis -8.97e-1 -0.742 0.155 -0.172
11 flipper_length… count 3.33e+2 333 0 0
12 flipper_length… mean 2.01e+2 199. -1.70 -0.00847
13 flipper_length… sd 1.40e+1 13.9 -0.135 -0.00961
14 flipper_length… skewness 3.59e-1 0.611 0.253 0.705
15 flipper_length… kurtosis -9.65e-1 -0.704 0.261 -0.270
16 body_mass_g count 3.33e+2 333 0 0
17 body_mass_g mean 4.21e+3 4162. -45.0 -0.0107
18 body_mass_g sd 8.05e+2 783. -22.3 -0.0277
19 body_mass_g skewness 4.70e-1 0.655 0.185 0.394
20 body_mass_g kurtosis -7.40e-1 -0.388 0.353 -0.477
util_totals()
is similar to util_moments()
but looks at counts and
totals.
util_totals(
postsynth = penguins_postsynth,
data = penguins_conf
)
# A tibble: 8 × 6
variable statistic original synthetic difference proportion_difference
<fct> <fct> <dbl> <dbl> <dbl> <dbl>
1 bill_length_mm count 333 333 0 0
2 bill_length_mm total 14650. 14483. -167. -0.0114
3 bill_depth_mm count 333 333 0 0
4 bill_depth_mm total 5716. 5757. 40.7 0.00712
5 flipper_length_… count 333 333 0 0
6 flipper_length_… total 66922 66355 -567 -0.00847
7 body_mass_g count 333 333 0 0
8 body_mass_g total 1400950 1385950 -15000 -0.0107
util_percentiles()
compares percentiles from the original data and
synthetic data. The default percentiles are c(0.1, 0.5, 0.9)
and can
be easily overwritten.
util_percentiles(
postsynth = penguins_postsynth,
data = penguins_conf,
probs = c(0.5, 0.8)
)
# A tibble: 8 × 6
p variable original synthetic difference proportion_difference
<dbl> <fct> <dbl> <dbl> <dbl> <dbl>
1 0.5 bill_length_mm 44.5 43.5 -1 -0.0225
2 0.8 bill_length_mm 49.5 49.1 -0.440 -0.00889
3 0.5 bill_depth_mm 17.3 17.5 0.200 0.0116
4 0.8 bill_depth_mm 18.9 19.0 0.0600 0.00317
5 0.5 flipper_length_mm 197 195 -2 -0.0102
6 0.8 flipper_length_mm 215 214 -1 -0.00465
7 0.5 body_mass_g 4050 3950 -100 -0.0247
8 0.8 body_mass_g 4990 4850 -140 -0.0281
The functions are designed to work well with library(ggplot2).
util_percentiles(
postsynth = penguins_postsynth,
data = penguins_conf,
probs = seq(0.01, 0.99, 0.01)
) |>
pivot_longer(
cols = c(original, synthetic),
names_to = "source",
values_to = "value"
) |>
ggplot(aes(x = p, y = value, color = source)) +
geom_line() +
facet_wrap(~ variable, scales = "free")
util_ks_distance()
shows the Kolmogorov-Smirnov distance between the
original distribution and synthetic distribution for numeric variables.
The function also returns the point(s) of the maximum distance.
util_ks_distance(
postsynth = penguins_syn_df,
data = penguins_conf
)
# A tibble: 14 × 3
variable value D
<chr> <dbl> <dbl>
1 bill_length_mm 38.7 0.0601
2 bill_depth_mm 16.7 0.0511
3 bill_depth_mm 16.7 0.0511
4 bill_depth_mm 16.8 0.0511
5 bill_depth_mm 16.8 0.0511
6 flipper_length_mm 196. 0.0781
7 flipper_length_mm 196. 0.0781
8 flipper_length_mm 197. 0.0781
9 flipper_length_mm 197. 0.0781
10 flipper_length_mm 197. 0.0781
11 body_mass_g 4359. 0.0480
12 body_mass_g 4370. 0.0480
13 body_mass_g 4381. 0.0480
14 body_mass_g 4392. 0.0480
util_co_occurrence()
differences the lower triangles of co-occurrence
matrices calculated on numeric variables in the original data and
synthetic data.
co_occurrence <- util_co_occurrence(
postsynth = penguins_postsynth,
data = penguins_conf
)
co_occurrence$co_occurrence_difference
bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
bill_length_mm NA NA NA NA
bill_depth_mm 0 NA NA NA
flipper_length_mm 0 0 NA NA
body_mass_g 0 0 0 NA
The function returns the MAE for co-occurrences, which summarizes the average absolute error between the original and synthetic data. The function also returns the RMSE for co-occurrences, which summarizes the error while giving extra weight to large discrepancies.
co_occurrence$co_occurrence_difference_mae
[1] 0
co_occurrence$co_occurrence_difference_rmse
[1] 0
All observations have non-zero bill_length_mm, bill_depth_mm, flipper_length_mm, and body_mass_g. util_co_occurrence() is most useful for economic variables like income and wealth where 0 is a common value.
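For intuition, a co-occurrence here can be read as the share of rows where two variables are both non-zero. A minimal sketch of that idea (illustrative only, not the package's internal implementation):
# share of rows where two hypothetical variables are both non-zero
income <- c(0, 52000, 18000, 0)
wealth <- c(0, 105000, 0, 40000)
mean(income != 0 & wealth != 0)
[1] 0.25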
util_corr_fit()
differences the lower triangles of correlation
matrices calculated on numeric variables in the original data and
synthetic data.
corr_fit <- util_corr_fit(
postsynth = penguins_postsynth,
data = penguins_conf
)
round(corr_fit$correlation_difference, digits = 3)
bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
bill_length_mm NA NA NA NA
bill_depth_mm -0.069 NA NA NA
flipper_length_mm -0.003 -0.048 NA NA
body_mass_g 0.031 -0.011 -0.08 NA
The function returns the MAE for the correlation coefficients, which summarizes the average absolute error between the original and synthetic data. The function also returns the RMSE for the correlation coefficients, which summarizes the error while giving extra weight to large discrepancies.
corr_fit$correlation_difference_mae
[1] 0.04034565
corr_fit$correlation_difference_rmse
[1] 0.04922324
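Both summaries can be recovered by hand from the difference matrix, which makes their definitions concrete (a sketch; the matrix stores the pairwise differences with NAs on and above the diagonal):
# hand-check the MAE and RMSE from the returned difference matrix
diffs <- corr_fit$correlation_difference
mean(abs(diffs), na.rm = TRUE)    # matches corr_fit$correlation_difference_mae
sqrt(mean(diffs^2, na.rm = TRUE)) # matches corr_fit$correlation_difference_rmse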
util_ci_overlap() compares linear regression models estimated on the original data and synthetic data. The formula argument specifies the functional form of the regression model.
ci_overlap <- util_ci_overlap(
postsynth = penguins_postsynth,
data = penguins_conf,
formula = body_mass_g ~ bill_length_mm + sex
)
$ci_overlap summarizes each coefficient, including how much the confidence intervals overlap, whether the signs match, and whether the statistical significance matches.
ci_overlap$ci_overlap
# A tibble: 3 × 8
term overlap coef_diff std_coef_diff sign_match significance_match ss_match
<chr> <dbl> <dbl> <dbl> <lgl> <lgl> <lgl>
1 (Inter… 0.963 -27.8 -0.0978 TRUE TRUE TRUE
2 bill_l… 0.965 0.917 0.138 TRUE TRUE TRUE
3 sexmale 0.951 -14.0 -0.192 TRUE TRUE TRUE
# ℹ 1 more variable: sso_match <lgl>
$coefficient
provides detail for each coefficient and is useful for
data visualization.
ci_overlap$coefficient
# A tibble: 6 × 8
source term estimate std.error statistic p.value conf.low conf.high
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 original (Intercept) 746. 285. 2.62 9.22e- 3 186. 1306.
2 original bill_lengt… 74.0 6.67 11.1 1.51e-24 60.9 87.1
3 original sexmale 405. 72.8 5.57 5.43e- 8 262. 548.
4 synthetic (Intercept) 718. 264. 2.72 6.77e- 3 200. 1237.
5 synthetic bill_lengt… 74.9 6.25 12.0 8.98e-28 62.7 87.2
6 synthetic sexmale 391. 69.2 5.65 3.42e- 8 255. 527.
ci_overlap$coefficient |>
ggplot(aes(x = estimate, xmin = conf.low, xmax = conf.high, y = term, color = source)) +
geom_pointrange(alpha = 0.5, position = position_dodge(width = 0.5)) +
labs(
title = "The Synthesizer Recreates the Point Estimates and Confidence Intervals",
subtitle = "Regression Confidence Interval Overlap"
)
Discriminant-based metrics build models to predict if an observation is original or synthetic and then evaluate those model predictions. Ideally, it should be difficult for a model to distinguish, or discriminate, between original observations and synthetic observations.
- Any classification model that generates probabilities from library(tidymodels) can be used to generate propensities (the estimated probability that an observation is synthetic).
- By default, the code creates a training/testing split and returns separate metrics for each split. This can be turned off with split = FALSE (see the sketch after add_propensities() below).
- The code can handle hyperparameter tuning.
- After calculating propensities, the code can calculate ROC AUC, SPECKS, pMSE, and pMSE ratio.
Discriminant-based metrics are built on a discrimination object created by discrimination().
disc1 <- discrimination(postsynth = penguins_postsynth, data = penguins_conf)
Next, we use library(tidymodels)
to specify a model. We recommend the
tidymodels tutorial to learn more.
library(tidymodels)
rpart_rec <- recipe(
.source_label ~ .,
data = disc1$combined_data
)
rpart_mod <- decision_tree(cost_complexity = 0.01) |>
set_mode(mode = "classification") |>
set_engine(engine = "rpart")
Next, we fit the model to the data to generate predicted probabilities.
disc1 <- disc1 |>
add_propensities(
recipe = rpart_rec,
spec = rpart_mod
)
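As noted above, this step creates a training/testing split by default. To fit the discriminator on all rows instead, a sketch (assuming split = FALSE is passed to add_propensities(), per the overview above):
# hypothetical: skip the training/testing split
disc1_nosplit <- discrimination(postsynth = penguins_postsynth, data = penguins_conf) |>
  add_propensities(
    recipe = rpart_rec,
    spec = rpart_mod,
    split = FALSE
  )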
At this point, we can use
- add_discriminator_auc() to add the ROC AUC for the predicted probabilities
- add_specks() to add SPECKS for the predicted probabilities
- add_pmse() to add pMSE for the predicted probabilities
- add_pmse_ratio(times = 25) to add the pMSE ratio using the pMSE model and 25 bootstrap samples
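For intuition: pMSE is the mean squared deviation of the propensities from the share of synthetic rows (0.5 here, since the original and synthetic data have equal counts), and SPECKS is the Kolmogorov-Smirnov distance between the propensity distributions of original and synthetic rows. A hand computation on the pooled propensities (a sketch only; the functions below report each metric separately for the training and testing splits):
p <- disc1$propensities

# pMSE: mean squared deviation of propensities from c = 0.5
mean((p$.pred_synthetic - 0.5)^2)

# SPECKS: KS distance between propensity ECDFs by source
# (ties in the propensities may trigger a warning; the D statistic is still returned)
ks.test(
  p$.pred_synthetic[p$.source_label == "original"],
  p$.pred_synthetic[p$.source_label == "synthetic"]
)$statistic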
disc1 |>
add_discriminator_auc() |>
add_specks() |>
add_pmse() |>
add_pmse_ratio(times = 25)
$combined_data
# A tibble: 666 × 8
.source_label species island sex bill_length_mm bill_depth_mm
<fct> <fct> <fct> <fct> <dbl> <dbl>
1 original Adelie Torgersen male 39.1 18.7
2 original Adelie Torgersen female 39.5 17.4
3 original Adelie Torgersen female 40.3 18
4 original Adelie Torgersen female 36.7 19.3
5 original Adelie Torgersen male 39.3 20.6
6 original Adelie Torgersen female 38.9 17.8
7 original Adelie Torgersen male 39.2 19.6
8 original Adelie Torgersen female 41.1 17.6
9 original Adelie Torgersen male 38.6 21.2
10 original Adelie Torgersen male 34.6 21.1
# ℹ 656 more rows
# ℹ 2 more variables: flipper_length_mm <dbl>, body_mass_g <dbl>
$propensities
# A tibble: 666 × 10
.pred_synthetic .source_label .sample species island sex bill_length_mm
<dbl> <fct> <chr> <fct> <fct> <fct> <dbl>
1 0.154 original training Adelie Torgersen male 39.1
2 0.154 original training Adelie Torgersen fema… 39.5
3 0.369 original training Adelie Torgersen fema… 40.3
4 0.596 original testing Adelie Torgersen fema… 36.7
5 0.154 original training Adelie Torgersen male 39.3
6 0.154 original training Adelie Torgersen fema… 38.9
7 0.714 original training Adelie Torgersen male 39.2
8 0.369 original training Adelie Torgersen fema… 41.1
9 0.596 original testing Adelie Torgersen male 38.6
10 0.4 original training Adelie Torgersen male 34.6
# ℹ 656 more rows
# ℹ 3 more variables: bill_depth_mm <dbl>, flipper_length_mm <dbl>,
# body_mass_g <dbl>
$discriminator
══ Workflow [trained] ══════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: decision_tree()
── Preprocessor ────────────────────────────────────────────────────────────────
0 Recipe Steps
── Model ───────────────────────────────────────────────────────────────────────
n= 498
node), split, n, loss, yval, (yprob)
* denotes terminal node
1) root 498 249 synthetic (0.50000000 0.50000000)
2) bill_depth_mm>=16.65 332 153 synthetic (0.53915663 0.46084337)
4) bill_length_mm< 34.2 10 1 synthetic (0.90000000 0.10000000) *
5) bill_length_mm>=34.2 322 152 synthetic (0.52795031 0.47204969)
10) bill_length_mm>=42.6 128 49 synthetic (0.61718750 0.38281250)
20) flipper_length_mm< 194.5 48 11 synthetic (0.77083333 0.22916667) *
21) flipper_length_mm>=194.5 80 38 synthetic (0.52500000 0.47500000)
42) bill_length_mm< 52.45 73 32 synthetic (0.56164384 0.43835616)
84) bill_length_mm>=44.25 64 25 synthetic (0.60937500 0.39062500)
168) body_mass_g>=4175 27 6 synthetic (0.77777778 0.22222222) *
169) body_mass_g< 4175 37 18 original (0.48648649 0.51351351)
338) bill_length_mm< 45.65 8 2 synthetic (0.75000000 0.25000000) *
339) bill_length_mm>=45.65 29 12 original (0.41379310 0.58620690) *
85) bill_length_mm< 44.25 9 2 original (0.22222222 0.77777778) *
43) bill_length_mm>=52.45 7 1 original (0.14285714 0.85714286) *
11) bill_length_mm< 42.6 194 91 original (0.46907216 0.53092784)
22) bill_length_mm< 39.65 129 62 synthetic (0.51937984 0.48062016)
44) flipper_length_mm>=180.5 121 56 synthetic (0.53719008 0.46280992)
88) bill_length_mm>=36.1 96 41 synthetic (0.57291667 0.42708333)
176) island=Biscoe 29 9 synthetic (0.68965517 0.31034483) *
177) island=Dream,Torgersen 67 32 synthetic (0.52238806 0.47761194)
354) bill_length_mm< 38.75 47 19 synthetic (0.59574468 0.40425532) *
355) bill_length_mm>=38.75 20 7 original (0.35000000 0.65000000)
710) flipper_length_mm>=190.5 7 2 synthetic (0.71428571 0.28571429) *
711) flipper_length_mm< 190.5 13 2 original (0.15384615 0.84615385) *
89) bill_length_mm< 36.1 25 10 original (0.40000000 0.60000000) *
45) flipper_length_mm< 180.5 8 2 original (0.25000000 0.75000000) *
23) bill_length_mm>=39.65 65 24 original (0.36923077 0.63076923) *
3) bill_depth_mm< 16.65 166 70 original (0.42168675 0.57831325)
6) bill_length_mm>=51.35 10 2 synthetic (0.80000000 0.20000000) *
7) bill_length_mm< 51.35 156 62 original (0.39743590 0.60256410)
14) body_mass_g>=3125 149 62 original (0.41610738 0.58389262)
28) body_mass_g< 4387.5 34 15 synthetic (0.55882353 0.44117647)
56) bill_depth_mm>=13.95 25 8 synthetic (0.68000000 0.32000000) *
57) bill_depth_mm< 13.95 9 2 original (0.22222222 0.77777778) *
29) body_mass_g>=4387.5 115 43 original (0.37391304 0.62608696)
58) body_mass_g>=4612.5 104 42 original (0.40384615 0.59615385)
116) bill_depth_mm< 14.15 18 6 synthetic (0.66666667 0.33333333) *
117) bill_depth_mm>=14.15 86 30 original (0.34883721 0.65116279) *
59) body_mass_g< 4612.5 11 1 original (0.09090909 0.90909091) *
15) body_mass_g< 3125 7 0 original (0.00000000 1.00000000) *
$discriminator_auc
# A tibble: 2 × 4
.sample .metric .estimator .estimate
<fct> <chr> <chr> <dbl>
1 training roc_auc binary 0.742
2 testing roc_auc binary 0.425
$pmse
# A tibble: 2 × 4
.source .pmse .null_pmse .pmse_ratio
<fct> <dbl> <dbl> <dbl>
1 training 0.0466 0.0320 1.46
2 testing 0.0437 0.0327 1.34
$specks
# A tibble: 2 × 2
.source .specks
<fct> <dbl>
1 training 0.390
2 testing 0.143
attr(,"class")
[1] "discrimination"
Finally, we can look at variable importance and the decision tree from our discriminator.
library(vip)
library(rpart.plot)
disc1$discriminator |>
extract_fit_parsnip() |>
vip()
disc1$discriminator$fit$fit$fit |>
prp()
Warning: Cannot retrieve the data used to build the model (so cannot determine roundint and is.binary for the variables).
To silence this warning:
Call prp with roundint=FALSE,
or rebuild the rpart model with model=TRUE.
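Following the warning's advice, passing roundint = FALSE silences it:
disc1$discriminator$fit$fit$fit |>
  prp(roundint = FALSE)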
Let’s repeat the workflow from above with LASSO logistic regression and hyperparameter tuning.
# create discrimination
disc2 <- discrimination(postsynth = penguins_postsynth, data = penguins_conf)
# create a recipe that includes 2nd-degree polynomials, dummy variables, and
# standardization
lasso_rec <- recipe(
.source_label ~ .,
data = disc2$combined_data
) |>
step_poly(all_numeric_predictors(), degree = 2) |>
step_dummy(all_nominal_predictors()) |>
step_normalize(all_predictors())
# create the model
lasso_mod <- logistic_reg(
penalty = tune(),
mixture = 1
) |>
set_engine(engine = "glmnet") |>
set_mode(mode = "classification")
# create a tuning grid
lasso_grid <- grid_regular(penalty(), levels = 10)
# add the propensities
disc2 <- disc2 |>
add_propensities_tuned(
recipe = lasso_rec,
spec = lasso_mod,
grid = lasso_grid
)
# calculate metrics
disc2 |>
add_discriminator_auc() |>
add_specks() |>
add_pmse() |>
add_pmse_ratio(times = 25)
$combined_data
# A tibble: 666 × 8
.source_label species island sex bill_length_mm bill_depth_mm
<fct> <fct> <fct> <fct> <dbl> <dbl>
1 original Adelie Torgersen male 39.1 18.7
2 original Adelie Torgersen female 39.5 17.4
3 original Adelie Torgersen female 40.3 18
4 original Adelie Torgersen female 36.7 19.3
5 original Adelie Torgersen male 39.3 20.6
6 original Adelie Torgersen female 38.9 17.8
7 original Adelie Torgersen male 39.2 19.6
8 original Adelie Torgersen female 41.1 17.6
9 original Adelie Torgersen male 38.6 21.2
10 original Adelie Torgersen male 34.6 21.1
# ℹ 656 more rows
# ℹ 2 more variables: flipper_length_mm <dbl>, body_mass_g <dbl>
$propensities
# A tibble: 666 × 10
.pred_synthetic .source_label .sample species island sex bill_length_mm
<dbl> <fct> <chr> <fct> <fct> <fct> <dbl>
1 0.409 original training Adelie Torgersen male 39.1
2 0.455 original training Adelie Torgersen fema… 39.5
3 0.360 original testing Adelie Torgersen fema… 40.3
4 0.496 original training Adelie Torgersen fema… 36.7
5 0.420 original training Adelie Torgersen male 39.3
6 0.457 original training Adelie Torgersen fema… 38.9
7 0.539 original training Adelie Torgersen male 39.2
8 0.354 original training Adelie Torgersen fema… 41.1
9 0.483 original training Adelie Torgersen male 38.6
10 0.681 original testing Adelie Torgersen male 34.6
# ℹ 656 more rows
# ℹ 3 more variables: bill_depth_mm <dbl>, flipper_length_mm <dbl>,
# body_mass_g <dbl>
$discriminator
══ Workflow [trained] ══════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: logistic_reg()
── Preprocessor ────────────────────────────────────────────────────────────────
3 Recipe Steps
• step_poly()
• step_dummy()
• step_normalize()
── Model ───────────────────────────────────────────────────────────────────────
Call: glmnet::glmnet(x = maybe_matrix(x), y = y, family = "binomial", alpha = ~1)
Df %Dev Lambda
1 0 0.00 0.036370
2 3 0.07 0.033140
3 3 0.18 0.030200
4 3 0.27 0.027520
5 3 0.35 0.025070
6 3 0.41 0.022840
7 5 0.50 0.020820
8 5 0.63 0.018970
9 5 0.75 0.017280
10 5 0.86 0.015750
11 5 0.95 0.014350
12 5 1.02 0.013070
13 4 1.08 0.011910
14 4 1.13 0.010850
15 5 1.16 0.009889
16 5 1.29 0.009010
17 6 1.40 0.008210
18 7 1.50 0.007481
19 8 1.58 0.006816
20 8 1.65 0.006210
21 9 1.72 0.005659
22 10 1.79 0.005156
23 10 1.86 0.004698
24 10 1.91 0.004281
25 10 1.96 0.003900
26 10 2.00 0.003554
27 10 2.04 0.003238
28 10 2.06 0.002950
29 10 2.09 0.002688
30 10 2.11 0.002450
31 10 2.12 0.002232
32 11 2.14 0.002034
33 11 2.16 0.001853
34 11 2.18 0.001688
35 11 2.19 0.001538
36 11 2.20 0.001402
37 11 2.21 0.001277
38 11 2.22 0.001164
39 11 2.22 0.001060
40 11 2.23 0.000966
41 11 2.23 0.000880
42 12 2.24 0.000802
43 12 2.24 0.000731
44 12 2.24 0.000666
45 12 2.24 0.000607
46 13 2.25 0.000553
...
and 6 more lines.
$discriminator_auc
# A tibble: 2 × 4
.sample .metric .estimator .estimate
<fct> <chr> <chr> <dbl>
1 training roc_auc binary 0.601
2 testing roc_auc binary 0.475
$pmse
# A tibble: 2 × 4
.source .pmse .null_pmse .pmse_ratio
<fct> <dbl> <dbl> <dbl>
1 training 0.00732 0.00736 0.996
2 testing 0.00829 0.00745 1.11
$specks
# A tibble: 2 × 2
.source .specks
<fct> <dbl>
1 training 0.157
2 testing 0.167
attr(,"class")
[1] "discrimination"
# look at variable importance
library(vip)
disc2$discriminator |>
extract_fit_parsnip() |>
vip()
Many utility metrics include a group_by argument to calculate the metrics within groups. For example, this code calculates moments by species.
util_moments(
postsynth = penguins_postsynth,
data = penguins_conf,
group_by = species
)
# A tibble: 60 × 7
species variable statistic original synthetic difference
<fct> <fct> <fct> <dbl> <dbl> <dbl>
1 Adelie bill_length_mm count 146 153 7
2 Chinstrap bill_length_mm count 68 78 10
3 Gentoo bill_length_mm count 119 102 -17
4 Adelie bill_length_mm mean 38.8 38.5 -0.294
5 Chinstrap bill_length_mm mean 48.8 47.3 -1.56
6 Gentoo bill_length_mm mean 47.6 48.0 0.475
7 Adelie bill_length_mm sd 2.66 2.93 0.267
8 Chinstrap bill_length_mm sd 3.34 3.07 -0.273
9 Gentoo bill_length_mm sd 3.11 3.40 0.299
10 Adelie bill_length_mm skewness 0.156 0.505 0.349
# ℹ 50 more rows
# ℹ 1 more variable: proportion_difference <dbl>
Many utility metrics include a weight_var
argument to use weighted
statistics during calculation. For example, this code weights the
moments by the body weight of the penguins.
util_moments(
postsynth = penguins_postsynth,
data = penguins_conf,
weight_var = body_mass_g
)
# A tibble: 20 × 6
variable statistic original synthetic difference proportion_difference
<fct> <fct> <dbl> <dbl> <dbl> <dbl>
1 bill_length_mm count 1.40e+6 1.39e+6 -1.5 e+4 -0.0107
2 bill_length_mm mean 4.46e+1 4.41e+1 -4.71e-1 -0.0106
3 bill_length_mm sd 5.38e+0 5.54e+0 1.60e-1 0.0297
4 bill_length_mm skewness -7.39e-2 -5.74e-2 1.64e-2 -0.222
5 bill_length_mm kurtosis -7.89e-1 -9.15e-1 -1.26e-1 0.160
6 bill_depth_mm count 1.40e+6 1.39e+6 -1.5 e+4 -0.0107
7 bill_depth_mm mean 1.70e+1 1.71e+1 1.28e-1 0.00754
8 bill_depth_mm sd 2.01e+0 1.94e+0 -7.07e-2 -0.0351
9 bill_depth_mm skewness 1.05e-2 -1.37e-1 -1.47e-1 -14.0
10 bill_depth_mm kurtosis -1.00e+0 -9.17e-1 8.70e-2 -0.0867
11 flipper_length… count 1.40e+6 1.39e+6 -1.5 e+4 -0.0107
12 flipper_length… mean 2.03e+2 2.01e+2 -1.97e+0 -0.00970
13 flipper_length… sd 1.44e+1 1.46e+1 1.60e-1 0.0111
14 flipper_length… skewness 1.48e-1 3.99e-1 2.51e-1 1.70
15 flipper_length… kurtosis -1.14e+0 -1.04e+0 1.08e-1 -0.0940
16 body_mass_g count 1.40e+6 1.39e+6 -1.5 e+4 -0.0107
17 body_mass_g mean 4.36e+3 4.31e+3 -5.19e+1 -0.0119
18 body_mass_g sd 8.26e+2 8.17e+2 -9.84e+0 -0.0119
19 body_mass_g skewness 2.69e-1 4.64e-1 1.95e-1 0.724
20 body_mass_g kurtosis -9.70e-1 -7.26e-1 2.44e-1 -0.251
Most commonly, weight_var
is used when synthesizing data from surveys.
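The grouping and weighting options can also be combined, for example to compare weighted moments within each species (a sketch, assuming the two arguments compose as they behave separately):
util_moments(
  postsynth = penguins_postsynth,
  data = penguins_conf,
  group_by = species,
  weight_var = body_mass_g
)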
Contact Aaron R. Williams with feedback or questions.