This R package is a simple, user-friendly tool for train-test splitting and k-fold cross-validation of classification data using various classification algorithms from popular R packages. The function used from each package for every supported algorithm is listed below:

- `lda()` from the MASS package for Linear Discriminant Analysis
- `qda()` from the MASS package for Quadratic Discriminant Analysis
- `glm()` from the stats package with `family = "binomial"` for Logistic Regression
- `svm()` from the e1071 package for Support Vector Machines
- `naive_bayes()` from the naivebayes package for Naive Bayes
- `nnet()` from the nnet package for Artificial Neural Network
- `train.kknn()` from the kknn package for K-Nearest Neighbors
- `rpart()` from the rpart package for Decision Trees
- `randomForest()` from the randomForest package for Random Forest
- `multinom()` from the nnet package for Multinomial Regression
- `xgb.train()` from the xgboost package for Gradient Boosting Machines
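For orientation, the first two of these underlying calls look roughly like the following when used directly (an illustration only, not vswift code; `classCV()` constructs and manages these calls for you, and the exact arguments it forwards are not shown here):

```r
library(MASS)  # provides lda() and qda()

# Direct use of the wrapped functions on the iris data
lda_fit <- lda(Species ~ ., data = iris)
qda_fit <- qda(Species ~ ., data = iris)

# Both return objects whose class predictions can be extracted with predict()
head(predict(lda_fit, iris)$class)
head(predict(qda_fit, iris)$class)
```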
- Versatile Data Splitting: Perform train-test splits or k-fold cross-validation on your classification data.
- Support for Popular Algorithms: Choose from a wide range of classification algorithms such as Linear Discriminant Analysis, Quadratic Discriminant Analysis, Logistic Regression, Support Vector Machines, Naive Bayes, Artificial Neural Networks, K-Nearest Neighbors, Decision Trees, Random Forest, Multinomial Logistic Regression, and Gradient Boosting Machines. Additionally, multiple algorithms can be specified in a single function call.
- Stratified Sampling Option: Ensure representative class distribution using stratified sampling based on class proportions (a minimal illustration of the idea follows this list).
- Handling Unseen Categorical Levels: Automatically exclude observations from the validation/test set with categories not seen during model training. This is particularly helpful for specific algorithms that might throw errors in such cases.
- Model Saving Capabilities: Save all models utilized for training and testing.
- Dataset Saving Options: Preserve split datasets and folds.
- Model Creation: Easily create and save final models.
- Missing Data Imputation: Choose from two imputation methods, Bagged Tree Imputation and KNN Imputation, which use the `step_bag_impute()` and `step_knn_impute()` functions from the recipes package, respectively. An imputation model is fit on the training data and then used to predict missing values in the predictors of both the training data and the validation data, which prevents data leakage. Rows with a missing target variable are removed, and the target is excluded from the predictors during imputation (see the recipes sketch following this list).
- Performance Metrics: View performance metrics in the console and generate/save plots for key metrics, including overall classification accuracy, as well as f-score, precision, and recall for each class in the target variable across train-test split and k-fold cross-validation.
- Automatic Numerical Encoding: Classes within the target variable are automatically numerically encoded for algorithms such as Logistic Regression and Gradient Boosted Models that require numerical inputs for the target variable.
- Parallel Processing: Use the `n_cores` and `future.seed` parameters in `parallel_configs` to set the number of cores used to process multiple folds simultaneously. Only available when cross-validation is specified.
- Minimal Code Requirement: Access desired information quickly and efficiently with just a few lines of code.
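To make the stratified sampling option concrete, here is a minimal sketch of per-class proportional sampling in base R (not vswift's internal code; the helper name `stratified_train_idx` is made up for illustration):

```r
# Sample a fixed proportion of row indices within each class so the
# training set keeps the original class proportions
stratified_train_idx <- function(y, prop = 0.8, seed = 50) {
  set.seed(seed)
  unlist(lapply(split(seq_along(y), y), function(idx) {
    sample(idx, size = floor(length(idx) * prop))
  }))
}

train_idx <- stratified_train_idx(iris$Species, prop = 0.8)
prop.table(table(iris$Species[train_idx]))  # ~1/3 per class, as in the full data
```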
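And a minimal sketch of the leakage-safe imputation pattern described in the Missing Data Imputation bullet, using the recipes package directly (illustration only; the bagged-tree imputation step is exported as `step_impute_bag()` in recent recipes releases, and vswift performs the equivalent preprocessing for you):

```r
library(recipes)

# Toy split with a missing predictor value injected for demonstration
train_df <- iris[1:120, ]
test_df  <- iris[121:150, ]
train_df$Sepal.Length[1] <- NA

# The imputation model is learned from the training rows only, then applied to both sets
rec <- recipe(Species ~ ., data = train_df) |>
  step_impute_bag(all_predictors())              # bagged-tree imputation of predictors

prepped   <- prep(rec, training = train_df)
train_imp <- bake(prepped, new_data = train_df)  # NA in row 1 is now filled in
test_imp  <- bake(prepped, new_data = test_df)   # imputed using the training-data model only
```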
# Install 'remotes' to install packages from GitHub
install.packages("remotes")
# Install the development version of the 'vswift' package from the main branch
remotes::install_github("donishadsmith/vswift/pkg/vswift", ref = "main")
# Display documentation for the 'vswift' package
help(package = "vswift")
# Install 'remotes' to install packages from GitHub
install.packages("remotes")
# Install the 0.2.5 release of the 'vswift' package
remotes::install_url("https://github.com/donishadsmith/vswift/releases/download/0.2.5/vswift_0.2.5.tar.gz")
# Display documentation for the 'vswift' package
help(package = "vswift")
The type of classification algorithm is specified using the `models` parameter in the `classCV()` function. Acceptable inputs for the `models` parameter include:
- "lda" for Linear Discriminant Analysis
- "qda" for Quadratic Discriminant Analysis
- "logistic" for Logistic Regression
- "svm" for Support Vector Machines
- "naivebayes" for Naive Bayes
- "ann" for Artificial Neural Network
- "knn" for K-Nearest Neighbors
- "decisiontree" for Decision Trees
- "randomforest" for Random Forest
- "multinom" for Multinomial Regression
- "gbm" for Gradient Boosting Machines
# Load the package
library(vswift)
# Perform train-test split and k-fold cross-validation with stratified sampling
results <- classCV(data = iris,
target = "Species",
models = "lda",
train_params = list(split = 0.8, n_folds = 5, stratified = TRUE, random_seed = 50)
)
# Also valid; the target can be specified by its column index
results <- classCV(data = iris,
target = 5,
models = "lda",
train_params = list(split = 0.8, n_folds = 5, stratified = TRUE, random_seed = 50))
# Using the formula method is also valid
results <- classCV(formula = Species ~ .,
data = iris,
models = "lda",
train_params = list(split = 0.8, n_folds = 5, stratified = TRUE, random_seed = 50))
`classCV()` produces a vswift object, which can be used for custom printing and plotting of performance metrics by using the `print()` and `plot()` functions.
class(results)
Output
[1] "vswift"
# Print parameter information and model evaluation metrics
print(results, parameters = TRUE, metrics = TRUE)
Output:
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Model: Linear Discriminant Analysis
Formula: Species ~ .
Number of Features: 4
Classes: setosa, versicolor, virginica
Training Parameters: list(split = 0.8, n_folds = 5, stratified = TRUE, random_seed = 50, standardize = FALSE, remove_obs = FALSE)
Model Parameters: list(map_args = NULL, final_model = FALSE)
Missing Data: 0
Effective Sample Size: 150
Imputation Parameters: list(method = NULL, args = NULL)
Parallel Configs: list(n_cores = NULL, future.seed = NULL)
Training
_ _ _ _ _ _ _ _
Classification Accuracy: 0.98
Class: Precision: Recall: F-Score:
setosa 1.00 1.00 1.00
versicolor 1.00 0.95 0.97
virginica 0.95 1.00 0.98
Test
_ _ _ _
Classification Accuracy: 0.97
Class: Precision: Recall: F-Score:
setosa 1.00 1.00 1.00
versicolor 0.91 1.00 0.95
virginica 1.00 0.90 0.95
K-fold CV
_ _ _ _ _ _ _ _ _
Average Classification Accuracy: 0.98 (0.04)
Class: Average Precision: Average Recall: Average F-score:
setosa 1.00 (0.00) 1.00 (0.00) 1.00 (0.00)
versicolor 0.98 (0.05) 0.96 (0.09) 0.97 (0.07)
virginica 0.96 (0.08) 0.98 (0.04) 0.97 (0.06)
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# Plot model evaluation metrics
plot(results, split = TRUE, cv = TRUE, save_plots = TRUE, path = getwd())
The number of predictors can be modified using the `predictors` or `formula` parameters:
# Use knn on the iris dataset with the first, third, and fourth columns as predictors; also pass an additional argument, `ks = 5`, which is used by train.kknn() from the kknn package
results <- classCV(data = iris,
target = "Species",
predictors = c("Sepal.Length","Petal.Length","Petal.Width"),
models = "knn",
train_params = list(split = 0.8, n_folds = 5, stratified = TRUE, random_seed = 50),
ks = 5)
# All configurations below are valid and will produce the same output
args <- list(knn = list(ks = 5))
results <- classCV(data = iris,
target = 5,
predictors = c(1,3,4),
models = "knn",
train_params = list(split = 0.8, n_folds = 5, stratified = TRUE, random_seed = 50),
model_params = list(map_args = args))
results <- classCV(formula = Species ~ Sepal.Length + Petal.Length + Petal.Width,
data = iris,
models = "knn",
train_params = list(split = 0.8, n_folds = 5, stratified = TRUE, random_seed = 50),
ks = 5)
print(results)
Output
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Model: K-Nearest Neighbors
Formula: Species ~ Sepal.Length + Petal.Length + Petal.Width
Number of Features: 3
Classes: setosa, versicolor, virginica
Training Parameters: list(split = 0.8, n_folds = 5, stratified = TRUE, random_seed = 50, standardize = FALSE, remove_obs = FALSE)
Model Parameters: list(map_args = list(knn = list(ks = 5)), final_model = FALSE)
Missing Data: 0
Effective Sample Size: 150
Imputation Parameters: list(method = NULL, args = NULL)
Parallel Configs: list(n_cores = NULL, future.seed = NULL)
Training
_ _ _ _ _ _ _ _
Classification Accuracy: 0.97
Class: Precision: Recall: F-Score:
setosa 1.00 1.00 1.00
versicolor 0.95 0.95 0.95
virginica 0.95 0.95 0.95
Test
_ _ _ _
Classification Accuracy: 0.97
Class: Precision: Recall: F-Score:
setosa 1.00 1.00 1.00
versicolor 0.91 1.00 0.95
virginica 1.00 0.90 0.95
K-fold CV
_ _ _ _ _ _ _ _ _
Average Classification Accuracy: 0.96 (0.05)
Class: Average Precision: Average Recall: Average F-score:
setosa 1.00 (0.00) 1.00 (0.00) 1.00 (0.00)
versicolor 0.92 (0.08) 0.96 (0.09) 0.94 (0.08)
virginica 0.96 (0.09) 0.92 (0.08) 0.94 (0.08)
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
The contents of the vswift object can be displayed by converting its class to a list and using R's base `print()` function.
class(results) <- "list"
print(results)
Output
```
$configs
$configs$formula
Species ~ Sepal.Length + Petal.Length + Petal.Width
$configs$n_features
[1] 3
$configs$models
[1] "knn"
$configs$model_params
$configs$model_params$map_args
$configs$model_params$map_args$knn
$configs$model_params$map_args$knn$ks
[1] 5
$configs$model_params$final_model
[1] FALSE
$configs$model_params$logistic_threshold
NULL
$configs$train_params
$configs$train_params$split
[1] 0.8
$configs$train_params$n_folds
[1] 5
$configs$train_params$stratified
[1] TRUE
$configs$train_params$random_seed
[1] 50
$configs$train_params$standardize
[1] FALSE
$configs$train_params$remove_obs
[1] FALSE
$configs$missing_data
[1] 0
$configs$effective_sample_size
[1] 150
$configs$impute_params
$configs$impute_params$method
NULL
$configs$impute_params$args
NULL
$configs$parallel_configs
$configs$parallel_configs$n_cores
NULL
$configs$parallel_configs$future.seed
NULL
$configs$save
$configs$save$models
[1] FALSE
$configs$save$data
[1] FALSE
$class_summary
$class_summary$classes
[1] "setosa" "versicolor" "virginica"
$class_summary$proportions
target_vector
setosa versicolor virginica
0.3333333 0.3333333 0.3333333
$class_summary$indices
$class_summary$indices$setosa
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
$class_summary$indices$versicolor
[1] 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94
[45] 95 96 97 98 99 100
$class_summary$indices$virginica
[1] 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144
[45] 145 146 147 148 149 150
$data_partitions
$data_partitions$indices
$data_partitions$indices$split
$data_partitions$indices$split$train
[1] 48 11 31 50 46 3 8 16 18 27 21 41 20 37 34 7 28 29 26 10 25 13 2 30 36 15 47 49 35 40 12 42 4 6 22 44 17 5 39 33 69 66 61 70
[45] 81 74 88 93 91 87 56 63 52 55 73 72 80 97 62 94 84 86 65 99 98 53 57 58 90 51 96 75 60 78 92 59 89 85 79 71 130 128 131 133 115 150 124 144
[89] 125 123 110 138 119 101 132 111 143 112 145 139 104 102 121 140 127 105 136 135 103 122 109 141 120 117 113 107 126 118 148 114
$data_partitions$indices$split$test
[1] 19 1 24 45 9 43 32 38 23 14 83 67 54 64 82 95 68 76 77 100 134 146 116 149 106 147 142 129 108 137
$data_partitions$indices$cv
$data_partitions$indices$cv$fold1
[1] 48 11 31 50 46 3 8 16 18 27 71 77 70 87 84 57 78 79 76 60 132 113 134 125 148 131 110 143 107 121
$data_partitions$indices$cv$fold2
[1] 23 5 29 7 10 44 21 47 33 22 98 51 68 72 67 73 100 63 74 75 139 129 147 146 106 116 102 105 115 128
$data_partitions$indices$cv$fold3
[1] 34 37 40 35 20 2 38 26 28 19 94 99 54 59 61 58 52 53 88 96 118 124 109 141 137 140 127 104 117 103
$data_partitions$indices$cv$fold4
[1] 4 9 25 49 6 36 30 12 1 14 83 93 82 66 62 56 55 97 80 91 112 108 126 145 114 150 130 111 135 119
$data_partitions$indices$cv$fold5
[1] 43 39 13 42 41 32 24 17 15 45 95 85 90 69 92 64 89 81 86 65 142 123 120 122 133 149 144 138 101 136
$data_partitions$proportions
$data_partitions$proportions$split
$data_partitions$proportions$split$train
setosa versicolor virginica
0.3333333 0.3333333 0.3333333
$data_partitions$proportions$split$test
setosa versicolor virginica
0.3333333 0.3333333 0.3333333
$data_partitions$proportions$cv
$data_partitions$proportions$cv$fold1
setosa versicolor virginica
0.3333333 0.3333333 0.3333333
$data_partitions$proportions$cv$fold2
setosa versicolor virginica
0.3333333 0.3333333 0.3333333
$data_partitions$proportions$cv$fold3
setosa versicolor virginica
0.3333333 0.3333333 0.3333333
$data_partitions$proportions$cv$fold4
setosa versicolor virginica
0.3333333 0.3333333 0.3333333
$data_partitions$proportions$cv$fold5
setosa versicolor virginica
0.3333333 0.3333333 0.3333333
$metrics
$metrics$knn
$metrics$knn$split
Set Classification Accuracy Class: setosa Precision Class: setosa Recall Class: setosa F-Score Class: versicolor Precision Class: versicolor Recall Class: versicolor F-Score
1 Training 0.9666667 1 1 1 0.9500000 0.95 0.950000
2 Test 0.9666667 1 1 1 0.9090909 1.00 0.952381
Class: virginica Precision Class: virginica Recall Class: virginica F-Score
1 0.95 0.95 0.9500000
2 1.00 0.90 0.9473684
$metrics$knn$cv
Fold Classification Accuracy Class: setosa Precision Class: setosa Recall Class: setosa F-Score Class: versicolor Precision Class: versicolor Recall
1 Fold 1 0.86666667 1 1 1 0.80000000 0.80000000
2 Fold 2 0.96666667 1 1 1 0.90909091 1.00000000
3 Fold 3 1.00000000 1 1 1 1.00000000 1.00000000
4 Fold 4 1.00000000 1 1 1 1.00000000 1.00000000
5 Fold 5 0.96666667 1 1 1 0.90909091 1.00000000
6 Mean CV: 0.96000000 1 1 1 0.92363636 0.96000000
7 Standard Deviation CV: 0.05477226 0 0 0 0.08272228 0.08944272
8 Standard Error CV: 0.02449490 0 0 0 0.03699453 0.04000000
Class: versicolor F-Score Class: virginica Precision Class: virginica Recall Class: virginica F-Score
1 0.80000000 0.80000000 0.80000000 0.80000000
2 0.95238095 1.00000000 0.90000000 0.94736842
3 1.00000000 1.00000000 1.00000000 1.00000000
4 1.00000000 1.00000000 1.00000000 1.00000000
5 0.95238095 1.00000000 0.90000000 0.94736842
6 0.94095238 0.96000000 0.92000000 0.93894737
7 0.08231349 0.08944272 0.08366600 0.08201074
8 0.03681171 0.04000000 0.03741657 0.03667632
```
Note: This example uses the internet advertisement data from the UCI Machine Learning Repository.
# Set the URL for the internet advertisement data from the UCI Machine Learning Repository. This data has 3,278 instances and 1,558 attributes.
url <- "https://archive.ics.uci.edu/static/public/51/internet+advertisements.zip"
# Set file destination
dest_file <- file.path(getwd(),"ad.zip")
# Download zip file
download.file(url,dest_file)
# Unzip file
unzip(zipfile = dest_file , files = "ad.data")
# Read data
ad_data <- read.csv("ad.data")
# Load in vswift
library(vswift)
# Create arguments variable to tune parameters for multiple models
args <- list("knn" = list(ks = 5),
"gbm" = list(params = list(booster = "gbtree", objective = "reg:logistic",
lambda = 0.0003, alpha = 0.0003, eta = 0.8,
max_depth = 6), nrounds = 10))
print("Without Parallel Processing:")
# Obtain start time
start <- proc.time()
# Run the models without parallel processing
results <- classCV(data = ad_data,
target = "ad.",
models = c("knn","svm","decisiontree","gbm"),
train_params = list(split = 0.8, n_folds = 5, random_seed = 50),
model_params = list(map_args = args)
)
# Get end time
end <- proc.time() - start
# Print time
print(end)
print("Parallel Processing:")
# Adjust the maximum object size that can be passed to workers during parallel processing; ~1.2 GB
options(future.globals.maxSize = 1200 * 1024^2)
# Obtain start time
start_par <- proc.time()
# Run model using parallel processing with 4 cores
results <- classCV(data = ad_data,
target = "ad.",
models = c("knn","svm","decisiontree","gbm"),
train_params = list(split = 0.8, n_folds = 5, random_seed = 50),
model_params = list(map_args = args),
parallel_configs = list(n_cores = 4, future.seed = 100)
)
# Obtain end time
end_par <- proc.time() - start_par
# Print time
print(end_par)
Output:
[1] "Without Parallel Processing:"
Warning message:
In .create_dictionary(preprocessed_data = preprocessed_data, :
classes are now encoded: ad. = 0, nonad. = 1
user system elapsed
202.89 1.20 212.01
[1] "Parallel Processing:"
Warning message:
In .create_dictionary(preprocessed_data = preprocessed_data, :
classes are now encoded: ad. = 0, nonad. = 1
user system elapsed
1.83 8.83 142.35
# Print parameter information and model evaluation metrics; if the number of features > 20, the target is displayed in place of the formula
print(results, models = c("gbm", "knn"))
Output:
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Model: Gradient Boosted Machine
Target: ad.
Number of Features: 1558
Classes: ad., nonad.
Training Parameters: list(split = 0.8, n_folds = 5, random_seed = 50, stratified = FALSE, standardize = FALSE, remove_obs = FALSE)
Model Parameters: list(map_args = list(gbm = list(params = list(booster = "gbtree", objective = "reg:logistic", lambda = 3e-04, alpha = 3e-04, eta = 0.8, max_depth = 6), nrounds = 10)), logistic_threshold = 0.5, final_model = FALSE)
Missing Data: 0
Effective Sample Size: 3278
Imputation Parameters: list(method = NULL, args = NULL)
Parallel Configs: list(n_cores = 4, future.seed = 100)
Training
_ _ _ _ _ _ _ _
Classification Accuracy: 0.99
Class: Precision: Recall: F-Score:
ad. 0.99 0.93 0.96
nonad. 0.99 1.00 0.99
Test
_ _ _ _
Classification Accuracy: 0.98
Class: Precision: Recall: F-Score:
ad. 0.97 0.89 0.93
nonad. 0.98 0.99 0.99
K-fold CV
_ _ _ _ _ _ _ _ _
Average Classification Accuracy: 0.98 (0.01)
Class: Average Precision: Average Recall: Average F-score:
ad. 0.95 (0.02) 0.88 (0.04) 0.91 (0.03)
nonad. 0.98 (0.01) 0.99 (0.00) 0.99 (0.00)
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Model: K-Nearest Neighbors
Target: ad.
Number of Features: 1558
Classes: ad., nonad.
Training Parameters: list(split = 0.8, n_folds = 5, random_seed = 50, stratified = FALSE, standardize = FALSE, remove_obs = FALSE)
Model Parameters: list(map_args = list(knn = list(ks = 5)), logistic_threshold = 0.5, final_model = FALSE)
Missing Data: 0
Effective Sample Size: 3278
Imputation Parameters: list(method = NULL, args = NULL)
Parallel Configs: list(n_cores = 4, future.seed = 100)
Training
_ _ _ _ _ _ _ _
Classification Accuracy: 1.00
Class: Precision: Recall: F-Score:
ad. 1.00 0.99 1.00
nonad. 1.00 1.00 1.00
Test
_ _ _ _
Classification Accuracy: 0.96
Class: Precision: Recall: F-Score:
ad. 0.89 0.80 0.84
nonad. 0.97 0.98 0.98
K-fold CV
_ _ _ _ _ _ _ _ _
Average Classification Accuracy: 0.93 (0.01)
Class: Average Precision: Average Recall: Average F-score:
ad. 0.71 (0.07) 0.82 (0.01) 0.76 (0.04)
nonad. 0.97 (0.00) 0.95 (0.02) 0.96 (0.01)
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# Plot results
plot(results, models = "gbm" , save_plots = TRUE,
class_names = "ad.", metrics = c("precision", "recall"))
This package was initially inspired by topepo's caret package.