Lump a numeric variable into categorical groups using the ‘dumblump’ algorithm:
- Sort numbers in ascending order
- For each number, check its distance from the previous number (the closest lower number in the dataset).
- If distance >= threshold, define a new group. If distance < threshold, ‘lump’ the number into the group of the previous number.
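The steps above can be sketched in a few lines of base R. This is only an illustration of the described rule, not the package's actual source: `lump_sketch` is a hypothetical name, and the sketch assumes identical values always share a group and ignores edge cases such as `NA` or empty input.

```r
# Hypothetical re-implementation for illustration; not dumblump's own code.
lump_sketch <- function(x, threshold) {
  # Sort the distinct values so that ties always land in the same group
  vals <- sort(unique(x))
  # Distance from each value to the closest lower value
  gaps <- diff(vals)
  # The first value opens group 1; every gap >= threshold opens a new group
  group_id <- cumsum(c(TRUE, gaps >= threshold))
  # Map the group ids back onto the original (unsorted) order of x
  paste("Group", group_id[match(x, vals)])
}

lump_sketch(c(1, 1, 2, 5, 5, 6, 1, 12, 12), threshold = 1)
```

Using `cumsum()` over the "gap is large enough" flags avoids an explicit loop: each `TRUE` bumps the running group counter by one.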
Disadvantages of this method
1. Numbers of substantially different scales can end up in a single group. E.g. if you have the set of numbers 1, 2, 3, 4, 5, 6, 7 … 100000, they will all be classified as a single group unless there's a ‘break’ of >= threshold somewhere along the sequence. If this is not what you want, explore clustering methods.
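A small base R check makes this chaining effect concrete. It assumes the elided sequence is the consecutive integers 1 to 100000 and picks a threshold of 2; both are assumptions made for illustration, and the grouping rule is re-stated inline rather than calling the package.

```r
# Assumed input: consecutive integers, so every gap between neighbours is 1
x <- 1:100000
# Assumed threshold; any value above 1 gives the same result here
threshold <- 2

# Same rule as described above: a gap >= threshold starts a new group
group_id <- cumsum(c(TRUE, diff(sort(x)) >= threshold))

n_groups <- length(unique(group_id))  # number of groups found
group_span <- range(x[group_id == 1]) # smallest and largest member of group 1

n_groups
group_span
```

Because no single gap reaches the threshold, `n_groups` is 1 and the one group spans 1 to 100000, even though its extremes differ by five orders of magnitude.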
You can install the development version of dumblump like so:
#install.packages('remotes')
remotes::install_github('selkamand/dumblump')
This is a basic example that lumps a numeric vector using a threshold of 1:
library(dumblump)
unlumped <- c(1, 1, 2, 5, 5, 6, 1, 12, 12)
lumped <- dumblump(unlumped, threshold = 1)
data.frame(lumped, unlumped)
#>    lumped unlumped
#> 1 Group 1        1
#> 2 Group 1        1
#> 3 Group 2        2
#> 4 Group 3        5
#> 5 Group 3        5
#> 6 Group 4        6
#> 7 Group 1        1
#> 8 Group 5       12
#> 9 Group 5       12