Voting methods for feature ranking in efs #112

Merged
merged 75 commits into from
Nov 30, 2024
Commits
a5d1b38
add stability selection article
bblodfon Jul 31, 2024
4cc3815
add Rcpp code for approval voting feature ranking method
bblodfon Jul 31, 2024
21ae7d7
add citation
bblodfon Jul 31, 2024
ccffa4b
extra check during init()
bblodfon Jul 31, 2024
108ddc2
update doc + use the Rcpp interface for approval voting
bblodfon Jul 31, 2024
589df2e
add templates for params in ArchiveBatchFSelect + updocs
bblodfon Jul 31, 2024
e520c77
use testthat expectations (not checkmate ones!)
bblodfon Jul 31, 2024
0ecc618
add test for newly implemented voting methods
bblodfon Jul 31, 2024
2622c96
update test for av
bblodfon Jul 31, 2024
97f21c4
fix note
bblodfon Jul 31, 2024
f84f91c
refactor AV_rcpp, add SAV_rcpp
bblodfon Aug 1, 2024
3614d93
add norm_score, and SAV R function
bblodfon Aug 1, 2024
0a1eb49
add sav, improve doc
bblodfon Aug 1, 2024
fc5d24d
fix efs test
bblodfon Aug 1, 2024
6df3bbd
update and improve test for AV
bblodfon Aug 1, 2024
fc86503
add sav test
bblodfon Aug 1, 2024
0d9eccf
Merge branch 'main' into voting_methods
bblodfon Aug 7, 2024
87d68d4
add borda score
bblodfon Aug 7, 2024
fa05f09
update tests
bblodfon Aug 7, 2024
6a89966
add seq and revseq PAV Rcpp methods
bblodfon Aug 12, 2024
5c09975
add R functions for the PAV methods
bblodfon Aug 12, 2024
103bf45
comment printing
bblodfon Aug 12, 2024
ff17d11
add tests for PAV methods
bblodfon Aug 12, 2024
b6f4b5e
add PAV methods to efs
bblodfon Aug 12, 2024
3a248cf
refactor: do not use C++ RNGs
bblodfon Aug 13, 2024
92ce0df
fix startsWith
bblodfon Aug 13, 2024
283003e
updocs
bblodfon Aug 13, 2024
567f456
fix data.table note
bblodfon Aug 13, 2024
e55ae24
add committee_size parameter, refactor borda score
bblodfon Aug 19, 2024
9a37e60
add large data test for seq pav
bblodfon Aug 19, 2024
58ab928
refactor C++ code, add optimized PAV
bblodfon Aug 21, 2024
61c0907
remove revseq-PAV method, use optimized seqPAV
bblodfon Aug 21, 2024
8654a38
update tests
bblodfon Aug 21, 2024
47e3dcf
remove suboptimal seqPAV function
bblodfon Aug 23, 2024
b369c6e
shuffle candidates outside Rcpp functions (same tie-breaking)
bblodfon Aug 23, 2024
6b7fb03
optimize Phragmen a bit => do not randomly select the candidate with …
bblodfon Aug 23, 2024
60065f9
add phragmen's rule in efs
bblodfon Aug 23, 2024
8ffa44f
correct borda score + use phragmens rule
bblodfon Aug 23, 2024
852ff35
add tests for Phragmen's rule
bblodfon Aug 23, 2024
5623812
correct weighted Phragmen's rule
bblodfon Sep 18, 2024
7e3be3e
add specific test for phragmen's rule
bblodfon Sep 18, 2024
25387c4
Merge branch 'main' into voting_methods
bblodfon Sep 19, 2024
1eef6c6
run document()
bblodfon Sep 19, 2024
f2ccbda
show data.table result after using ':='
bblodfon Oct 17, 2024
bea5e39
add n_resamples field + nicer obj print
bblodfon Oct 17, 2024
2d21fc7
cover edge case (eg lasso resulted in no features getting selected)
bblodfon Oct 24, 2024
ad9fd2e
Merge branch 'main' into voting_methods
bblodfon Oct 25, 2024
7f3ab3b
updocs
bblodfon Oct 25, 2024
4137404
small styling fix
bblodfon Oct 25, 2024
d151303
add Stabl ref
bblodfon Oct 31, 2024
83529b6
more descriptive name
bblodfon Oct 31, 2024
49bb097
add embedded ensemble feature selection
bblodfon Oct 31, 2024
6f3923f
remove print()
bblodfon Nov 1, 2024
123624e
add TOCHECK comment on benchmark design
bblodfon Nov 5, 2024
0581cdc
use internal valid task
be-marc Nov 11, 2024
14acd73
simplify
be-marc Nov 11, 2024
81b475d
...
be-marc Nov 11, 2024
79747ad
store_models = FALSE
be-marc Nov 11, 2024
331f231
...
be-marc Nov 11, 2024
081acc8
separate the use of inner_measure and measure used in the test sets
bblodfon Nov 18, 2024
efc0155
updocs
bblodfon Nov 18, 2024
0e2f93f
update tests
bblodfon Nov 18, 2024
3bca203
Merge branch 'main' into voting_methods
bblodfon Nov 18, 2024
d457221
refactor: expect_vector => expect_numeric
bblodfon Nov 18, 2024
9cb56b1
fix partial arg match
bblodfon Nov 18, 2024
cc36179
fix example
bblodfon Nov 18, 2024
816376a
use fastVoteR for feature ranking
bblodfon Nov 23, 2024
3dae249
pass named list to callback parameter
be-marc Nov 25, 2024
fd5afbc
skip test if fastVoteR is not available
bblodfon Nov 25, 2024
c937024
refactor: better handling of inner measure
bblodfon Nov 26, 2024
8e506c8
add tests for embedded_ensemble_fselect()
bblodfon Nov 26, 2024
3bd1772
update NEWs
bblodfon Nov 26, 2024
9e05dca
add active_measure field
bblodfon Nov 26, 2024
832bd7f
remove Remotes as fastVoteR is now on CRAN :)
bblodfon Nov 27, 2024
8c0d73f
refine doc
bblodfon Nov 29, 2024
2 changes: 2 additions & 0 deletions DESCRIPTION
@@ -41,6 +41,7 @@ Suggests:
mlr3learners,
mlr3pipelines,
rpart,
fastVoteR,
testthat (>= 3.0.0)
Config/testthat/edition: 3
Config/testthat/parallel: true
@@ -74,6 +75,7 @@ Collate:
'assertions.R'
'auto_fselector.R'
'bibentries.R'
'embedded_ensemble_fselect.R'
'ensemble_fselect.R'
'extract_inner_fselect_archives.R'
'extract_inner_fselect_results.R'
1 change: 1 addition & 0 deletions NAMESPACE
@@ -36,6 +36,7 @@ export(auto_fselector)
export(callback_batch_fselect)
export(clbk)
export(clbks)
export(embedded_ensemble_fselect)
export(ensemble_fselect)
export(extract_inner_fselect_archives)
export(extract_inner_fselect_results)
4 changes: 4 additions & 0 deletions NEWS.md
@@ -1,5 +1,9 @@
# mlr3fselect (development version)

* Use [fastVoteR](https://github.com/bblodfon/fastVoteR) for feature ranking in `EnsembleFSResult()` objects
* Add embedded ensemble feature selection `embedded_ensemble_fselect()`
* Refactor `ensemble_fselect()` and `EnsembleFSResult()`

# mlr3fselect 1.2.1

* compatibility: mlr3 0.22.0
238 changes: 184 additions & 54 deletions R/EnsembleFSResult.R

Large diffs are not rendered by default.
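
The diff for R/EnsembleFSResult.R is not rendered here, so the ranking code itself cannot be shown. As a rough illustration of what the commit messages call approval voting (AV), here is a minimal base-R sketch. The feature sets, the names `voters` and `candidates`, and the normalization are assumptions made for illustration; they are not taken from the package or from fastVoteR, whose normalization and tie-breaking may differ.

```r
# Approval voting (AV) sketch in base R; the feature sets below are made up.
# Each "voter" is one trained (learner, subsample) model and approves
# exactly the features it selected.
voters = list(
  c("V1", "V2", "V3"),
  c("V1", "V3"),
  c("V2"),
  c("V1", "V4")
)
# all candidate features, including one that is never selected
candidates = paste0("V", 1:5)

# AV score = number of voters approving each candidate
counts = table(factor(unlist(voters), levels = candidates))

ranking = data.frame(
  feature = names(counts),
  score = as.integer(counts),
  # hypothetical normalization: fraction of voters approving the feature
  norm_score = as.integer(counts) / length(voters)
)

# sort by decreasing score; ties keep the candidate order in this sketch
ranking = ranking[order(-ranking$score), ]
print(ranking)
```

In this sketch a feature selected by many models ranks high regardless of how well those models performed; weighting voters by their test-set score would be a natural extension.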

41 changes: 29 additions & 12 deletions R/bibentries.R
@@ -9,7 +9,6 @@ bibentries = c(
title = "ecr 2.0",
booktitle = "Proceedings of the Genetic and Evolutionary Computation Conference Companion"
),

bergstra_2012 = bibentry("article",
title = "Random Search for Hyper-Parameter Optimization",
author = "James Bergstra and Yoshua Bengio",
@@ -20,8 +19,7 @@ bibentries = c(
pages = "281--305",
url = "https://jmlr.csail.mit.edu/papers/v13/bergstra12a.html"
),

thomas2017 = bibentry("article",
thomas2017 = bibentry("article",
doi = "10.1155/2017/1421409",
year = "2017",
publisher = "Hindawi Limited",
@@ -31,8 +29,7 @@ bibentries = c(
title = "Probing for Sparse and Fast Variable Selection with Model-Based Boosting",
journal = "Computational and Mathematical Methods in Medicine"
),

wu2007 = bibentry("article",
wu2007 = bibentry("article",
doi = "10.1198/016214506000000843",
year = "2007",
month = "3",
@@ -44,8 +41,7 @@ bibentries = c(
title = "Controlling Variable Selection by the Addition of Pseudovariables",
journal = "Journal of the American Statistical Association"
),

guyon2002 = bibentry("article",
guyon2002 = bibentry("article",
title = "Gene Selection for Cancer Classification using Support Vector Machines",
volume = "46",
issn = "1573-0565",
@@ -56,7 +52,6 @@ bibentries = c(
author = "Isabelle Guyon and Jason Weston and Stephen Barnhill and Vladimir Vapnik",
year = "2002"
),

kuhn2013 = bibentry("Inbook",
author = "Kuhn, Max and Johnson, Kjell",
chapter = "Over-Fitting and Model Tuning",
@@ -67,7 +62,6 @@ bibentries = c(
pages = "61--92",
isbn = "978-1-4614-6849-3"
),

saeys2008 = bibentry("article",
author = "Saeys, Yvan and Abeel, Thomas and Van De Peer, Yves",
doi = "10.1007/978-3-540-87481-2_21",
@@ -79,7 +73,6 @@ bibentries = c(
volume = "5212 LNAI",
year = "2008"
),

abeel2010 = bibentry("article",
author = "Abeel, Thomas and Helleputte, Thibault and Van de Peer, Yves and Dupont, Pierre and Saeys, Yvan",
doi = "10.1093/BIOINFORMATICS/BTP630",
@@ -92,7 +85,6 @@ bibentries = c(
volume = "26",
year = "2010"
),

pes2020 = bibentry("article",
author = "Pes, Barbara",
doi = "10.1007/s00521-019-04082-3",
@@ -106,7 +98,6 @@ bibentries = c(
volume = "32",
year = "2020"
),

das1999 = bibentry("article",
author = "Das, I",
issn = "09344373",
@@ -118,5 +109,31 @@ bibentries = c(
title = "On characterizing the 'knee' of the Pareto curve based on normal-boundary intersection",
volume = "18",
year = "1999"
),
meinshausen2010 = bibentry("article",
author = "Meinshausen, Nicolai and Buhlmann, Peter",
doi = "10.1111/J.1467-9868.2010.00740.X",
eprint = "0809.2932",
issn = "1369-7412",
journal = "Journal of the Royal Statistical Society Series B: Statistical Methodology",
month = "sep",
number = "4",
pages = "417--473",
publisher = "Oxford Academic",
title = "Stability Selection",
volume = "72",
year = "2010"
),
hedou2024 = bibentry("article",
author = "Hedou, Julien and Maric, Ivana and Bellan, Gregoire and Einhaus, Jakob and Gaudilliere, Dyani K. and Ladant, Francois Xavier and Verdonk, Franck and Stelzer, Ina A. and Feyaerts, Dorien and Tsai, Amy S. and Ganio, Edward A. and Sabayev, Maximilian and Gillard, Joshua and Amar, Jonas and Cambriel, Amelie and Oskotsky, Tomiko T. and Roldan, Alennie and Golob, Jonathan L. and Sirota, Marina and Bonham, Thomas A. and Sato, Masaki and Diop, Maigane and Durand, Xavier and Angst, Martin S. and Stevenson, David K. and Aghaeepour, Nima and Montanari, Andrea and Gaudilliere, Brice", #nolint
doi = "10.1038/s41587-023-02033-x",
issn = "1546-1696",
journal = "Nature Biotechnology 2024",
month = "jan",
pages = "1--13",
publisher = "Nature Publishing Group",
title = "Discovery of sparse, reliable omic biomarkers with Stabl",
url = "https://www.nature.com/articles/s41587-023-02033-x",
year = "2024"
)
)
112 changes: 112 additions & 0 deletions R/embedded_ensemble_fselect.R
@@ -0,0 +1,112 @@
#' @title Embedded Ensemble Feature Selection
#'
#' @include CallbackBatchFSelect.R
#'
#' @description
#' Ensemble feature selection using multiple learners.
#' The ensemble feature selection method is designed to identify the most predictive features from a given dataset by leveraging multiple machine learning models and resampling techniques.
#' Returns an [EnsembleFSResult].
#'
#' @details
#' The method first applies a user-specified resampling to create **multiple subsamples** (train/test splits) of the original dataset.
#' Resampling produces diverse subsets of the data, which makes the feature selection more robust.
#'
#' For each subsample, learners that support **embedded feature selection**
#' are trained on the train set.
#' The features each learner selects during training are stored, and the
#' learner is scored on the corresponding test set, for each
#' combination of subsample and learner.
#'
#' Results are stored in an [EnsembleFSResult].
#'
#' @param learners (list of [mlr3::Learner])\cr
#' The learners to be used for feature selection.
#' All learners must have the `selected_features` property, i.e. implement
#' embedded feature selection (e.g. regularized models).
#' @param init_resampling ([mlr3::Resampling])\cr
#' The initial resampling strategy of the data, from which each train set
#' will be passed on to the learners and each test set will be used for
#' prediction.
#' Can only be [mlr3::ResamplingSubsampling] or [mlr3::ResamplingBootstrap].
#' @param measure ([mlr3::Measure])\cr
#' The measure used to score each learner on the test sets generated by
#' `init_resampling`.
#' Required: the argument has no default and `NULL` is not accepted.
#' @param store_benchmark_result (`logical(1)`)\cr
#' Whether to store the benchmark result in [EnsembleFSResult] or not.
#'
#' @template param_task
#'
#' @returns An [EnsembleFSResult] object.
#'
#' @source
#' `r format_bib("meinshausen2010", "hedou2024")`
#' @export
#' @examples
#' \donttest{
#' eefsr = embedded_ensemble_fselect(
#'   task = tsk("sonar"),
#'   learners = lrns(c("classif.rpart", "classif.featureless")),
#'   init_resampling = rsmp("subsampling", repeats = 5),
#'   measure = msr("classif.ce")
#' )
#' eefsr
#' }
embedded_ensemble_fselect = function(
  task,
  learners,
  init_resampling,
  measure,
  store_benchmark_result = TRUE
) {
  assert_task(task)
  assert_learners(as_learners(learners), task = task, properties = "selected_features")
  assert_resampling(init_resampling)
  assert_choice(class(init_resampling)[1], choices = c("ResamplingBootstrap", "ResamplingSubsampling"))
  assert_measure(measure, task = task)
  assert_flag(store_benchmark_result)

  init_resampling$instantiate(task)

  design = benchmark_grid(
    tasks = task,
    learners = learners,
    resamplings = init_resampling
  )

  bmr = benchmark(design, store_models = TRUE)

  trained_learners = bmr$score()$learner

  # extract selected features
  features = map(trained_learners, function(learner) {
    learner$selected_features()
  })

  # extract n_features
  n_features = map_int(features, length)

  # extract scores on the test sets
  scores = bmr$score(measure)

  set(scores, j = "features", value = features)
  set(scores, j = "n_features", value = n_features)
  setnames(scores, "iteration", "resampling_iteration")

  # remove R6 objects
  set(scores, j = "learner", value = NULL)
  set(scores, j = "task", value = NULL)
  set(scores, j = "resampling", value = NULL)
  set(scores, j = "prediction_test", value = NULL)
  set(scores, j = "task_id", value = NULL)
  set(scores, j = "nr", value = NULL)
  set(scores, j = "resampling_id", value = NULL)
  set(scores, j = "uhash", value = NULL)

  EnsembleFSResult$new(
    result = scores,
    features = task$feature_names,
    benchmark_result = if (store_benchmark_result) bmr,
    measure = measure
  )
}
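
The procedure described in the roxygen `@details` (subsample, train a learner with embedded feature selection, score it on the test set, store the selected features) can be sketched without mlr3. This is only an illustration under stated assumptions, not the package's implementation: it uses the `iris` data, `rpart` as the single learner, five subsamples with a 2/3 train ratio, and classification error as the measure. All variable names are hypothetical.

```r
# Base-R sketch of embedded ensemble feature selection.
# Assumptions (not from the PR): iris data, rpart as the only learner,
# 5 subsamples with a 2/3 train ratio, classification error as the measure.
library(rpart)

set.seed(42)
data = iris
n = nrow(data)
repeats = 5L
ratio = 2 / 3

results = lapply(seq_len(repeats), function(i) {
  # subsampling: draw the train set without replacement, the rest is the test set
  train_ids = sample(n, size = round(ratio * n))
  test_ids = setdiff(seq_len(n), train_ids)

  # rpart selects features as a side effect of training:
  # only variables used in a split count as "selected"
  fit = rpart(Species ~ ., data = data[train_ids, ])
  selected = setdiff(unique(as.character(fit$frame$var)), "<leaf>")

  # score the trained model on the held-out test set (classification error)
  pred = predict(fit, newdata = data[test_ids, ], type = "class")
  ce = mean(as.character(pred) != as.character(data$Species[test_ids]))

  list(resampling_iteration = i, features = selected,
    n_features = length(selected), classif.ce = ce)
})

# one row per (subsample, learner) combination, similar in spirit to the
# table that the function above passes to EnsembleFSResult$new()
res = data.frame(
  resampling_iteration = vapply(results, `[[`, integer(1), "resampling_iteration"),
  n_features = vapply(results, `[[`, integer(1), "n_features"),
  classif.ce = vapply(results, `[[`, numeric(1), "classif.ce")
)
res$features = lapply(results, `[[`, "features")
print(res)
```

With several learners, the same loop would run once per learner and subsample, which is what `benchmark_grid()` sets up in the function above.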