This repository is the implementation of the method presented in "Minimax Risk Classifiers for Mislabeled Data: a Study on Patient Outcome Prediction Tasks".
The algorithm proposed in the paper provides an efficient method to learn from noisy labels and a robust method to evaluate the performance of the classifier, even in scenarios where clean test data are not available.
This algorithm can be used whether the transition matrix $T$ is known or has to be estimated.
- Python >= 3.6 (the code was developed using Python 3.11)
- numpy
- scipy
- scikit-learn
- cvxpy
- mosek
- gurobipy
- pandas
- cleanlab
- Depending on the version of Python installed in your environment, you may need to install CMake. If prompted, you can install it following the guidelines in Download CMake.
- The CVXPY-based implementation of the proposed algorithm uses the MOSEK optimizer, which requires a license. A free academic license is available from MOSEK.
To install the required libraries, proceed as follows:
- Install the standard libraries listed in requirements.txt:
    pip install -r requirements.txt
- Run the following commands to install the paper's custom distribution of the MRCpy library:
    cd MRCpy
    python3 setup.py install
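After installation, a quick check that the libraries can be found may save time before launching long experiments. The following is a minimal sketch, not part of the repository: the helper name `missing_packages` is hypothetical, and the import names are assumptions (scikit-learn is imported as `sklearn`, and the custom MRCpy distribution is assumed to install under the import name `MRCpy`).

```python
import importlib.util

# Import names assumed for the libraries listed in the prerequisites.
# scikit-learn is imported as "sklearn"; "MRCpy" is the assumed import
# name of the custom distribution installed from the MRCpy folder.
REQUIRED = [
    "numpy", "scipy", "sklearn", "cvxpy", "mosek",
    "gurobipy", "pandas", "cleanlab", "MRCpy",
]


def missing_packages(names=REQUIRED):
    """Return the names that cannot be located in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages()
print("Missing packages:", ", ".join(missing) if missing else "none")
```

This only checks that each package can be located; it does not verify that a MOSEK or Gurobi license is valid.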
The data_mortality folder contains the ICU Mortality dataset, polished as explained in the associated paper. In particular, it provides two versions of the dataset in CSV format:
- mortality_alsocat.csv: contains the polished data with all the features.
- mortality_nocat.csv: contains the polished data without categorical variables.
The original dataset is available here (a login is needed to download it).
The datasets folder contains the Mammographic Mass dataset, as well as the additional datasets mentioned in the appendices of the paper.
NOTE: Please ensure that the folders remain in their current location within the parent directory. If you choose to relocate a folder, remember to update the file paths accordingly.
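To illustrate how the file paths depend on the folder layout, here is a small sketch that builds the path of either ICU Mortality CSV relative to the repository root and loads it with pandas. The function names `mortality_csv` and `load_mortality` are hypothetical and do not come from the repository, and the sketch assumes the CSV files have a header row.

```python
from pathlib import Path


def mortality_csv(root, categorical=True):
    """Path of the ICU Mortality CSV, with or without categorical variables."""
    name = "mortality_alsocat.csv" if categorical else "mortality_nocat.csv"
    return Path(root) / "data_mortality" / name


def load_mortality(root=".", categorical=True):
    """Load the selected CSV, failing early if the folder was moved."""
    import pandas as pd  # imported lazily so the path helper works without pandas

    path = mortality_csv(root, categorical)
    if not path.exists():
        raise FileNotFoundError(
            f"{path} not found; check that data_mortality was not relocated"
        )
    return pd.read_csv(path)  # assumes the first row holds column names
```

Calling `load_mortality(".", categorical=False)` from the repository root would read mortality_nocat.csv under these assumptions.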
The simulation_execution_files folder contains the scripts to replicate the experiments of the paper.
Experiments to learn on noisy data and evaluate on clean test data are:
- $T$ known experiments:
  - run_ntrain_mrc.py: performs training and evaluation of NoisyMRC, NaiveMRC, and OracleMRC;
  - run_ntrain_nata.py: performs training and evaluation of Noisy LR;
  - run_ntrain_lr.py: performs training and evaluation of Naive LR and Oracle LR;
  - run_ntrain_cl.py: performs training and evaluation of the CleanLearning method.
- $T$ unknown experiments:
  - run_ntrain_mrcest.py: performs training and evaluation of NoisyMRC with $T$ estimated;
  - run_ntrain_cleansed.py: performs training and evaluation of Cleansed MRC and Cleansed LR.
Experiments to learn and evaluate on noisy data:
- run_perfeval_mrc.py: performs training of Noisy MRC and evaluation of it with Minimax;
- run_perfeval_nata.py: performs training of Noisy LR and evaluation of it with ULE;
- run_perfeval_lrcleansed.py: performs training of Cleansed LR and evaluation of it with LE.
Inside the Python scripts listed above you can manually set various training parameters, including:
- lambda0: (float, default: 1) Defines the parameter $\lambda_0$ of the MRCs.
- det: (boolean, default: True) If set to True, uses the deterministic rule in the MRCs.
- categorical: (boolean, default: True) If set to True, reads the dataset that also contains the categorical variables. Otherwise, reads the dataset where they have been removed, keeping only the continuous features.
- balanced: (boolean, default: True) If set to True, balances the dataset (to have the same percentage of 0 and 1 labels).
- r1: (int, constraints: > 0, < 50) Specifies the percentage of 0 labels that are mislabeled, that is, the noise rate $\rho_1$ ($\rho_1 =$ r1 $/ 100$).
- r2: (int, constraints: > 0, < 50) Specifies the percentage of 1 labels that are mislabeled, that is, the noise rate $\rho_2$ ($\rho_2 =$ r2 $/ 100$).
- Nrep: (int) Specifies the number of repetitions of the simulation.
- nvector: (numpy array) Specifies the training sizes to use.
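The role of r1 and r2 can be illustrated with a short sketch of class-conditional label noise. This is not the repository's code: the function names are hypothetical, and the layout of the transition matrix (row $i$ holding the distribution of the noisy label given true label $i$) is an assumed convention that may differ from the one used in the paper.

```python
import numpy as np


def noise_rates(r1, r2):
    """Convert the percentages r1, r2 into noise rates rho1, rho2."""
    if not (0 < r1 < 50 and 0 < r2 < 50):
        raise ValueError("r1 and r2 must be strictly between 0 and 50")
    return r1 / 100, r2 / 100


def transition_matrix(rho1, rho2):
    """Assumed convention: entry (i, j) is P(noisy label = j | true label = i)."""
    return np.array([[1 - rho1, rho1],
                     [rho2, 1 - rho2]])


def flip_labels(y, rho1, rho2, rng):
    """Flip each 0 label with probability rho1 and each 1 label with probability rho2."""
    y = np.asarray(y)
    flip_prob = np.where(y == 0, rho1, rho2)
    flips = rng.random(y.shape[0]) < flip_prob
    return np.where(flips, 1 - y, y)


# Usage on synthetic labels (not the ICU Mortality data).
rng = np.random.default_rng(0)
rho1, rho2 = noise_rates(20, 30)
T = transition_matrix(rho1, rho2)
y = np.array([0] * 5000 + [1] * 5000)
y_noisy = flip_labels(y, rho1, rho2, rng)
```

With r1 = 20 and r2 = 30, roughly 20% of the 0 labels and 30% of the 1 labels end up flipped in `y_noisy`, and each row of `T` sums to 1.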
To replicate the plots presented in the submission for the ICU Mortality dataset, use the Python files named load_and_plot_***. These files load the results and generate the plots. Below are instructions for reproducing specific figures:
- To reproduce Figure 1: run load_and_plot_ntrain.py.
- To reproduce Figure 2: run load_and_plot_boxplot.py.
- To reproduce Figure 3: run load_and_plot_ntrain_est.py.
- To reproduce Figure 4: run load_and_plot_ntrain_cleansed.py.
Ensure that you have the necessary data before running these scripts.