Abstract • Repository Description • Bias Induction via Hierarchy Bias • Bias Mitigation via DCAST • Installation and Dependencies • Dataset Generation • Direct Usage of DCAST • Code Integration • BibTeX Reference • License
Fairness in machine learning seeks to mitigate bias against individuals based on sensitive features such as gender or income, wherein selection bias is a common cause of suboptimal models for underrepresented profiles. Notably, bias unrelated to sensitive features often goes undiagnosed, even though it is prominent in complex high-dimensional settings from fields like computer vision and molecular biomedicine. Strategies to mitigate unidentified bias, and to induce bias for the evaluation of mitigation methods, are crucially needed yet remain underexplored. We introduce: (i) Diverse Class-Aware Self-Training (DCAST), model-agnostic mitigation aware of class-specific bias, which promotes sample diversity to counter the confirmation bias of conventional self-training while leveraging unlabeled samples for an improved representation of the real-world distribution; (ii) hierarchy bias, for multivariate and class-aware bias induction without prior knowledge. Models learned with DCAST showed improved robustness to hierarchy and other biases across eleven real-world datasets, compared to conventional self-training and six prominent domain adaptation techniques. The advantage was largest for higher-dimensional datasets, suggesting DCAST as a promising strategy for fairer learning. (Published as "DCAST: Diverse Class-Aware Self-Training Mitigates Selection Bias for Fairer Learning" in ...)
Figure: Hierarchy bias approach for induction of selection bias.
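For intuition, below is a minimal, hypothetical sketch of hierarchy-style bias induction, assuming hierarchical clustering of each class in feature space and preferential sampling from one cluster with probability prob. The function name, clustering choice (Ward linkage), and parameters are illustrative only; the actual implementation used in this repository is bt.get_bias('hierarchy', ...) in src/bias_techniques.py.

# Hypothetical sketch of hierarchy bias induction (illustrative only, not the repository code).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def hierarchy_bias_sketch(X, y, max_size=30, prob=0.9, n_clusters=5, seed=0):
    # Select max_size samples per class, over-representing one cluster with probability prob.
    rng = np.random.default_rng(seed)
    selected = []
    for label in np.unique(y):
        idx = np.where(y == label)[0]
        # Hierarchically cluster this class in feature space (Ward linkage as an example).
        clusters = fcluster(linkage(X[idx], method='ward'), n_clusters, criterion='maxclust')
        favoured = rng.choice(np.unique(clusters))  # cluster to over-represent
        for _ in range(max_size):
            # With probability prob draw from the favoured cluster, otherwise from the others.
            pool = idx[clusters == favoured] if rng.random() < prob else idx[clusters != favoured]
            if pool.size == 0:
                pool = idx
            selected.append(rng.choice(pool))
    return np.array(selected)  # indices of the biased (selected) samples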
Figure: Diverse Class-Aware Self-Training (DCAST) framework. (Left) Input to DCAST: labeled data and unlabeled samples.
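As a rough, hypothetical illustration of the idea (not the implementation in src/models), a diverse class-aware self-training loop could look like the sketch below; the scikit-learn style classifier, the k-means step used for diversity, and the n_per_class parameter are assumptions made for illustration.

# Hypothetical sketch of diverse, class-aware self-training (illustrative only).
import numpy as np
from sklearn.cluster import KMeans

def dcast_sketch(model, X_lab, y_lab, X_unlab, n_iter=10, n_per_class=10, seed=0):
    X_l, y_l, X_u = X_lab.copy(), y_lab.copy(), X_unlab.copy()
    for _ in range(n_iter):
        if len(X_u) == 0:
            break
        model.fit(X_l, y_l)
        proba = model.predict_proba(X_u)
        pseudo = model.classes_[proba.argmax(axis=1)]  # pseudo-labels for unlabeled samples
        conf = proba.max(axis=1)
        picked = []
        for c in model.classes_:  # class-aware: select pseudo-labeled samples per class
            cand = np.where(pseudo == c)[0]
            cand = cand[np.argsort(conf[cand])[::-1]][:5 * n_per_class]  # most confident first
            if cand.size == 0:
                continue
            k = min(n_per_class, cand.size)
            # Diversity: keep one (most confident) candidate per k-means cluster.
            groups = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X_u[cand])
            picked += [cand[groups == g][0] for g in range(k) if np.any(groups == g)]
        if not picked:
            break
        picked = np.array(picked)
        X_l = np.vstack([X_l, X_u[picked]])
        y_l = np.concatenate([y_l, pseudo[picked]])
        X_u = np.delete(X_u, picked, axis=0)
    return model.fit(X_l, y_l)

The key differences from conventional self-training are the per-class selection of pseudo-labeled samples and the diversity constraint on the samples added at each iteration.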
- data: All data used for feature generation and experiments.
  - datasets: Different types of tabular datasets.
  - graphs: Data folder for graphs.
- logs: Log files for each experiment and feature generation run.
- results: JSON model files and CSV experiment performance results.
- src: All source code of the application.
  - analysis: Code for visualizing and reporting the results.
  - datasets: Code to process graph data.
  - test: Files for the main experiments.
  - lib: Utility code and common visualization functions for the project.
  - models: DCAST models for logistic regression (LR), neural network (NN), and regularized random forest (RRF); for NN and RRF, feature-space-level diversity (FSD) variants are also included.
  - bias_techniques.py: Functions for the different bias induction methods.
  - config.py: Loads the project settings.
  - load_dataset.py: Loads the datasets.
- Python 3.9
- Packages: see conda_requirements.txt
- Open a terminal and go to the folder where you want to install the project.
- Clone the project:
git clone https://github.com/joanagoncalveslab/DCAST.git
- Enter the project folder and install the requirements:
cd DCAST
conda create --name DCAST --file conda_requirements.txt
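- Activate the environment before running any code:
conda activate DCAST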
- Load a dataset (e.g., breast_cancer) with train and test splits:
from src import load_dataset as ld

dataset_name = 'breast_cancer'
dataset_args = {}  # optional dataset-specific arguments
# test=True also returns a held-out test split
x_train, y_train, x_test, y_test = ld.load_dataset(dataset_name, **dataset_args, test=True)
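A quick sanity check of the loaded splits (assuming the loader returns NumPy arrays):

print(x_train.shape, y_train.shape)  # training features and labels
print(x_test.shape, y_test.shape)    # held-out test features and labels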
- Induce hierarchy bias with strength 0.9, selecting 30 samples from each class:
from src import bias_techniques as bt

# X, y: features and labels to bias (e.g., x_train and y_train loaded above)
bias_params = {'max_size': 30, 'prob': 0.9}  # 30 samples per class, bias strength 0.9
selections = bt.get_bias('hierarchy', X, y, **bias_params).astype(int)
X_biased, y_biased = X[selections, :], y[selections]
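To verify the induced bias, count the selected samples per class (expected: max_size, here 30, for each class):

import numpy as np
classes, counts = np.unique(y_biased, return_counts=True)
print(dict(zip(classes, counts)))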
- Induce a custom bias by passing the bias name and its arguments:
from src import bias_techniques as bt

# bias_name and bias_params are placeholders for the chosen bias method and its arguments
selections = bt.get_bias(bias_name, X, **bias_params).astype(int)
X_biased, y_biased = X[selections, :], y[selections]
- Run the NN experiment with hierarchy bias (strength 0.9, 30 samples per class) on the breast_cancer dataset:
python3.9 src/test/nn_bias_test_nb_imb2seed.py --bias hierarchy_0.9 --dataset breast_cancer --bias_size 30 -es
The saved model and results can be found in the results_nn_test_nb_imb_ss/bias_name/dataset_name folder.
@article{tepeli2024DCAST,
title={DCAST: Diverse Class-Aware Self-Training Mitigates Selection Bias for Fairer Learning},
author={Tepeli, Yasin I. and Goncalves, Joana P.},
journal={XXX},
volume={00},
number={0},
pages={000},
year={2021},
publisher={Springer Science and Business Media {LLC}}
}
- Copyright © [ytepeli].