This is a list of papers about training data quality management for ML models.
Data scientists spend ∼80% of their time on data preparation for an ML pipeline because data quality issues are unknown beforehand, which leads to iterative debugging [1]. A good Data Quality Management System for ML (DQMS for ML) helps data scientists break free from the arduous process of data selection and debugging, particularly in the current era of big data and large models. Effectively automating the management of training data quality is crucial for improving the efficiency and quality of ML pipelines.
With the emergence and development of "Data-Centric AI", there has been increasing research focus on optimizing the quality of training data rather than solely concentrating on model structures and training strategies. This is the motivation behind maintaining this repository.
Before we proceed, let's define data quality for ML. In contrast to traditional data cleaning, training data quality for ML refers to the impact of individual data samples, or groups of samples, on the behavior of ML models for a given task. Note that the model behavior we care about goes beyond performance metrics such as accuracy, recall, AUC, and MSE; we also consider broader properties such as model fairness and robustness.
In the pipeline below, the DQMS acts as middleware between the data, the ML model, and the user, and therefore needs to interact with each of them.
A DQMS for ML typically consists of three components: the Data Sculptor [2], the Data Attributer, and the Data Profiler. Reaching a well-performing ML model usually requires multiple rounds of training, during which the DQMS iteratively adjusts the training data based on the results of each round. The workflow of the DQMS in one round of training is as follows: (a) the Data Sculptor first acquires the training dataset from a data source and trains the ML model with it; (b) after one round of training (several epochs), the Data Attributer absorbs feedback from the model and the user's task requirements and computes a data quality assessment; (c) the Data Profiler then provides a user-friendly summary of the training data; (d) meanwhile, the Data Sculptor uses the data quality assessment as feedback to acquire higher-quality training data, initiating a new iteration.
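To make the round-based workflow concrete, here is a minimal Python sketch of one DQMS iteration. It is an illustration only: the component classes (`DataSculptor`, `DataAttributer`, `DataProfiler`), their method names, and the scoring interface are assumptions for this sketch, not an existing API.

```python
# Minimal sketch of one round of the DQMS workflow described above.
# All class and method names are illustrative assumptions, not a real API.

class DataSculptor:
    def acquire(self, source, quality_scores=None):
        # (a)/(d): select training data from the source; if quality scores from
        # the previous round are available, keep only higher-quality samples.
        if quality_scores is None:
            return list(source)
        return [x for x, s in zip(source, quality_scores) if s > 0.0]

class DataAttributer:
    def assess(self, model, train_data, task_requirements):
        # (b): score each training sample using model feedback and the user's
        # task requirements (e.g., accuracy, fairness, robustness objectives).
        scorer = task_requirements["scorer"]
        return [scorer(model, x) for x in train_data]

class DataProfiler:
    def summarize(self, train_data, quality_scores):
        # (c): produce a user-friendly summary of the current training data.
        return {
            "num_samples": len(train_data),
            "mean_quality": sum(quality_scores) / max(len(quality_scores), 1),
        }

def dqms_round(source, model, train_fn, task_requirements,
               sculptor, attributer, profiler, prev_scores=None):
    train_data = sculptor.acquire(source, prev_scores)                 # (a)/(d)
    model = train_fn(model, train_data)                                # one training round
    scores = attributer.assess(model, train_data, task_requirements)   # (b)
    report = profiler.summarize(train_data, scores)                    # (c)
    return model, scores, report
```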
We collect recent influential papers about DQMS for ML and annotate the relevant DQMS components involved in each paper, where DS = Data Sculptor, DA = Data Attributer, and DP = Data Profiler. The following papers are listed in roughly chronological order of publication.
Venue | Paper | Links | Tags | TLDR |
---|---|---|---|---|
arXiv'24 | Towards Data Valuation via Asymmetric Data Shapley | paper code | DA | |
arXiv'24 | Disentangled Structural and Featural Representation for Task-Agnostic Graph Valuation | paper | DA | |
arXiv'24 | Distilling The Knowledge in Data Pruning | paper | DS | |
Openreview | Harnessing Diversity for Important Data Selection in Pretraining Large Language Models | paper | DA | |
Openreview | SAVA: Scalable Learning-Agnostic Data Valuation | paper | DA | |
Openreview | Data Attribution for Multitask Learning | paper | DA | |
Openreview | On the Inflation of KNN-Shapley Value | paper | DA | |
Openreview | Data Valuation for Graphs | paper | DA | |
Openreview | Precedence-Constrained Winter Value for Effective Graph Data Valuation | paper | DA | |
Openreview | Data Shapley in One Training Run | paper | DA | |
Openreview | Generalized Group Data Attribution | paper | DA | |
Openreview | Top-m Data Values Identification | paper | DA | |
NIPS'24 | Not All Tokens Are What You Need for Pretraining | paper | DS | |
NIPS'24 | Stochastic Amortization: A Unified Approach to Accelerate Feature and Data Attribution | paper code code | DA | |
NIPS'24 | Data Distribution Valuation | paper code | DA | |
NIPS'24 | DU-Shapley: A Shapley Value Proxy for Efficient Dataset Valuation | paper | DA | |
NIPS'24 | SHED: Shapley-Based Automated Dataset Refinement for Instruction Fine-Tuning | paper | DA | It first divides the data into clusters and computes the Shapley value of each cluster, then selects representative data points inside each cluster. |
NIPS'24 | 2D-OOB: Attributing Data Contribution through Joint Valuation Framework | paper code | DA | |
NIPS'24 | Training Data Attribution via Approximate Unrolling | paper | DA | |
NIPS'24 | Data Attribution for Text-to-Image Models by Unlearning Synthesized Images | | DA | |
NIPS'24 | Efficient Sketches for Training Data Attribution and Studying the Loss Landscape | | DA | |
NIPS'24 | MATES: Model-Aware Data Selection for Efficient Pretraining with Data Influence Models | paper | DS | |
ICDE'24 | When Data Pricing Meets Non-cooperative Game Theory | paper | DA | |
arXiv'24 | Data Debiasing with Datamodels (D3M): Improving Subgroup Robustness via Data Selection | paper code | DS DA | |
KDD'24 | EcoVal: An Efficient Data Valuation Framework for Machine Learning | paper code | DA | For efficient data valuation, this work first divides the data into clusters and computes each cluster's value with LOO; it then assigns individual data values within a cluster through a production function. |
KDD'24 | Approximating Memorization Using Loss Surface Geometry for Dataset Pruning and Summarization | paper code | DS DP | This paper shows that the memorization score is effective for data summarization/selection tasks and proposes to approximate memorization with SGD. |
KDD'24 | Scalable Rule Lists Learning with Sampling | paper code | DP | This work learns an approximately optimal rule set through sampling while preserving both accuracy and efficiency. |
KDD'24 | AIM: Attributing, Interpreting, Mitigating Data Unfairness | paper code | DP | |
KDD'24 | CAT: Interpretable Concept-based Taylor Additive Models | paper code | | |
KDD'24 | Dataset Regeneration for Sequential Recommendation | paper code | DS | |
arXiv'24 | What is Your Data Worth to GPT? LLM-Scale Data Valuation with Influence Functions | paper | DA | |
arXiv'24 | CHG Shapley: Efficient Data Valuation and Selection towards Trustworthy Machine Learning | paper | DA | |
ICML'24 | QuRating: Selecting High-Quality Data for Training Language Models | paper code | DS | |
ICML'24 | Scaling Laws for the Value of Individual Data Points in Machine Learning | paper code | DA | This work proposes individual scaling laws to characterize how the marginal contribution of a data point changes as the dataset size grows, and presents two methods to estimate them. |
ICML'24 | Rethinking Data Shapley for Data Selection Tasks: Misleads and Merits | paper | DA DS | Without structural assumptions on the utility function, Data Shapley's performance on data selection tasks can be no better than random guessing; the paper proposes a heuristic for predicting when Data Shapley is effective for data selection. |
ICML'24 | Incorporating Information into Shapley Values: Reweighting via a Maximum Entropy Approach | paper | DA | |
ICML'24 | Distributionally Robust Data Valuation | paper code | DA | |
ICML'24 | Helpful or Harmful Data? Fine-tuning-free Shapley Attribution for Explaining Language Model Predictions | paper code | DA | It proves that the Shapley value is more robust than LOO and proposes FreeShap, which estimates Shapley values using the eNTK without retraining. |
ICML'24 | Efficient Adversarial Contrastive Learning via Robustness-Aware Coreset Selection | paper code | DA | |
ICML'24 | Optimal Coresets for Low-Dimensional Geometric Median | paper | DS | |
ICML'24 | No Dimensional Sampling Coresets for Classification | paper | DS | |
ICML'24 | Coresets for Multiple $\ell_p$ Regression | paper | DS | |
ICML'24 | Deletion-Anticipative Data Selection with a Limited Budget | paper | DS | |
ICML'24 | Data-Efficient Learning via Clustering-Based Sensitivity Sampling: Foundation Models and Beyond | paper | DS | |
ICML'24 | Refined Coreset Selection: Towards Minimal Coreset Size under Model Performance Constraints | paper code | DS | |
ICML'24 | Mind the Boundary: Coreset Selection via Reconstructing the Decision Boundary | paper | DS | This work selects a coreset that preserves the decision boundary of the model trained on the full dataset; it measures each sample's distance to its nearest decision boundary and selects data based on this distance. |
ICML'24 | DsDm: Dataset Selection with Datamodels | paper | DS | DsDm casts data selection as minimizing the loss on target data; it uses a linear datamodel to approximate the loss mapping and selects the bottom-k samples with the smallest estimated loss. |
ICML'24 | BWS: Best Window Selection Based on Sample Scores for Data Pruning across Broad Ranges | | DS | |
ICML'24 | LESS: Selecting Influential Data for Targeted Instruction Tuning | paper code | DS | |
ICML'24 | Exploiting Negative Samples: A Catalyst for Cohort Discovery in Healthcare Analytics | paper | DA DP | This work leverages the data Shapley value to value each negative sample and employs manifold learning and clustering to find influential patterns among negative samples. |
CVPR'24 | The Mirrored Influence Hypothesis: Efficient Data Influence Estimation by Harnessing Forward Passes | paper | DS | |
VLDB'24 | Counterfactual Explanation of Shapley Value in Data Coalitions | paper code | DA | If the Shapley value of data owner A is higher than that of B, the counterfactual explanation seeks the smallest subset of A's data such that moving it from A to B makes A's Shapley value lower than B's; a greedy search is proposed to find this counterfactual. |
VLDB'24 | P-Shapley: Shapley Values on Probabilistic Classifiers | paper code | DA | P-Shapley uses the raw predicted probability (instead of accuracy) as the utility function and proposes calibration functions to amplify the utility change when the predicted probability is high. |
VLDB'24 | MetaStore: Analyzing Deep Learning Meta-Data at Scale | paper | DA DP | |
VLDB'24 | Optimizing Data Acquisition to Enhance Machine Learning Performance | paper code | DS | |
VLDB'24 | MisDetect: Iterative Mislabel Detection using Early Loss | paper code | DA | |
VLDB'24 | Outlier Summarization via Human Interpretable Rules | paper code | DP | It trains a decision tree model to summarize the rule patterns of outliers. |
VLDB'24 | Chameleon: Foundation Models for Fairness-aware Multi-modal Data Augmentation to Enhance Coverage of Minorities | paper code | DS | It uses generative AI for augmentation, ensuring that the generated data covers the original data distribution with the smallest size. |
VLDB'24 | DataPrice: An Interactive System for Pricing Datasets in Data Marketplaces | paper | DA | |
SIGMOD'24 | Rock: Cleaning Data by Embedding ML in Logic Rules | paper | DP DA | Rock implements a uniform data cleaning framework that unifies ML and logic deduction. |
SIGMOD'24 | Data Acquisition for Improving Model Confidence | paper | DS | |
SIGMOD'24 | Controllable Tabular Data Synthesis Using Diffusion Models | paper | DS | |
SIGMOD'24 | Fast Shapley Value Computation in Data Assemblage Tasks as Cooperative Simple Games | paper code | DA | It assigns Shapley scores to data owners and their corresponding datasets in a data market. |
WWW'24 | Exploring Neural Scaling Law and Data Pruning Methods For Node Classification on Large-scale Graphs | paper code | DS | This work selects training nodes that are similar to test nodes by minimizing their bottleneck distance; to avoid bias from trivial selection, it uses a greedy algorithm to ensure the representativeness of the selected nodes. |
AAAI'24 | DeRDaVa: Deletion-Robust Data Valuation for Machine Learning | paper | DA | |
AAAI'24 | Quality-Diversity Generative Sampling for Learning with Synthetic Data | paper code | DS | |
AAAI'24 | Approximating the Shapley Value without Marginal Contributions | paper | DA | It decomposes the Shapley value as $\phi_i = \phi_i^+ + \phi_i^-$, samples coalitions, and updates $\phi_i^+$ and $\phi_i^-$ separately. |
WSDM'24 | FairIF: Boosting Fairness in Deep Learning via Influence Functions with Validation Set Sensitive Attributes | paper | DA | |
WSDM'24 | Efficient, Direct, and Restricted Black-Box Graph Evasion Attacks to Any-Layer Graph Neural Networks via Influence Function | paper code | DA | |
ICLR'24 | "What Data Benefits My Classifier?" Enhancing Model Performance and Interpretability through Influence-Based Data Selection | paper code | DS DA | It extends the influence function to account for utility, fairness, and robustness, and trains a decision tree to further estimate and interpret the influence scores. |
ICLR'24 | Canonpipe: Data Debugging with Shapley Importance over Machine Learning Pipelines | paper code | DA | It explores data valuation on raw data before preprocessing, using data provenance in ML pipelines and proposing a data Shapley under a KNN approximation. |
ICLR'24 | Time Travel in LLMs: Tracing Data Contamination in Large Language Models | paper code | DA | Data contamination means the presence of downstream-task test data in the pre-training data of LLMs; this work explores both instance-level and partition-level methods to identify potential contamination. |
ICLR'24 | GIO: Gradient Information Optimization for Training Dataset Selection | paper code | DA | GIO selects a small subset of a large source dataset by minimizing the KL divergence between the target distribution and the subset. |
ICLR'24 | Intriguing Properties of Data Attribution on Diffusion Models | paper code | DA | This paper proposes D-TRAK to attribute images generated by diffusion models back to the training data. |
ICLR'24 | D2 Pruning: Message Passing for Balancing Diversity and Difficulty in Data Pruning | paper code | DS | A data pruning method that takes diversity into consideration, implemented via forward and reverse message passing on the KNN graph. |
ICLR'24 | Effective pruning of web-scale datasets based on complexity of concept clusters | paper code | DS | |
ICLR'24 | Towards a statistical theory of data selection under weak supervision | paper | DS | |
ICLR'24 | Get more for less: Principled Data Selection for Warming Up Fine-Tuning in LLMs | paper code | DS | |
ICLR'24 | DataInf: Efficiently Estimating Data Influence in LoRA-tuned LLMs and Diffusion Models | paper code | DA | DataInf approximates the influence function by swapping the order of the matrix inversion and the averaging. |
ICLR'24 | What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning | paper | DS | |
ICLR'24 | Real-Fake: Effective Training Data Synthesis Through Distribution Matching | paper code | DS | |
ICLR'24 | InfoBatch: Lossless Training Speed Up by Unbiased Dynamic Data Pruning | paper code | DS | InfoBatch uses the training loss to prune well-learned samples in each epoch and estimates the gradient distribution for unbiased learning. |
arXiv'24 | A Decade's Battle on Dataset Bias: Are We There Yet? | paper code | DS | |
arXiv'24 | On the Cause of Unfairness: A Training Sample Perspective | paper | DA | The fairness influence can be computed by replacing a training sample with its concept-counterfactual sample. |
Venue | Paper | Links | Tags | TLDR |
---|---|---|---|---|
arXiv'23 | Accelerated Shapley Value Approximation for Data Evaluation | paper | DA | Not all coalition sizes need to be evaluated: small coalitions may introduce noise and large ones may contribute little. To estimate the effect of coalitions of size k, about O(1/k^2) sampled coalitions suffice. |
arXiv'23 | The Journey, Not the Destination: How Data Guides Diffusion Models | paper code | DA | - |
NIPS'23 | The Memory Perturbation Equation: Understanding Model’s Sensitivity to Data | paper code | DA DP | - |
NIPS'23 | Theoretical and Practical Perspectives on what Influence Functions Do | paper | DA | This work discusses several problematic assumptions behind influence functions; while most of them can be addressed, IF can only accurately predict the perturbed parameters for a limited number of time steps. |
NIPS'23 | Data Selection for Language Models via Importance Resampling | paper code | DS DA | It selects data matching a target distribution from raw data by reducing the KL divergence to the target relative to random selection. |
NIPS'23 | Model Shapley: Equitable Model Valuation with Black-box Access | paper code | DA | - |
NIPS'23 | Threshold KNN-Shapley: A Linear-Time and Privacy-Friendly Approach to Data Valuation | paper | DA | Extends KNN-Shapley while accounting for data privacy. |
NIPS'23 | GEX: A flexible method for approximating influence via Geometric Ensemble | paper code | DA | - |
NIPS'23 | Efficient Data Subset Selection to Generalize Training Across Models: Transductive and Inductive Networks | paper code | DS | - |
NIPS'23 | Data Pruning via Moving-one-Sample-out | paper | DS | This work proposes the MoSo score (similar to LOO) and approximates it using gradients over all training epochs. |
NIPS'23 | Towards Free Data Selection with General-Purpose Models | paper code | DS | - |
NIPS'23 | Towards Accelerated Model Training via Bayesian Data Selection | paper | DS | - |
NIPS'23 | Robust Data Valuation with Weighted Banzhaf Values | paper | DA | - |
NIPS'23 | UP-DP: Unsupervised Prompt Learning for Data Pre-Selection with Vision-Language Models | paper | DS | - |
NIPS'23 | Performance Scaling via Optimal Transport: Enabling Data Selection from Partially Revealed Sources | paper code | DS | Given publicly known pilot data from different data sources, it returns the optimal combination of data sources. |
NIPS'23 | Robust Data Pruning under Label Noise via Maximizing Re-labeling Accuracy | paper code | DS | - |
NIPS'23 | Spuriosity Rankings: Sorting Data to Measure and Mitigate Biases | paper | DS | - |
NIPS'23 | Core-sets for Fair and Diverse Data Summarization | paper code | DS DP | It selects a fixed-size coreset for different groups of data while preserving diversity. |
NIPS'23 | Retaining Beneficial Information from Detrimental Data for Neural Network Repair | paper | DS | - |
NIPS'23 | Expanding Small-Scale Datasets with Guided Imagination | paper code | DS | - |
NIPS'23 | Error Discovery By Clustering Influence Embeddings | paper code | DA | This work clusters influence embeddings (low-dimensional versions of the influence vectors of training samples) over all test samples to summarize prediction errors. |
NIPS'23 | HiBug: On Human-Interpretable Model Debug | paper code | DP DS | - |
NIPS'23 | Skill-it! A data-driven skills framework for understanding and training language models | paper code | DP DS | - |
ICML'23 | Discover and Cure: Concept-aware Mitigation of Spurious Correlation | paper code | DS DA | Discovers spurious correlations at the concept level and performs concept-based data augmentation to mitigate bias. |
ICML'23 | TRAK: Attributing Model Behavior at Scale | paper code | DA | TRAK first defines a Newton approximation to estimate LOO for logistic regression and then extends it to NNs (including CLIP and mT5) by viewing them as linear models acting on input gradients. |
ICML'23 | RGE: A Repulsive Graph Rectification for Node Classification via Influence | paper code | DA | RGE identifies a group of negative edges that are most harmful for GNNs; it iteratively selects negative edges by their individual influence, preferring distant edges first. |
ICML'23 | Data-OOB: Out-of-bag Estimate as a Simple and Efficient Data Value | paper code | DA | Data-OOB measures the average score when a datum (out-of-bag data) is not selected in the bootstrap dataset. |
ICML'23 | 2D-Shapley: A Framework for Fragmented Data Valuation | paper | DA | |
ICML'23 | Towards Sustainable Learning: Coresets for Data-efficient Deep Learning | paper code | DS | - |
ICML'23 Workshop | Training on Thin Air: Improve Image Classification with Generated Data | paper | DS | - |
ICML'23 Workshop | Dataset Interfaces: Diagnosing Model Failures Using Controllable Counterfactual Generation | paper code | DA DS | - |
VLDB'23 | Equitable Data Valuation Meets the Right to Be Forgotten in Model Markets | paper code | DA | - |
VLDB'23 | Computing Rule-Based Explanations by Leveraging Counterfactuals | paper code | DP | - |
VLDB'23 | Data Collection and Quality Challenges for Deep Learning | paper | DS DA | - |
SIGMOD'23 | GoodCore: Coreset Selection over Incomplete Data for Data-effective and Data-efficient Machine Learning | paper | DS | GoodCore selects a coreset that achieves a low expected gradient-approximation error over all possible worlds of the missing data. |
SIGMOD'23 | XInsight: eXplainable Data Analysis Through The Lens of Causality | paper | DP | - |
SIGMOD'23 | HybridPipe: Combining Human-generated and Machine-generated Pipelines for Data Preparation | paper code | DS DP | - |
SIGMOD'23 | Saga: A Scalable Framework for Optimizing Data Cleaning Pipelines for Machine Learning Applications | paper | | |
ACL'23 | Data Selection for Fine-tuning Large Language Models Using Transferred Shapley Values | paper | DA | |
arXiv'23 | Simfluence: Modeling the influence of individual training examples by simulating training runs | paper | DS DA | Trains a simulator that predicts the loss on $z_{test}$ after each step of a training run (a loss trajectory). |
ICLR'23 | Data Valuation Without Training of a Model | paper code | DA | It proposes a score that measures the gap in data complexity when a given data instance is removed from the full dataset. |
ICLR'23 | Distilling Model Failures as Directions in Latent Space | paper code | DS DP | - |
ICLR'23 | LAVA: Data Valuation without Pre-Specified Learning Algorithms | paper code | DA | LAVA uses a Wasserstein distance to estimate an upper bound on test performance and values a training sample by the distance's sensitivity to it. |
ICLR'23 | Concept-level Debugging of Part-Prototype Networks | paper code | DP | - |
ICLR'23 | Dataset Pruning: Reducing Training Data by Examining Generalization Influence | paper | DS DA | - |
ICLR'23 | Moderate Coreset: A Universal Method of Data Selection for Real-world Data-efficient Deep Learning | paper code | DS | - |
ICLR'23 | Learning to Estimate Shapley Values with Vision Transformers | paper code | DA | - |
ICLR'23 | Characterizing the Influence of Graph Elements | paper code | DA | Introduces influence functions for graphs, considering node- and edge-removal influence for the linear SGC model. |
ICDE'23 | Automatic Feasibility Study via Data Quality Analysis for ML: A Case-Study on Label Noise | paper code | DP | - |
ICDE'23 | Detection of Groups with Biased Representation in Ranking | paper | DA | - |
AAAI'23 | Fundamentals of Task-Agnostic Data Valuation | paper | DA | - |
AAAI'23 | Interpreting Unfairness in Graph Neural Networks via Training Node Attribution | paper code | DA | This work proposes a Probabilistic Distribution Disparity measure to define node-contributed model bias and uses gradient approximation to estimate node-level bias. |
WWW'23 | GIF: A General Graph Unlearning Strategy via Influence Function | paper code | DA | GIF extends the influence function to graph data by considering both the directly affected nodes and the influenced neighborhoods. |
AISTATS'23 | Data Banzhaf: A Robust Data Valuation Framework for Machine Learning | paper | DA | - |
arXiv'23 | Data-Juicer: A One-Stop Data Processing System for Large Language Models | paper code | DS DP | - |
arXiv'23 | Studying Large Language Model Generalization with Influence Functions | paper | DA | - |
TMLR'23 | Synthetic Data from Diffusion Models Improves ImageNet Classification | paper | DS | - |
Venue | Paper | Links | Tags | TLDR |
---|---|---|---|---|
NIPS'22 | CS-SHAPLEY: Class-wise Shapley Values for Data Valuation in Classification | paper code | DA | |
NIPS'22 | Beyond neural scaling laws: beating power law scaling via data pruning | paper | DS | |
NIPS'22 | Quality Not Quantity: On the Interaction between Dataset Design and Robustness of CLIP | paper code | DS | |
NIPS'22 | Quantifying memorization across neural language models | paper | DA | |
ICML'22 | Measuring the Effect of Training Data on Deep Learning Predictions via Randomized Experiments | paper | DA | It proposes the AME score $E_S[U(S\cup \{z\})-U(S)]$, where $S$ is a random set; the AME score can be approximated by a LASSO model. |
ICML'22 | Meaningfully Debugging Model Mistakes using Conceptual Counterfactual Explanations | paper code | DS DP | It learns CAVs and moves misclassified training samples in the direction of the CAV. |
ICML'22 | Datamodels: Predicting Predictions from Training Data | paper code | DA | Datamodels learns a linear model that predicts the model output on a test point, taking the one-hot mask of training samples as input. |
ICML'22 | Prioritized Training on Points that are learnable, Worth Learning, and Not Yet Learnt | paper code | DS | |
ICML'22 | Achieving Fairness at No Utility Cost via Data Reweighing with Influence | paper code | DA | It employs DP and EOP to compute influence functions and performs soft reweighing on training samples; a proof of no utility degradation is provided. |
ICML'22 | DAVINZ: Data Valuation using Deep Neural Networks at Initialization | paper | DA | It uses an NTK-based bound to approximate validation performance without training. |
ICML'22 | Understanding Instance-Level Impact of Fairness Constraint | paper code | DA | IF = IF of the loss + IF of the fairness constraint; it considers several constraints (DP, EOP, covariance, information, etc.) and uses the NTK to estimate IF. |
ICSE'22 | Training data debugging for the fairness of machine learning software | paper code | DS | |
ICLR'22 | Domino: Discovering systematic errors with cross-modal embeddings | paper code | DA DP | |
ICLR'22 | Improving Cooperative Game Theory-based Data Valuation via Data Utility Learning | paper | DA | |
VLDB'22 | Toward Interpretable and Actionable Data Analysis with Explanations and Causality | paper | DP | |
SIGMOD'22 | Complaint-Driven Training Data Debugging at Interactive Speeds | paper | DA | |
SIGMOD'22 | Interpretable Data-Based Explanations for Fairness Debugging | paper video | DA DP | |
ACL'22 | Deduplicating training data makes language models better | paper code | DS | |
AAAI'22 | Scaling Up Influence Functions | paper code | DA | |
AAAI'22 | Incentivizing collaboration in machine learning via synthetic data rewards | paper | DA | |
AISTATS'22 | Beta Shapley: a Unified and Noise-reduced Data Valuation Framework for Machine Learning | paper code | DA | |
Venue | Paper | Links | Tags | TLDR |
---|---|---|---|---|
NIPS'21 | Explaining Latent Representations with a Corpus of Examples | paper code | DA | |
NIPS'21 | Validation free and replication robust volume-based data valuation | paper code | DA | |
NIPS'21 | Deep Learning on a Data Diet: Finding Important Examples Early in Training | paper | DS | |
NIPS'21 | Interactive Label Cleaning with Example-based Explanations | paper code | DP | |
ICML'21 | GRAD-MATCH: Gradient Matching based Data Subset Selection for Efficient Deep Model Training | paper code | DS | |
CVPR'21 | Scalability vs. Utility: Do We Have to Sacrifice One for the Other in Data Importance Quantification? | paper code | DA | |
CHI'21 | Data-Centric Explanations: Explaining Training Data of Machine Learning Systems to Promote Transparency | paper | DP | |
NIPS'20 | Multi-Stage Influence Function | paper | DA | |
NIPS'20 | Estimating Training Data Influence by Tracing Gradient Descent | paper code | DA | TracIn measures the influence of training samples (in batches) during training by estimating the test-loss change across earlier epochs. |
ICML'20 | On second-order group influence functions for black-box predictions | paper | DA | The influence score of a group = the sum of the individual influences of its samples + the cross-dependencies among samples in the group. |
ICML'20 | Coresets for data-efficient training of machine learning models | paper code | DS | |
ICML'20 | Optimizing Data Usage via Differentiable Rewards | paper | DS | |
ICML'20 | Data Valuation using Reinforcement Learning | paper code | DA | DVRL employs a learnable NN as a data value estimator to select data samples during training and uses an RL signal to update it. |
ICML'20 | Collaborative Machine Learning with Incentive-Aware Model Rewards | paper | DA | |
ICLR'20 | Selection via proxy: Efficient data selection for deep learning | paper code | DS | |
SIGMOD'20 | Complaint Driven Training Data Debugging for Query 2.0 | paper video | DA | |
PMLR'20 | Identifying Statistical Bias in Dataset Replication | paper code | | |
NIPS'19 | Data Cleansing for Models Trained with SGD | paper code | DA | The proposed SGD-Influence scales influence estimation to SGD-based NNs without the convexity and optimality assumptions. |
ICML'19 | Data Shapley: Equitable Valuation of Data for Machine Learning | paper code | DA | |
VLDB'19 | Efficient task-specific data valuation for nearest neighbor algorithms | paper | DA | |
AISTATS'19 | Towards Efficient Data Valuation Based on the Shapley Value | paper | DA | |
ICML'17 | Understanding Black-box Predictions via Influence Functions | paper code | DA | |
Venue | Paper | Links | Tags |
---|---|---|---|
arXiv'24 | A Survey on Data Selection for Language Models | paper | DS |
Nature Machine Intelligence'22 | Advances, challenges and opportunities in creating data for trustworthy AI | paper | DS DA |
arXiv'23 | Data-centric Artificial Intelligence: A Survey | paper | DS DA DP |
arXiv'23 | Data Management For Large Language Models: A Survey | paper code | DS DA |
arXiv'23 | Training Data Influence Analysis and Estimation: A Survey | paper code | DA |
TKDE'22 | Data Management for Machine Learning: A Survey | paper | DS DA |
IJCAI'21 | Data Valuation in Machine Learning: "Ingredients", Strategies, and Open Challenges | paper | DA |
TACL'21 | Explanation-Based Human Debugging of NLP Models: A Survey | paper | DP DA |
Venue | Paper | Links | Tags |
---|---|---|---|
NIPS'23 | DataPerf: Benchmarks for Data-Centric AI Development | paper code website | DS DA DP |
NIPS'23 | OpenDataVal: a Unified Benchmark for Data Valuation | paper code | DA |
NIPS'23 | Improving multimodal datasets with image captioning | paper code | DS |
NIPS'23 | Large Language Model as Attributed Training Data Generator: A Tale of Diversity and Bias | paper code | DS |
DEEM'22 | dcbench: A Benchmark for Data-Centric AI Systems | paper code | DS |
- [ICML'23] DMLR Workshop: Data-centric Machine Learning Research video DMLR Website
- [NIPS'23] Tutorial: Data Contribution Estimation for Machine Learning Website
- More papers about Data Valuation (DA) can be found in awesome-data-valuation.
- More papers about Data Pruning (DS) can be found in Awesome-Coreset-Selection.
[1] Gupta, Nitin, et al. "Data quality for machine learning tasks." Proceedings of the 27th ACM SIGKDD conference on knowledge discovery & data mining. 2021.
[2] Liang, Weixin, et al. "Advances, challenges and opportunities in creating data for trustworthy AI." Nature Machine Intelligence 4.8 (2022): 669-677.