This project is about data leaks in Machine Learning/Data Science (ML/DS) systems that occur due to errors in experiment design, data preparation, or data modeling, and that can affect the final predictive system by lowering its generalization capability and/or skewing performance estimates. Simply put, such leaks can lead to inflated performance metrics for models that actually have lower predictive power when deployed.
Out of scope:
- SecurityOps: breaches and raw data exposure, e.g. unsecured accounts
- Membership, reference, and population inference attacks
- ML-competition-specific metric probing/abuse and platform-related abuse
Adversarial prompt attacks, however, are in scope (case 13 below).
The goal of this project is to provide practitioners with a comprehensive table of potential data leaks in ML/DS systems, along with best practices for avoiding them, quick-fix examples, and, where possible, tests to check whether data is affected by such leaks. The cases are sourced from both competitions and practical scenarios, with links to discussions or sources provided where possible. Material about errors in DS research papers can be found in "Leakage and the reproducibility crisis in machine-learning-based science" by Sayash Kapoor and Arvind Narayanan; it concentrates more on taxonomy, without technical details like the SIFTs in this repository.
./src/leakage_tests/ contains Python functions with assert statements, free of bindings to any particular test library. They are meant to be adapted to your own case.
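For illustration, a minimal sketch of what such a function could look like (hypothetical names, not taken verbatim from the repository), here checking for identical rows shared between train and test (case 9 below):

```python
# Hypothetical leakage test: plain Python + assert, no test-framework dependency.
import pandas as pd

def assert_no_train_test_row_overlap(train: pd.DataFrame, test: pd.DataFrame) -> None:
    """Fail if identical rows appear in both train and test (see case 9 below)."""
    # An inner merge on all shared columns finds rows that are exact duplicates across splits.
    shared = train.merge(test, how="inner")
    assert shared.empty, f"{len(shared)} identical rows shared between train and test"

# Adapt to your own case, e.g.:
# assert_no_train_test_row_overlap(pd.read_csv("train.csv"), pd.read_csv("test.csv"))
```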
Some of the ./cases/*.md files contain examples of replacing particular lines to eliminate the data leak.
id | name and detail link | effect | symptom | stage | locate in code | met in or loosely based on |
---|---|---|---|---|---|---|
1 | Restorable videos in train but only frames in prod | Overestimated results | - | ground truth gathering, dataset preparation | Cropping into frames | kaggle "State Farm Distracted Driver Detection" competition JACOBKIE solution |
2 | Records about the same object in train and test (sketch below) | Overestimated results | Observations about the same object appear in different splits, e.g. samples with the same group-id are present in at least two of [train, val, test] | dataset preparation, modeling | Separation into validation sets | kaggle "TalkingData Mobile User Demographics" Laurae comment |
3 | id is sorted by target or something else not revealed in production | Ranking of predictions can be exploited using information from ids | - | dataset preparation | Dataset saving | |
4 | fit_transform on the whole dataset instead of train only (sketch below) | Overestimated results | - | modeling | Test transform | |
5 | Time availability of a feature is not actually satisfied (sketch below) | Non-adequate predictions | The feature is obviously available later than the moment it is referenced in the dataset | dataset preparation | Feature aggregation and assignment to the time axis | |
6 | Taking information from the future during modeling (sketch below) | Overestimated results | - | modeling | Separation into validation sets | |
7 | Test intersects train, resolvable by search in feature space | Overestimated results | Dataset looks as if already augmented, containing many versions of the same items, e.g. pictures or audio pieces | ground truth gathering, dataset preparation | Choice of which image/audio/etc. pieces to include in train and final test | kaggle "Airbus ship detection" competition ANDRÉS MIGUEL TORRUBIA SÁEZ post |
8 | Target can be predicted from metadata | Overestimated results | The distribution of the target varies significantly across metadata | ground truth gathering, dataset preparation | Train/test split | kaggle "Deepfake Detection Challenge" competition zaharch post |
9 | Test intersects train | Overestimated results | Identical rows between test and train | dataset preparation | Train/test split and/or duplicate check | kaggle "Arxiv Title Generation" competition YURY KASHNITSKY post |
10 | Recoverable/restorable/de-anonymizable features or objects when it's not intended | Possible exposure of private data / no such data field in production | - | dataset preparation | Anonymization, encoding | kaggle "Optiver Realized Volatility Prediction" competition nyanpn comment |
11 | Evaluation intersects test, e.g. early stopping on test (sketch below) | Overestimated results | Test set is used for more than only the final estimation of model performance | modeling | Fit/train code | stackoverflow "LightGBM eval question" paperskilltrees comment |
12 | OHE 1-target | No generalization | 100% on train and errors on new data | modeling | Check train/fit code | datacamp "Predicting Credit Card Approvals" project |
13 | Adversarial prompt attacks | Overestimated score | Cosine similarity is used e.g. for prompt recovery quality estimation | ML task setting: metric choice for model scoring | - | kaggle "LLM Prompt Recovery" competition KHOI NGUYEN solution |
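Case 2 sketch: a group-aware split that keeps every record of the same object on one side of the split only. This assumes a pandas DataFrame with a hypothetical `group_id` column; it is an illustration, not code from the repository.

```python
from sklearn.model_selection import GroupShuffleSplit

def group_aware_split(df, group_col="group_id", test_size=0.2, seed=0):
    splitter = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=seed)
    train_idx, test_idx = next(splitter.split(df, groups=df[group_col]))
    train, test = df.iloc[train_idx], df.iloc[test_idx]
    # Leak check: no group may appear on both sides of the split.
    assert set(train[group_col]).isdisjoint(set(test[group_col]))
    return train, test
```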
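Case 4 sketch: fit the transformer on the training split only and reuse its statistics on test. `StandardScaler` is just one example of an affected transformer.

```python
from sklearn.preprocessing import StandardScaler

def scale_without_leak(X_train, X_test):
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)  # statistics estimated on train only
    X_test_scaled = scaler.transform(X_test)        # reuse train statistics, no refit
    return X_train_scaled, X_test_scaled

# Leaky variant for contrast: scaler.fit_transform(X_all) before splitting lets
# test-set statistics influence the training features.
```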
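Cases 5-6 sketch: split along the time axis so that training data never comes from the future of the validation period. The column name and cutoff date are assumptions for illustration.

```python
import pandas as pd

def time_based_split(df: pd.DataFrame, time_col: str = "timestamp", cutoff: str = "2023-01-01"):
    cutoff_ts = pd.Timestamp(cutoff)
    train = df[df[time_col] < cutoff_ts]
    valid = df[df[time_col] >= cutoff_ts]
    # Leak check: the newest training record must precede the oldest validation record.
    assert train[time_col].max() < valid[time_col].min()
    return train, valid
```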
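Case 11 sketch: early stopping should watch a validation set carved out of the training data, never the final test set. LightGBM is shown as one option; the callback API is assumed for recent versions.

```python
from sklearn.model_selection import train_test_split
import lightgbm as lgb

def fit_with_early_stopping(X, y):
    # Carve a validation set out of the training data; the test set stays untouched
    # until the single final evaluation.
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)
    model = lgb.LGBMRegressor(n_estimators=1000)
    model.fit(
        X_tr, y_tr,
        eval_set=[(X_val, y_val)],                        # not the final test set
        callbacks=[lgb.early_stopping(stopping_rounds=50)],
    )
    return model
```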
This project is currently unsponsored and not affiliated with any institution. For inquiries about incorporating it into your program or publication, please contact grigoriy.tarasov.u@gmail.com.