This project is about data leaks in Machine Learning/Data Science (ML/DS) systems that occur due to errors in experiment design, data preparation, or data modeling, and that can affect the final predictive system by lowering its generalization capability and/or skewing performance estimates. Simply put, such leaks can lead to inflated performance metrics for models that actually have lower predictive power when deployed.
Out of scope:
- SecurityOps: breaches and raw data exposure, e.g. unsecured accounts
- Membership, reference, and population inference attacks
- ML-competition-specific metric probing/abuse and platform-related abuse
Adversarial prompt attacks, however, are in scope (case 13 below).
The goal of this project is to provide practitioners with a comprehensive table of potential data leaks in ML/DS systems, along with best practices for avoiding them, quick-fix examples, and, where possible, tests to check whether data is affected by such leaks. The cases are sourced from both competitions and practical scenarios, with links to discussions or sources provided where possible. Material about errors in DS research papers can be found in "Leakage and the reproducibility crisis in machine-learning-based science" by Sayash Kapoor and Arvind Narayanan; it concentrates more on taxonomy, without technical details like the SIFTs in this repository.
./src/leakage_tests/ contains Python functions with assert statements, free of bindings to any particular test library. They are meant to be adapted to your own case.
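For illustration, a minimal sketch of what such a function could look like (hypothetical names, not taken verbatim from the repository), here checking for identical rows shared between train and test (case 9 below):

```python
# Hypothetical leakage test: plain Python + assert, no test-framework dependency.
import pandas as pd

def assert_no_train_test_row_overlap(train: pd.DataFrame, test: pd.DataFrame) -> None:
    """Fail if identical rows appear in both train and test (see case 9 below)."""
    # An inner merge on all shared columns finds rows that are exact duplicates across splits.
    shared = train.merge(test, how="inner")
    assert shared.empty, f"{len(shared)} identical rows shared between train and test"

# Adapt to your own case, e.g.:
# assert_no_train_test_row_overlap(pd.read_csv("train.csv"), pd.read_csv("test.csv"))
```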
Some of the ./cases/*.md files contain examples of replacing particular lines to eliminate the data leak.
id | name and detail link | effect | symptom | stage | locate in code | met in or loosely based on |
---|---|---|---|---|---|---|
1 | Restorable videos in train but only frames in prod | Overestimated results | - | ground truth gathering, dataset preparation | Cropping into frames | kaggle "State Farm Distracted Driver Detection" competition JACOBKIE solution |
2 | Records about the same object in train and test (sketch below) | Overestimated results | Observations about the same object appear in different splits, e.g. samples with the same group-id are present in at least two of [train, val, test] | dataset preparation, modeling | Separation into validation sets | kaggle "TalkingData Mobile User Demographics" Laurae comment |
3 | id is sorted by target or something else not revealed in production | Ranking of predictions can be exploited using information from ids | - | dataset preparation | Dataset saving | |
4 | fit_transform on the whole dataset instead of train only (sketch below) | Overestimated results | - | modeling | Test transform | |
5 | Time availability of a feature is not actually satisfied (sketch below) | Non-adequate predictions | The feature is obviously available later than the moment it is referenced in the dataset | dataset preparation | Feature aggregation and assignment to the time axis | |
6 | Taking information from the future during modeling (sketch below) | Overestimated results | - | modeling | Separation into validation sets | |
7 | Test intersects train, resolvable by search in feature space | Overestimated results | Dataset looks as if already augmented, containing many versions of the same items, e.g. pictures or audio pieces | ground truth gathering, dataset preparation | Choice of which image/audio/etc. pieces to include in train and final test | kaggle "Airbus ship detection" competition ANDRÉS MIGUEL TORRUBIA SÁEZ post |
8 | Target can be predicted from metadata | Overestimated results | The distribution of the target varies significantly across metadata | ground truth gathering, dataset preparation | Train/test split | kaggle "Deepfake Detection Challenge" competition zaharch post |
9 | Test intersects train | Overestimated results | Identical rows between test and train | dataset preparation | Train/test split and/or duplicate check | kaggle "Arxiv Title Generation" competition YURY KASHNITSKY post |
10 | Recoverable/restorable/de-anonymizable features or objects when it's not intended | Possible exposure of private data / no such data field in production | - | dataset preparation | Anonymization, encoding | kaggle "Optiver Realized Volatility Prediction" competition nyanpn comment |
11 | Evaluation intersects test, e.g. early stopping on test (sketch below) | Overestimated results | Test set is used for more than only the final estimation of model performance | modeling | Fit/train code | stackoverflow "LightGBM eval question" paperskilltrees comment |
12 | OHE 1-target | No generalization | 100% on train and errors on new data | modeling | Check train/fit code | datacamp "Predicting Credit Card Approvals" project |
13 | Adversarial prompt attacks | Overestimated score | Cosine similarity is used e.g. for prompt recovery quality estimation | ML task setting: metric choice for model scoring | - | kaggle "LLM Prompt Recovery" competition KHOI NGUYEN solution |
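Case 2 sketch: a group-aware split that keeps every record of the same object on one side of the split only. This assumes a pandas DataFrame with a hypothetical `group_id` column; it is an illustration, not code from the repository.

```python
from sklearn.model_selection import GroupShuffleSplit

def group_aware_split(df, group_col="group_id", test_size=0.2, seed=0):
    splitter = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=seed)
    train_idx, test_idx = next(splitter.split(df, groups=df[group_col]))
    train, test = df.iloc[train_idx], df.iloc[test_idx]
    # Leak check: no group may appear on both sides of the split.
    assert set(train[group_col]).isdisjoint(set(test[group_col]))
    return train, test
```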
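Case 4 sketch: fit the transformer on the training split only and reuse its statistics on test. `StandardScaler` is just one example of an affected transformer.

```python
from sklearn.preprocessing import StandardScaler

def scale_without_leak(X_train, X_test):
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)  # statistics estimated on train only
    X_test_scaled = scaler.transform(X_test)        # reuse train statistics, no refit
    return X_train_scaled, X_test_scaled

# Leaky variant for contrast: scaler.fit_transform(X_all) before splitting lets
# test-set statistics influence the training features.
```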
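Cases 5-6 sketch: split along the time axis so that training data never comes from the future of the validation period. The column name and cutoff date are assumptions for illustration.

```python
import pandas as pd

def time_based_split(df: pd.DataFrame, time_col: str = "timestamp", cutoff: str = "2023-01-01"):
    cutoff_ts = pd.Timestamp(cutoff)
    train = df[df[time_col] < cutoff_ts]
    valid = df[df[time_col] >= cutoff_ts]
    # Leak check: the newest training record must precede the oldest validation record.
    assert train[time_col].max() < valid[time_col].min()
    return train, valid
```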
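Case 11 sketch: early stopping should watch a validation set carved out of the training data, never the final test set. LightGBM is shown as one option; the callback API is assumed for recent versions.

```python
from sklearn.model_selection import train_test_split
import lightgbm as lgb

def fit_with_early_stopping(X, y):
    # Carve a validation set out of the training data; the test set stays untouched
    # until the single final evaluation.
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)
    model = lgb.LGBMRegressor(n_estimators=1000)
    model.fit(
        X_tr, y_tr,
        eval_set=[(X_val, y_val)],                        # not the final test set
        callbacks=[lgb.early_stopping(stopping_rounds=50)],
    )
    return model
```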
This project is currently unsponsored and not affiliated with any institution. For inquiries about incorporating it into your program or publication, please contact grigoriy.tarasov.u@gmail.com.