Effect

Overesteemed results.

"Baseline" exploiting leakerage predicts test by metadata achieving good score without modeling and generalizing anything.

Symptom

The distribution of the target varies significantly across metadata

Rules to avoid leakage

Mutually exclusive options are

Stratify data so that metada equally distributed among the target variable.
Cut the metadata in an irretrievable manner, providing only those features that are valid to be used during inference in production.

Anti-pattern: restrict usage of all the metadata. The modelers always highlights that height and width are considered as metadata for images and it always is used during train.

Incorporation stage

Ground truth gathering. Dataset preparation.

Locate in code

Saving train dataset

Was met or loosely based on

kaggle "Deepfake Detection Challenge" competition zaharch post

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

metadata_predicts_target.md

metadata_predicts_target.md

Effect

Symptom

Rules to avoid leakage

Incorporation stage

Locate in code

Was met or loosely based on

Files

metadata_predicts_target.md

Latest commit

History

metadata_predicts_target.md

File metadata and controls

Effect

Symptom

Rules to avoid leakage

Incorporation stage

Locate in code

Was met or loosely based on