Overesteemed results.
"Baseline" exploiting leakerage predicts test by metadata achieving good score without modeling and generalizing anything.
The distribution of the target varies significantly across metadata
Mutually exclusive options are
- Stratify data so that metada equally distributed among the target variable.
- Cut the metadata in an irretrievable manner, providing only those features that are valid to be used during inference in production.
Anti-pattern: restrict usage of all the metadata. The modelers always highlights that height and width are considered as metadata for images and it always is used during train.
Ground truth gathering. Dataset preparation.
Saving train dataset
kaggle "Deepfake Detection Challenge" competition zaharch post