2.4 Setting up the validation framework

Notes

In general, the dataset is split into three parts: training, validation, and test. For each partition, we need to obtain a feature matrix (X) and a vector of targets (y). First, the size of each partition is calculated. Next, the record indices are shuffled so that each partition contains a random, non-sequential sample of records rather than a contiguous block of the original dataset. Finally, the partitions are created by selecting records with the shuffled indices.
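The steps above can be sketched as follows. This is a minimal illustration, not the course notebook's exact code: the toy dataframe, the column names `feature` and `target`, the 60/20/20 proportions, and the seed value are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset; the column names are illustrative only.
df = pd.DataFrame({
    "feature": np.arange(10, dtype=float),
    "target": np.arange(10, dtype=float) * 2.0,
})

# 1. Partition sizes (assumed here: 20% validation, 20% test, rest training).
n = len(df)
n_val = int(n * 0.2)
n_test = int(n * 0.2)
n_train = n - n_val - n_test

# 2. Shuffle the row positions; the seed makes the split reproducible.
idx = np.arange(n)
np.random.seed(2)
np.random.shuffle(idx)

# 3. Slice the shuffled positions into three non-overlapping partitions.
df_train = df.iloc[idx[:n_train]].reset_index(drop=True)
df_val = df.iloc[idx[n_train:n_train + n_val]].reset_index(drop=True)
df_test = df.iloc[idx[n_train + n_val:]].reset_index(drop=True)

# 4. Extract the target vectors (y), then drop the target column so the
#    remaining columns form the feature matrices (X).
y_train = df_train["target"].values
y_val = df_val["target"].values
y_test = df_test["target"].values

for part in (df_train, df_val, df_test):
    del part["target"]
```

Computing `n_train` as the remainder, rather than as `int(n * 0.6)`, guarantees that the three sizes add up to `n` even when rounding would otherwise lose a record.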

Pandas attributes and methods:

df.iloc[] -> select a subset of records from a dataframe by integer position
df.reset_index() -> reset the index to the default 0..n-1 range; pass drop=True to discard the old index instead of keeping it as a column
del df[col] -> delete a column from the dataframe
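A short sketch of how these three operations behave, using a hypothetical three-row dataframe (the column names `a` and `b` are made up for the example):

```python
import pandas as pd

# Hypothetical dataframe for illustration only.
df = pd.DataFrame({"a": [10, 20, 30], "b": ["x", "y", "z"]})

# iloc selects rows by integer position; the original index labels are kept.
subset = df.iloc[[2, 0]]
print(subset.index.tolist())    # [2, 0]

# reset_index(drop=True) renumbers the rows from 0 and discards the old labels.
# Without drop=True, the old labels are kept as a new column named "index".
subset = subset.reset_index(drop=True)
print(subset.index.tolist())    # [0, 1]

# del removes a column from the dataframe in place.
del subset["b"]
print(subset.columns.tolist())  # ['a']
```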

Numpy methods:

np.arange() -> return an array of evenly spaced numbers, e.g. np.arange(5) gives 0, 1, 2, 3, 4
np.random.shuffle() -> shuffle an array in place (the function itself returns None)
np.random.seed() -> set a seed for reproducibility
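The following sketch shows the in-place behaviour of `np.random.shuffle()` and the effect of seeding. The array length and the seed value 42 are arbitrary choices for the example.

```python
import numpy as np

idx = np.arange(5)  # the integers 0 through 4

# shuffle reorders idx in place; its return value is None,
# so writing `idx = np.random.shuffle(idx)` would lose the array.
np.random.seed(42)
result = np.random.shuffle(idx)
print(result)  # None

# Setting the same seed before shuffling a fresh array
# reproduces the same order.
idx_again = np.arange(5)
np.random.seed(42)
np.random.shuffle(idx_again)
print((idx == idx_again).all())  # True
```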

The entire code of this project is available in this Jupyter notebook.

⚠️	The notes are written by the community. If you see an error here, please create a PR with a fix.

Notes from Peter Ernicke

Navigation

Machine Learning Zoomcamp course
Session 2: Machine Learning for Regression
Previous: Exploratory data analysis
Next: Linear regression
