In general, the dataset is split into three parts: training, validation, and test. For each partition, we need a feature matrix (X) and a vector of targets (y). First, the size of each partition is calculated. Next, the record indices are shuffled so that each of the three partitions contains non-sequential records from the dataset. Finally, the partitions are created using the shuffled indices.
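The steps above can be sketched as follows. This is a minimal illustration rather than the project's actual code: the toy DataFrame, the column names `feature` and `target`, the 60/20/20 proportions, and the seed value are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset; the column names are placeholders.
df = pd.DataFrame({
    "feature": np.arange(10, dtype=float),
    "target": np.arange(10, dtype=float) * 2.0,
})

# 1. Compute the partition sizes (a 60/20/20 split is assumed here).
n = len(df)
n_val = int(n * 0.2)
n_test = int(n * 0.2)
n_train = n - n_val - n_test  # the remainder goes to training

# 2. Shuffle the row positions, with a fixed seed for reproducibility.
idx = np.arange(n)
np.random.seed(2)
np.random.shuffle(idx)  # shuffles idx in place

# 3. Build the partitions from the shuffled positions.
df_train = df.iloc[idx[:n_train]].reset_index(drop=True)
df_val = df.iloc[idx[n_train:n_train + n_val]].reset_index(drop=True)
df_test = df.iloc[idx[n_train + n_val:]].reset_index(drop=True)

# 4. Separate the target vectors (y) from the feature matrices (X).
y_train = df_train["target"].values
y_val = df_val["target"].values
y_test = df_test["target"].values

del df_train["target"]
del df_val["target"]
del df_test["target"]
```

Computing `n_train` as the remainder means that rounding in `n_val` and `n_test` never leaves a record unassigned.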
Pandas attributes and methods:
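df.iloc[] -> return a subset of the records of a dataframe, selected by numerical (positional) indices
df.reset_index() -> reset the index to the default 0..n-1 integer index (pass drop=True to discard the old index instead of keeping it as a column)
del df[col] -> remove a column from the dataframe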
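As a small illustration of the three pandas operations listed above, the snippet below applies them to a made-up dataframe. The column names `a` and `b` are placeholders chosen for the example.

```python
import pandas as pd

# Hypothetical dataframe used only for illustration.
df = pd.DataFrame({"a": [10, 20, 30, 40], "b": [1, 2, 3, 4]})

# Select the rows at positions 2 and 0, in that order.
subset = df.iloc[[2, 0]]
print(subset.index.tolist())    # [2, 0]  (the original labels are kept)

# Replace the old labels with a fresh 0..n-1 index.
subset = subset.reset_index(drop=True)
print(subset.index.tolist())    # [0, 1]

# Remove column "b" from the subset.
del subset["b"]
print(subset.columns.tolist())  # ['a']
```

Without `drop=True`, `reset_index()` would keep the old labels as an extra column named `index`.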
Numpy methods:
np.arange() -> return an array of evenly spaced numbers
np.random.shuffle() -> shuffle an array in place (it returns None)
np.random.seed() -> set a seed for reproducibility
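The following sketch shows how these NumPy functions behave together. Note that `np.random.shuffle()` modifies its argument in place and returns `None`. The array length and the seed value 42 are arbitrary choices for the example.

```python
import numpy as np

idx = np.arange(5)                 # array([0, 1, 2, 3, 4])

np.random.seed(42)
result = np.random.shuffle(idx)    # idx is reordered in place
print(result)                      # None
first = idx.copy()

# Resetting the same seed reproduces the same permutation.
idx2 = np.arange(5)
np.random.seed(42)
np.random.shuffle(idx2)
print(np.array_equal(first, idx2)) # True
```

Because the shuffle happens in place, writing `idx = np.random.shuffle(idx)` would overwrite the array with `None`.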
The entire code of this project is available in this Jupyter notebook.
The notes are written by the community. If you see an error here, please create a PR with a fix.