---
marp: true
math: true
---
- Get familiar with the general workflow of a (supervised) Machine Learning project.
- Understand each step of this process, from problem definition to model deployment.
- Discover how to train a Machine Learning model on tabular data.
You may test the trained model here.
- Frame the problem.
- Collect, analyze and prepare data.
- Select and train several models on data.
- Tune the most promising model.
- Deploy the model to production and monitor it.
- What is the business objective?
- How good are the current solutions?
- What data is available?
- Is the problem a good fit for ML?
- What is the expected learning type (supervised or not, batch/online...)?
- How will the model's performance be evaluated?
- Difficulty expressing the actions as rules.
- Data too complex for traditional analytical methods.
- High number of features.
- Highly correlated data (data with similar or closely related values).
- Performance > interpretability.
- Data quality is paramount.
- Regression task.
- Inputs: housing properties (number of rooms, median income, etc).
- Output: housing prices.
- Real data is messy, incomplete and often scattered across many sources.
- Data labeling is a manual and tedious process.
- Predefined datasets offer a convenient way to bypass the data wrangling step. Alas, using one is not always an option.
Many datasets containing tabular information are stored as a CSV (Comma-Separated Values) text file.
An example dataset containing housing data and prices is available here.
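As a minimal sketch, such a file can be loaded with pandas (the file name `housing.csv` is an assumption for illustration):

```python
import pandas as pd

# Load the housing dataset from a CSV file
# (the file name "housing.csv" is illustrative)
df = pd.read_csv("housing.csv")

# Display the first rows and a summary of columns and types
print(df.head())
df.info()
```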
The objective here is to gain insights about the data, in order to prepare it optimally for training. This might involve:
- plotting histograms of values.
- computing statistical metrics like value distributions or correlations between features.
- ...
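Continuing the sketch above (assuming the dataset is loaded in a pandas DataFrame named `df`), these insights could be obtained as follows:

```python
import matplotlib.pyplot as plt

# Statistical summary: count, mean, std, min/max and quartiles per feature
print(df.describe())

# Histograms of all numerical features
df.hist(bins=50, figsize=(12, 8))
plt.show()

# Pairwise correlations between numerical features
print(df.corr(numeric_only=True))
```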
Once trained, an ML model must be able to generalize (perform well with new data). In order to assess this ability, data is always split into two or three sets before training:
- Training set (typically 80% or more): fed to the model during training.
- Test set: used to check the final model's performance on unseen data.
- Validation set: used to tune the model without biasing it in favor of the test set.
During dataset splitting, inputs (features given to the model) have to be separated from targets (values it must predict).
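A minimal sketch of both operations with scikit-learn, assuming the target column is named `median_house_value` (an illustrative name):

```python
from sklearn.model_selection import train_test_split

# Separate inputs (features) from targets
# ("median_house_value" is an assumed column name)
x = df.drop("median_house_value", axis=1)
y = df["median_house_value"]

# Hold out 20% of the samples as a test set
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42
)
```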
In Machine Learning, the chosen dataset has to be carefully prepared before using it to train a model. This can have a major impact on the outcome of the training process.
This important task, sometimes called data preprocessing, might involve:
- Removing superfluous features (if any).
- Handling missing values.
- Scaling data.
- Transforming values into numeric form.
- Augmenting data with artificially generated samples.
- Engineering new features.
Most ML algorithms cannot work with missing values in features.
Depending on the percentage of missing data, three options exist:
- remove the corresponding data samples;
- remove the whole feature(s);
- replace the missing values (using 0, the mean, the median or something more meaningful in the context).
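As a sketch, the third option can be implemented with scikit-learn's `SimpleImputer`, continuing the example above (here restricted to numerical features):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Replace missing values with the feature median,
# learned on the training set only
imputer = SimpleImputer(strategy="median")
x_train_num = imputer.fit_transform(x_train.select_dtypes(include=[np.number]))
x_test_num = imputer.transform(x_test.select_dtypes(include=[np.number]))
```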
Most ML algorithms work best when all features have a similar scale. Several solutions exist:
- Min-Max scaling: features are shifted and rescaled to the $[0,1]$ range by subtracting the $min$ value and dividing by $(max-min)$ on the first axis.
- Standardization: features are centered (subtracted by their mean) then reduced (divided by their standard deviation) on the first axis. All resulting features have a mean of 0 and a standard deviation of 1.
In order to avoid information leakage, the test set must be scaled with metrics (means, categories, etc) computed on the training set (explanation 1, explanation 2, explanation 3).
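A minimal sketch of this principle with scikit-learn, continuing the example above:

```python
from sklearn.preprocessing import StandardScaler

# Standardization: mean and standard deviation are computed on the
# training set, then reused as-is to transform the test set
scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train_num)
x_test_scaled = scaler.transform(x_test_num)  # transform only: no leakage

# MinMaxScaler would be used in exactly the same way for Min-Max scaling
```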
Some features or targets may come as discrete rather than continuous values. Moreover, these discrete values might be strings. However, ML models can only handle numerical data.
A solution is to apply one-of-K encoding, also named dummy encoding or one-hot encoding. Each categorical feature with $K$ possible values is transformed into a vector of $K$ binary features, with one of them 1 and all others 0.
Using arbitrary integer values rather than binary vectors would create a proximity relationship between the new features, which could confuse the model during training.
Depending on value distribution between training and test sets, some categories might appear only in one set.
The best solution is to one-hot encode based on the training set categories, ignoring test-only categories.
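With scikit-learn, this behavior can be obtained through the `handle_unknown` parameter of `OneHotEncoder` (assuming scikit-learn ≥ 1.2; the column name `ocean_proximity` is illustrative):

```python
from sklearn.preprocessing import OneHotEncoder

# Learn the categories on the training set; categories that only
# appear in the test set are encoded as all-zero vectors
encoder = OneHotEncoder(handle_unknown="ignore", sparse_output=False)
cat_train = encoder.fit_transform(x_train[["ocean_proximity"]])
cat_test = encoder.transform(x_test[["ocean_proximity"]])
```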
Data augmentation is the process of enriching a dataset by adding new samples: either slightly modified copies of existing data or newly created synthetic data.
Feature engineering is the process of preparing the proper input features, in order to facilitate the learning task. The problem is made easier by expressing it in a simpler way. This usually requires a good domain knowledge.
The ability of deep neural networks to discover useful features by themselves has somewhat reduced the criticality of feature engineering. Nevertheless, it remains important in order to solve problems more elegantly and with less data.
Example (taken from the book Deep Learning with Python): the task of learning the time of day from a clock is far easier with engineered features rather than raw clock images.
At long last, our data is ready and we can start training models.
This step is often iterative and can be quite empirical. Depending on data and model complexity, it can also be resource-intensive.
- Underfitting (sometimes called bias): insufficient performance on training set.
- Overfitting (sometimes called variance): performance gap between training and validation sets.
Ultimately, we look for a tradeoff between underfitting and overfitting.
The goal of the training step is to find a model powerful enough to overfit the training set.
- Tackle underfitting:
  - Use a more complex model
  - Train the model longer
- Tackle overfitting:
  - Use more training data
  - Limit the model complexity
  - Introduce model-specific solutions
Model performance is assessed through an evaluation metric. Like the loss function, it depends on the problem type.
A classic choice for regression tasks is the Root Mean Square Error (RMSE). It gives an idea of how much error the trained model typically makes in its predictions. Of course, the smaller the better in that case.
Mean Absolute Error (less sensitive to outliers) and MSE may also be used.
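For reference, with $n$ samples, predictions $\hat{y}_i$ and true values $y_i$, these metrics are defined as follows:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2} \qquad \mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|\hat{y}_i - y_i\right|$$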
For each learning type (supervised, unsupervised...), several models of various complexity exist.
It is often useful to begin the training step by using a basic model. Its results will serve as a baseline when training more complicated models. In some cases, its performance might even be surprisingly good.
In this example, a Linear Regression model would be a good first choice.
After obtaining baseline results, other more sophisticated models may be tried, for example a Decision Tree in our case.
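A minimal sketch of both steps with scikit-learn, evaluated with RMSE on the training set (variable names continue the preprocessing example above):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor

# Baseline: Linear Regression
lin_reg = LinearRegression().fit(x_train_scaled, y_train)
lin_rmse = np.sqrt(mean_squared_error(y_train, lin_reg.predict(x_train_scaled)))

# More sophisticated model: Decision Tree
tree_reg = DecisionTreeRegressor(random_state=42).fit(x_train_scaled, y_train)
tree_rmse = np.sqrt(mean_squared_error(y_train, tree_reg.predict(x_train_scaled)))

print(f"Linear Regression RMSE: {lin_rmse:.2f}")
print(f"Decision Tree RMSE: {tree_rmse:.2f}")
```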
Some results look too good to be true. They are often cases of severe overfitting to the training set, which means the model won't perform well with unseen data.
As seen earlier, one way to detect overfitting is to split training data between a smaller training set and a validation set, used only to evaluate model performance after each training iteration.
A more sophisticated strategy is to apply K-fold cross validation. Training data is randomly split into $K$ equal-sized folds. The model is then trained and evaluated $K$ times, each time using a different fold for validation and the remaining folds for training; the $K$ results are finally averaged.
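A sketch with scikit-learn's `cross_val_score`, here with $K=5$ (variable names continue the example above):

```python
import numpy as np
from sklearn.model_selection import cross_val_score

# 5-fold cross-validation: the model is trained and evaluated 5 times,
# each time holding out a different fold for validation
scores = cross_val_score(
    tree_reg, x_train_scaled, y_train, scoring="neg_mean_squared_error", cv=5
)
rmse_scores = np.sqrt(-scores)  # scikit-learn returns negated MSE values
print(f"Mean RMSE across folds: {rmse_scores.mean():.2f}")
```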
Once a model looks promising, it must be tuned in order to offer the best compromise between optimization and generalization.
The goal is to find the set of model properties that gives the best performance. Model properties are often called hyperparameters (example: maximum depth for a decision tree).
This step can be either:
- manual, tweaking model hyperparameters by hand.
- automated, using a tool to explore the model hyperparameter space.
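A sketch of the automated approach with scikit-learn's `GridSearchCV` (the grid values are illustrative):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Exhaustive search over a small hyperparameter grid,
# each combination being evaluated with 5-fold cross-validation
param_grid = {"max_depth": [3, 5, 10, None], "min_samples_leaf": [1, 5, 10]}
grid_search = GridSearchCV(
    DecisionTreeRegressor(random_state=42),
    param_grid,
    scoring="neg_mean_squared_error",
    cv=5,
)
grid_search.fit(x_train_scaled, y_train)
print(grid_search.best_params_)
```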
Now is the time to evaluate the final model on the test set that we set aside earlier.
As seen before, preprocessing operations should be applied to test data using preprocessing metrics computed on training data.
This step depends on the technology stack.
It's often useful to save the pipeline of preprocessing operations (if any) alongside the trained model, since these operations must be applied to production data before using the model in inference mode.
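A minimal sketch using `joblib`, continuing the example above (file names are illustrative):

```python
import joblib

# Persist the fitted scaler and the tuned model together, so that the
# same preprocessing can be replayed on production data
joblib.dump(scaler, "scaler.joblib")
joblib.dump(grid_search.best_estimator_, "model.joblib")
```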
This step is highly context-dependent. A deployed model is often part of a larger system. Some common solutions:
- deploying the model as a web service accessible through an API.
- embedding the model into the user device.
The Flask web framework is often used to create a web API from a trained Python model.
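A minimal sketch of such an API, assuming the scaler and model were saved as above (the route and payload format are illustrative):

```python
import joblib
import numpy as np
from flask import Flask, jsonify, request

app = Flask(__name__)

# Load the preprocessing operations and the trained model saved earlier
scaler = joblib.load("scaler.joblib")
model = joblib.load("model.joblib")

@app.route("/predict", methods=["POST"])
def predict():
    # Expect a JSON payload like {"features": [[...], [...]]}
    features = np.array(request.get_json()["features"])
    predictions = model.predict(scaler.transform(features))
    return jsonify({"predictions": predictions.tolist()})

if __name__ == "__main__":
    app.run()
```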
You may test the trained model here.
In order to guarantee an optimal quality of service, the deployed system must be carefully monitored. This may involve:
- Checking the system’s live availability and performance at regular intervals.
- Sampling the system’s predictions and evaluating them.
- Checking input data quality.
- Retraining the model on fresh data.