- Regularization for Deep Learning
- Overview
- Bias Variance Trade-Off
- Strategies to make Deep Regularization Model
- Parameter Norm Penalty
- L2 Parameter Regularization
- L1 Norm Parameterization
- Comparing L1 and L2 Norm Parameterization
- Norm Regularization without Bias
- Norm Penalties as Constrained Optimization
- Explicit Constraints v/s Penalties
- Dataset Augmentation
- Semi-Supervised Learning
- Multi-task Learning
- Early Stopping
- Regularization for Deep Learning : Part - 2
- Regularization can be defined as any modification we make to a learning algorithm that is intended to reduce its generalization error but not its training error.
- The best-fitting model is typically a large model that has been regularized appropriately.
- The goal of regularization is to prevent overfitting by applying strategies such as:
- Put extra constraint on ML model.
- Add extra term in objective function.
- Use ensemble methods.
- Error due to bias - the difference between the expected (or average) prediction of our model and the correct value we are trying to predict.
- Error due to variance - how much the predictions for a given point vary between different realizations of the model.
- As model complexity increases, bias tends to decrease while variance tends to increase.
Consider the situation below, where we start off by training for up to 10 epochs. Here we encounter underfitting, since the model is not yet well fitted. Continuing further into the training epochs, we find a model that fits well. Upon training even more, the model starts to overfit.
- An effective regularizer is one that makes a profitable trade, reducing variance significantly while not overly increasing bias.
Consider the situation below, where the goal is to separate two points: which equation of a line will do the better job?
Since the prediction of the second solution is more accurate, we might think that solution 2 is a better fit than solution 1. But considering the risk of overfitting, solution 1 is the better choice. Consider the activation function (sigmoid) of the two solutions below:
Here, in the case of solution 2, the sigmoid is far steeper, which leads to overfitting. To overcome this problem, we should penalize large weights. This can be done by taking the old error term and adding a term that is large when the weights are large. This can be done in two ways:
- Limits the model's capacity by adding a norm penalty Ω(θ) to the objective function J, giving J̃(θ; X, y) = J(θ; X, y) + αΩ(θ).
- Does not modify the model in the inference phase; the penalty is applied only during learning.
- Norm penalty penalizes only weights, leaving biases unregularized.
- Also known as Weight Decay.
- Here w denotes all the weights that should be affected by the norm penalty, while the vector θ denotes all of the parameters, including both w and the unregularized parameters.
- Minimizing the regularized objective function decreases both the original objective J on the training data and some measure of the size of the parameters θ.
- Setting α ∈ [0, ∞) to 0 results in no regularization, while larger values of α correspond to more regularization (see the sketch below).
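As a concrete illustration (a minimal NumPy sketch with a hypothetical linear model, not taken from the original notes), the penalized objective J̃(θ) = J(θ) + αΩ(θ) can be written as:

```python
import numpy as np

def l2_penalty(w):
    # Ω(θ) = 1/2 * ||w||_2^2, applied to the weights only (bias excluded)
    return 0.5 * np.sum(w ** 2)

def regularized_loss(w, b, X, y, alpha):
    # J~(θ) = J(θ) + α * Ω(θ): the unregularized loss plus the norm penalty
    pred = X @ w + b                      # simple (hypothetical) linear model
    mse = np.mean((pred - y) ** 2)        # unregularized objective J
    return mse + alpha * l2_penalty(w)    # α controls the regularization strength
```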
- Commonly known as weight decay, this regularization strategy drives the weights closer to the origin by adding the regularization term Ω(θ) = (1/2)‖w‖₂².
- Make a quadratic approximation to the objective function in the neighborhood of w*, the value of the weights that attains the minimal unregularized training cost.
- The quadratic approximation of J gives Ĵ(w) = J(w*) + (1/2)(w − w*)ᵀ H (w − w*), where H is the Hessian of J at w*.
- Adding the weight decay gradient and solving for the new minimum w̃ gives α w̃ + H(w̃ − w*) = 0, i.e. w̃ = (H + αI)⁻¹ H w*.
- Since H is real and symmetric, we use the eigendecomposition to decompose H into a diagonal matrix Λ and an orthonormal basis of eigenvectors Q, such that H = QΛQᵀ, which gives w̃ = Q(Λ + αI)⁻¹ Λ Qᵀ w*.
- The component of w* aligned with the i-th eigenvector of H is rescaled by a factor of λi/(λi + α), as checked numerically below.
- When λi >> α, effect of regularization is relatively small.
- Components with λi << α, will be shrunk to have nearly zero magnitude.
- Only directions along which parameters contribute significantly to reducing objective function are preserved intact.
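A small NumPy check of this rescaling (the Hessian and w* values here are arbitrary, chosen only for illustration):

```python
import numpy as np

# Hypothetical quadratic objective: H is the Hessian at the unregularized minimum w*
H = np.array([[3.0, 1.0],
              [1.0, 2.0]])
w_star = np.array([1.0, -2.0])
alpha = 0.5

# Closed form: w~ = (H + αI)^-1 H w*
w_tilde = np.linalg.solve(H + alpha * np.eye(2), H @ w_star)

# Same result via the eigendecomposition H = Q Λ Q^T:
# the component along eigenvector i is rescaled by λi / (λi + α)
lam, Q = np.linalg.eigh(H)
w_tilde_eig = Q @ np.diag(lam / (lam + alpha)) @ Q.T @ w_star

print(w_tilde, w_tilde_eig)  # both agree (up to floating-point error)
```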
- L1 weight decay controls the strength of regularization by scaling the penalty Ω using a positive hyperparameter α. Formally, L1 regularization on the model parameters w is defined as Ω(θ) = ‖w‖₁ = Σi |wi|.
- Substituting the L1 norm for Ω(θ) gives the regularized objective J̃(w; X, y) = α‖w‖₁ + J(w; X, y).
- Its gradient is ∇w J̃(w; X, y) = α sign(w) + ∇w J(w; X, y), where sign(w) is applied element-wise.
- With a diagonal quadratic approximation of J around w*, the L1-regularized objective decomposes into a sum over the parameters: Ĵ(w; X, y) = J(w*; X, y) + Σi [ (1/2) Hi,i (wi − wi*)² + α|wi| ].
- Minimizing this has an analytical solution of the form wi = sign(wi*) max{ |wi*| − α/Hi,i , 0 } (a small numerical sketch follows below).
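A small NumPy sketch of this soft-thresholding solution (the values of w*, the diagonal Hessian, and α are illustrative):

```python
import numpy as np

def l1_soft_threshold(w_star, H_diag, alpha):
    # w_i = sign(w_i*) * max(|w_i*| - α / H_ii, 0)
    # Components whose unregularized optimum is small relative to α / H_ii
    # are driven exactly to zero, which is the source of L1 sparsity.
    return np.sign(w_star) * np.maximum(np.abs(w_star) - alpha / H_diag, 0.0)

w_star = np.array([0.8, -0.05, 2.5])   # hypothetical unregularized optimum
H_diag = np.array([1.0, 1.0, 2.0])     # diagonal Hessian approximation
print(l1_soft_threshold(w_star, H_diag, alpha=0.3))  # the middle weight becomes 0
```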
- The key difference between L1 and L2 regularization is that L1 shrinks the less important features' coefficients all the way to zero, removing some features altogether. This works well for feature selection when we have a huge number of features.
- The L1 norm is commonly used in ML when the difference between zero and non-zero elements is important.
- Sparsity refers to the fact that some parameters have an optimal value of zero. In this sense, L1 regularization is more sparsity-inducing than L2 regularization and can drive parameters to exactly 0 for large values of α.
- The sparsity of the L1 norm helps with feature selection, e.g. LASSO, which combines an L1 penalty with a linear model and a least-squares cost function. The L1 penalty causes a subset of the weights to become zero, suggesting that the corresponding features may safely be discarded (see the sketch below).
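To make the sparsity contrast concrete, here is a hedged sketch using scikit-learn (scikit-learn is not referenced in the original notes; the data and α values are illustrative):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
# Only the first two features are informative
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=200)

lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty
ridge = Ridge(alpha=0.1).fit(X, y)   # L2 penalty
print(lasso.coef_)  # uninformative coefficients are driven to (or very near) exactly 0
print(ridge.coef_)  # L2 shrinks them but typically leaves them nonzero
```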
- Usually the bias parameters are excluded from the norm penalty terms.
- The biases require less data to fit than the weights.
- Each weight specifies how two variables interact, while each bias controls only a single variable.
- Regularizing the bias parameters can cause significant under-fitting.
- Sometimes we wish to find the maximal/minimal value of f(x) for values of x in some set S. Expressing such a constrained problem directly can be difficult.
- Describing S with equality constraints g⁽ⁱ⁾(x) = 0 and inequality constraints h⁽ʲ⁾(x) ≤ 0, the generalized Lagrange function is given by L(x, λ, α) = f(x) + Σi λi g⁽ⁱ⁾(x) + Σj αj h⁽ʲ⁾(x).
- The constraint region for the above Lagrangian can be defined as S = { x | ∀i, g⁽ⁱ⁾(x) = 0 and ∀j, h⁽ʲ⁾(x) ≤ 0 }.
- The solution (optimal x value) of the above Lagrange formulation can be found by solving min_x max_λ max_{α, α≥0} L(x, λ, α).
- Therefore, the cost function regularized by a norm penalty is given by J̃(θ; X, y) = J(θ; X, y) + αΩ(θ).
- When we want to constrain Ω(θ) to be less than some constant k, we can construct the generalized Lagrange function L(θ, α; X, y) = J(θ; X, y) + α(Ω(θ) − k).
- The solution to the above constrained problem is given by θ* = argmin_θ max_{α, α≥0} L(θ, α).
- α must increase whenever Ω(θ) > k and decrease whenever Ω(θ) < k.
- The effect of the constraint can be seen by fixing α* and viewing the problem as a function of θ: θ* = argmin_θ L(θ, α*) = argmin_θ J(θ; X, y) + α*Ω(θ).
- Value of α* does not directly tell us value of k.
- We can solve for k, but the relationship between k and α* depends on form of J.
- Larger α will result in smaller constraint region.
- Smaller α will result in larger constraint region.
- For example, in stochastic gradient descent, we take a step downhill on J(θ) and then project θ back to the nearest point that satisfies Ω(θ) < k, which saves us from having to find the value of α corresponding to k (a minimal sketch follows below).
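A minimal sketch of this projection step, assuming an L2 norm constraint ‖θ‖₂ ≤ k (the function names are illustrative):

```python
import numpy as np

def project_onto_l2_ball(theta, k):
    # If ||θ||_2 exceeds k, rescale θ back onto the constraint region Ω(θ) <= k
    norm = np.linalg.norm(theta)
    return theta if norm <= k else theta * (k / norm)

def projected_sgd_step(theta, grad, lr, k):
    theta = theta - lr * grad                # ordinary downhill step on J(θ)
    return project_onto_l2_ball(theta, k)    # explicit constraint via reprojection
```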
- Penalties can cause nonconvex optimization procedures (where a local minimum need not be the global optimum over the feasible region) to get stuck in local minima corresponding to small θ.
- This manifests as training a neural network with dead units.
- Dead units contribute very little to the learning of the network, since the weights going into and out of them are very small.
- Explicit constraints with reprojection work better in this respect, since they do not force the weights to approach the origin.
- Explicit constraints only come into effect when the weights become large and try to leave the constraint region.
The best way to make a machine learning model generalize better is to train it on more data. Data augmentation is a way of creating fake data and adding it to the training set (a minimal sketch follows below).
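A minimal sketch of such augmentation for image-like data (illustrative only; real pipelines use richer transforms such as crops, rotations, and color shifts):

```python
import numpy as np

def augment(image, rng):
    # image: H x W array; create a "fake" training example from a real one
    if rng.random() < 0.5:
        image = image[:, ::-1]                           # random horizontal flip
    image = image + rng.normal(0.0, 0.01, image.shape)   # small additive noise
    return image

rng = np.random.default_rng(0)
fake_example = augment(np.zeros((8, 8)), rng)
```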
- Consider the regression setting, where we wish to train a function ŷ(x) that maps a set of features x to a scalar using the least-squares cost function between the model predictions ŷ(x) and the true values y: J = E_{p(x,y)}[(ŷ(x) − y)²].
- Now assume we add a random perturbation εW to the network weights.
- Denote the perturbed model as ŷ_εW(x).
- The diagram below shows how the objective function changes before and after adding noise to the weights.
- Injecting noise into the weights makes the model relatively insensitive to small variations in the weights (a minimal sketch follows below).
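A minimal sketch of weight-noise injection for the least-squares setting above (illustrative; the noise scale eta is a hypothetical hyperparameter):

```python
import numpy as np

def loss_with_weight_noise(w, X, y, rng, eta=0.01):
    # Evaluate the least-squares loss at randomly perturbed weights w + εW,
    # where εW is zero-mean Gaussian noise with scale eta. Minimizing this
    # pushes the solution toward minima that are insensitive to small
    # perturbations of the weights.
    eps_w = rng.normal(0.0, eta, size=w.shape)
    pred = X @ (w + eps_w)
    return np.mean((pred - y) ** 2)
```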
- Most datasets have some number of mistakes in their y labels. It can be harmful to maximize log p(y | x) when y is a mistake.
- Assume that, for some small constant ε, the training set label y is correct with probability 1 − ε.
- Label smoothing regularizes a model based on a softmax with k output values by replacing the hard 0 and 1 classification targets with targets of ε/(k−1) and 1 − ε, respectively (see the sketch below).
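A minimal sketch of label smoothing for a k-class softmax target (illustrative values):

```python
import numpy as np

def smooth_labels(labels, k, eps=0.1):
    # Replace hard one-hot targets: ε/(k-1) for the wrong classes, 1-ε for the true class
    targets = np.full((len(labels), k), eps / (k - 1))
    targets[np.arange(len(labels)), labels] = 1.0 - eps
    return targets

print(smooth_labels(np.array([0, 2]), k=3))  # rows sum to 1, no hard 0/1 targets
```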
- In the context of deep learning, semi-supervised learning usually refers to learning a representation h = f(x) so that examples from the same class have similar representations.
- Instead of having separate unsupervised and supervised components in the model, one can construct models in which a generative model of either P(x) or P(x, y) shares parameters with a discriminative model of P(y | x).
- One can then trade off the supervised criterion −log P(y | x) with the unsupervised or generative one (such as −log P(x) or −log P(x, y)).
- E.g., PCA.
- Way to improve generalization by pooling examples arising out of several tasks.
- The diagram below shows a multi-task learning example where different supervised tasks share the same input x and an intermediate-level representation h.
- Optimizes more than one cost function.
- Improves generalization by leveraging the domain-specific information contained in the training data of related tasks.
- The model can generally be divided into two kinds of parts with associated parameters: task-specific parameters and generic parameters shared across all tasks.
- Hard parameter sharing: the generic hidden layers are shared between all tasks, and each task adds its own output layers (as in the sketch below).
- Soft parameter sharing: each task has its own model and parameters, and the distance between the parameters of the different models is regularized to keep them similar.
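A minimal NumPy sketch of hard parameter sharing (layer sizes are arbitrary): a shared representation h feeds two task-specific output heads.

```python
import numpy as np

rng = np.random.default_rng(0)
W_shared = rng.normal(size=(10, 32))   # generic parameters, shared by all tasks
W_task1 = rng.normal(size=(32, 3))     # task-specific parameters (task 1)
W_task2 = rng.normal(size=(32, 1))     # task-specific parameters (task 2)

def forward(x):
    h = np.tanh(x @ W_shared)          # shared intermediate-level representation h
    return h @ W_task1, h @ W_task2    # two task-specific outputs

y1, y2 = forward(rng.normal(size=(4, 10)))
```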
- Motivation: when training large models with sufficient representational capacity, training error decreases steadily over time but validation set error eventually begins to rise again.
- Therefore, instead of returning the latest parameters, we keep a copy of the model parameters every time the error on the validation set improves (i.e., the model hits a new lowest validation set error).
- The algorithm terminates when no improvement over the best recorded validation error has been seen for some pre-specified number of iterations. This is called early stopping (an effective form of hyperparameter selection).
- Controls effective capacity of model.
- Excessive training can cause over-fitting.
- Early stopping requires no change in the training procedure, objective function, or set of allowable parameter values (the learning dynamics).
- Early stopping can be used alone or in conjunction with other regularization strategies.
- Early stopping requires a validation data set (extra data not included in the training data). To exploit this data, an extra round of training can be performed after the initial training. There are two strategies for this second round of training:
- Initialize the model again and retrain on all of the data, training for the same number of steps that the early stopping procedure determined in the first round.
- There is no good way of knowing whether to retrain for the same number of parameter updates or the same number of passes through the dataset.
- Keep the parameters obtained from the first round of training and then continue training using all of the data.
- Monitor the average loss on the validation set and continue training until it falls below the value of the training set objective at which the early stopping procedure halted.
- Prevents high cost of re-training model from scratch.
- May not ever terminate, if objective on validation set never reaches the target value.
- Cost: the validation set must be evaluated periodically during training in order to select the effective training-time hyperparameter.
- Additional cost of maintaining a copy of the best model parameters (a sketch of the loop follows below).
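A sketch of the early stopping loop described above (illustrative only; `train_one_epoch` and `validation_loss` are hypothetical callables supplied by the user):

```python
import copy

def train_with_early_stopping(model, train_one_epoch, validation_loss,
                              patience=10, max_epochs=1000):
    best_loss, best_params, best_epoch = float("inf"), None, 0
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        train_one_epoch(model)                      # one pass of ordinary training
        val_loss = validation_loss(model)           # periodic validation evaluation
        if val_loss < best_loss:
            best_loss, best_epoch = val_loss, epoch
            best_params = copy.deepcopy(model)      # keep a copy of the best parameters
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                               # no improvement for `patience` epochs
    return best_params, best_epoch  # best_epoch can guide a second round of training
```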