- Regularization for Deep Learning
- Overview
- Bias Variance Trade-Off
- Strategies to make Deep Regularization Model
- Parameter Norm Penalty
- L2 Parameter Regularization
- L1 Norm Parameterization
- Comparing L1 and L2 Norm Parameterization
- Norm Regularization without Bias
- Norm Penalties as Constrained Optimization
- Explicit Constraints v/s Penalties
- Dataset Augmentation
- Semi-Supervised Learning
- Multi-task Learning
- Early Stopping
- Regularization for Deep Learning : Part - 2
- Regularization can be defined as any modification we make to a learning algorithm that is intended to reduce its generalization error but not its training error.
- The best-fitting model is typically a large model that has been regularized appropriately.
- The goal of regularization is to prevent overfitting by applying strategies such as:
- Put extra constraint on ML model.
- Add extra term in objective function.
- Use ensemble methods.
- Error due to bias - the difference between the expected (or average) prediction of our model and the correct value we are trying to predict.
- Error due to variance - how much the predictions for a given point vary between different realizations of the model.
- As model complexity increases, bias tends to decrease while variance tends to increase.
Consider the situation below, where we start off by training for up to 10 epochs. Here we encounter underfitting, since the model is not yet well fitted. Continuing further into the training epochs, we find a model that fits well. Upon training even more, the model starts to overfit.
- An effective regularizer is one that makes a profitable trade, reducing variance significantly while not overly increasing bias.
Consider the situation below, where the goal is to separate two points: which equation of a line will do the better job?
Since the prediction of the second solution is more accurate, we might think that solution 2 is a better fit than solution 1. But considering the risk of overfitting, solution 1 is the better choice. Consider the activation function (sigmoid) of the two solutions below:
Here, in the case of solution 2, the sigmoid is far steeper, which leads to overfitting. To overcome this problem, we should penalize large weights. This can be done by taking the old error term and adding a term that is large when the weights are large. This can be done in two ways:
- Limits the model's capacity by adding a norm penalty Ω(θ) to the objective function J, giving J̃(θ; X, y) = J(θ; X, y) + αΩ(θ).
- Does not modify the model in the inference phase; the penalty is applied only during learning.
- Norm penalty penalizes only weights, leaving biases unregularized.
- Also known as Weight Decay.
- Here w denotes all the weights that should be affected by the norm penalty, while the vector θ denotes all of the parameters, including both w and the unregularized parameters.
- Minimizing the regularized objective function decreases both the original objective J on the training data and some measure of the size of the parameters θ.
- Setting α ∈ [0, ∞) to 0 results in no regularization, while larger values of α correspond to more regularization (see the sketch below).
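As a concrete illustration (a minimal NumPy sketch with a hypothetical linear model, not taken from the original notes), the penalized objective J̃(θ) = J(θ) + αΩ(θ) can be written as:

```python
import numpy as np

def l2_penalty(w):
    # Ω(θ) = 1/2 * ||w||_2^2, applied to the weights only (bias excluded)
    return 0.5 * np.sum(w ** 2)

def regularized_loss(w, b, X, y, alpha):
    # J~(θ) = J(θ) + α * Ω(θ): the unregularized loss plus the norm penalty
    pred = X @ w + b                      # simple (hypothetical) linear model
    mse = np.mean((pred - y) ** 2)        # unregularized objective J
    return mse + alpha * l2_penalty(w)    # α controls the regularization strength
```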
- Commonly known as weight decay, this regularization strategy drives the weights closer to the origin by adding the regularization term Ω(θ) = (1/2)‖w‖₂².
- Make a quadratic approximation to the objective function in the neighborhood of w*, the value of the weights that attains the minimal unregularized training cost.
- The quadratic approximation of J gives Ĵ(w) = J(w*) + (1/2)(w − w*)ᵀ H (w − w*), where H is the Hessian of J at w*.
- Adding the weight decay gradient and solving for the new minimum w̃ gives α w̃ + H(w̃ − w*) = 0, i.e. w̃ = (H + αI)⁻¹ H w*.
- Since H is real and symmetric, we use the eigendecomposition to decompose H into a diagonal matrix Λ and an orthonormal basis of eigenvectors Q, such that H = QΛQᵀ, which gives w̃ = Q(Λ + αI)⁻¹ Λ Qᵀ w*.
- The component of w* aligned with the i-th eigenvector of H is rescaled by a factor of λi/(λi + α), as checked numerically below.
- When λi >> α, effect of regularization is relatively small.
- Components with λi << α, will be shrunk to have nearly zero magnitude.
- Only directions along which parameters contribute significantly to reducing objective function are preserved intact.
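A small NumPy check of this rescaling (the Hessian and w* values here are arbitrary, chosen only for illustration):

```python
import numpy as np

# Hypothetical quadratic objective: H is the Hessian at the unregularized minimum w*
H = np.array([[3.0, 1.0],
              [1.0, 2.0]])
w_star = np.array([1.0, -2.0])
alpha = 0.5

# Closed form: w~ = (H + αI)^-1 H w*
w_tilde = np.linalg.solve(H + alpha * np.eye(2), H @ w_star)

# Same result via the eigendecomposition H = Q Λ Q^T:
# the component along eigenvector i is rescaled by λi / (λi + α)
lam, Q = np.linalg.eigh(H)
w_tilde_eig = Q @ np.diag(lam / (lam + alpha)) @ Q.T @ w_star

print(w_tilde, w_tilde_eig)  # both agree (up to floating-point error)
```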
- L1 weight decay controls the strength of regularization by scaling the penalty Ω using a positive hyperparameter α. Formally, L1 regularization on the model parameters w is defined as Ω(θ) = ‖w‖₁ = Σi |wi|.
- Substituting the L1 norm for Ω(θ) gives the regularized objective J̃(w; X, y) = α‖w‖₁ + J(w; X, y).
- Its gradient is ∇w J̃(w; X, y) = α sign(w) + ∇w J(w; X, y), where sign(w) is applied element-wise.
- With a diagonal quadratic approximation of J around w*, the L1-regularized objective decomposes into a sum over the parameters: Ĵ(w; X, y) = J(w*; X, y) + Σi [ (1/2) Hi,i (wi − wi*)² + α|wi| ].
- Minimizing this has an analytical solution of the form wi = sign(wi*) max{ |wi*| − α/Hi,i , 0 } (a small numerical sketch follows below).
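A small NumPy sketch of this soft-thresholding solution (the values of w*, the diagonal Hessian, and α are illustrative):

```python
import numpy as np

def l1_soft_threshold(w_star, H_diag, alpha):
    # w_i = sign(w_i*) * max(|w_i*| - α / H_ii, 0)
    # Components whose unregularized optimum is small relative to α / H_ii
    # are driven exactly to zero, which is the source of L1 sparsity.
    return np.sign(w_star) * np.maximum(np.abs(w_star) - alpha / H_diag, 0.0)

w_star = np.array([0.8, -0.05, 2.5])   # hypothetical unregularized optimum
H_diag = np.array([1.0, 1.0, 2.0])     # diagonal Hessian approximation
print(l1_soft_threshold(w_star, H_diag, alpha=0.3))  # the middle weight becomes 0
```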
- The key difference between L1 and L2 regularization is that L1 shrinks the less important features' coefficients all the way to zero, removing some features altogether. This works well for feature selection when we have a huge number of features.
- The L1 norm is commonly used in ML when the difference between zero and non-zero elements is important.
- Sparsity refers to the fact that some parameters have an optimal value of zero. In this sense, L1 regularization is more sparsity-inducing than L2 regularization and can drive parameters to exactly 0 for large values of α.
- The sparsity of the L1 norm helps with feature selection, e.g. LASSO, which combines an L1 penalty with a linear model and a least-squares cost function. The L1 penalty causes a subset of the weights to become zero, suggesting that the corresponding features may safely be discarded (see the sketch below).
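To make the sparsity contrast concrete, here is a hedged sketch using scikit-learn (scikit-learn is not referenced in the original notes; the data and α values are illustrative):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
# Only the first two features are informative
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=200)

lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty
ridge = Ridge(alpha=0.1).fit(X, y)   # L2 penalty
print(lasso.coef_)  # uninformative coefficients are driven to (or very near) exactly 0
print(ridge.coef_)  # L2 shrinks them but typically leaves them nonzero
```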
- Usually the bias parameters are excluded from the norm penalty terms.
- The biases require less data to fit than the weights.
- Each weight specifies how two variables interact, while each bias controls only a single variable.
- Regularizing the bias parameters can cause significant under-fitting.
- Sometimes we wish to find the maximal/minimal value of f(x) for values of x in some set S. Expressing such a constrained problem directly can be difficult.
- Describing S with equality constraints g⁽ⁱ⁾(x) = 0 and inequality constraints h⁽ʲ⁾(x) ≤ 0, the generalized Lagrange function is given by L(x, λ, α) = f(x) + Σi λi g⁽ⁱ⁾(x) + Σj αj h⁽ʲ⁾(x).
- The constraint region for the above Lagrangian can be defined as S = { x | ∀i, g⁽ⁱ⁾(x) = 0 and ∀j, h⁽ʲ⁾(x) ≤ 0 }.
- The solution (optimal x value) of the above Lagrange formulation can be found by solving min_x max_λ max_{α, α≥0} L(x, λ, α).
- Therefore, the cost function regularized by a norm penalty is given by J̃(θ; X, y) = J(θ; X, y) + αΩ(θ).
- When we want to constrain Ω(θ) to be less than some constant k, we can construct the generalized Lagrange function L(θ, α; X, y) = J(θ; X, y) + α(Ω(θ) − k).
- The solution to the above constrained problem is given by θ* = argmin_θ max_{α, α≥0} L(θ, α).
- α must increase whenever Ω(θ) > k and decrease whenever Ω(θ) < k.
- The effect of the constraint can be seen by fixing α* and viewing the problem as a function of θ: θ* = argmin_θ L(θ, α*) = argmin_θ J(θ; X, y) + α*Ω(θ).
- Value of α* does not directly tell us value of k.
- We can solve for k, but the relationship between k and α* depends on form of J.
- Larger α will result in smaller constraint region.
- Smaller α will result in larger constraint region.
- For example, in stochastic gradient descent, we take a step downhill on J(θ) and then project θ back to the nearest point that satisfies Ω(θ) < k, which saves us from having to find the value of α corresponding to k (a minimal sketch follows below).
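A minimal sketch of this projection step, assuming an L2 norm constraint ‖θ‖₂ ≤ k (the function names are illustrative):

```python
import numpy as np

def project_onto_l2_ball(theta, k):
    # If ||θ||_2 exceeds k, rescale θ back onto the constraint region Ω(θ) <= k
    norm = np.linalg.norm(theta)
    return theta if norm <= k else theta * (k / norm)

def projected_sgd_step(theta, grad, lr, k):
    theta = theta - lr * grad                # ordinary downhill step on J(θ)
    return project_onto_l2_ball(theta, k)    # explicit constraint via reprojection
```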
- Penalties can cause nonconvex optimization procedures (where a local minimum need not be the global optimum over the feasible region) to get stuck in local minima corresponding to small θ.
- This manifests as training a neural network with dead units.
- Dead units contribute very little to the learning of the network, since the weights going into and out of them are very small.
- Explicit constraints with reprojection work better in this respect, since they do not force the weights to approach the origin.
- Explicit constraints only come into effect when the weights become large and try to leave the constraint region.
The best way to make a machine learning model generalize better is to train it on more data. Data augmentation is a way of creating fake data and adding it to the training set (a minimal sketch follows below).
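A minimal sketch of such augmentation for image-like data (illustrative only; real pipelines use richer transforms such as crops, rotations, and color shifts):

```python
import numpy as np

def augment(image, rng):
    # image: H x W array; create a "fake" training example from a real one
    if rng.random() < 0.5:
        image = image[:, ::-1]                           # random horizontal flip
    image = image + rng.normal(0.0, 0.01, image.shape)   # small additive noise
    return image

rng = np.random.default_rng(0)
fake_example = augment(np.zeros((8, 8)), rng)
```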
- Consider the regression setting, where we wish to train a function ŷ(x) that maps a set of features x to a scalar using the least-squares cost function between the model predictions ŷ(x) and the true values y: J = E_{p(x,y)}[(ŷ(x) − y)²].
- Now assume we add a random perturbation εW to the network weights.
- Denote the perturbed model as ŷ_εW(x).
- The diagram below shows how the objective function changes before and after adding noise to the weights.
- Injecting noise into the weights makes the model relatively insensitive to small variations in the weights (a minimal sketch follows below).
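A minimal sketch of weight-noise injection for the least-squares setting above (illustrative; the noise scale eta is a hypothetical hyperparameter):

```python
import numpy as np

def loss_with_weight_noise(w, X, y, rng, eta=0.01):
    # Evaluate the least-squares loss at randomly perturbed weights w + εW,
    # where εW is zero-mean Gaussian noise with scale eta. Minimizing this
    # pushes the solution toward minima that are insensitive to small
    # perturbations of the weights.
    eps_w = rng.normal(0.0, eta, size=w.shape)
    pred = X @ (w + eps_w)
    return np.mean((pred - y) ** 2)
```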
- Most datasets have some number of mistakes in their y labels. It can be harmful to maximize log p(y | x) when y is a mistake.
- Assume that, for some small constant ε, the training set label y is correct with probability 1 − ε.
- Label smoothing regularizes a model based on a softmax with k output values by replacing the hard 0 and 1 classification targets with targets of ε/(k−1) and 1 − ε, respectively (see the sketch below).
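A minimal sketch of label smoothing for a k-class softmax target (illustrative values):

```python
import numpy as np

def smooth_labels(labels, k, eps=0.1):
    # Replace hard one-hot targets: ε/(k-1) for the wrong classes, 1-ε for the true class
    targets = np.full((len(labels), k), eps / (k - 1))
    targets[np.arange(len(labels)), labels] = 1.0 - eps
    return targets

print(smooth_labels(np.array([0, 2]), k=3))  # rows sum to 1, no hard 0/1 targets
```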
- In the context of deep learning, semi-supervised learning usually refers to learning a representation h = f(x) so that examples from the same class have similar representations.
- Instead of having separate unsupervised and supervised components in the model, one can construct models in which a generative model of either P(x) or P(x, y) shares parameters with a discriminative model of P(y | x).
- One can then trade off the supervised criterion −log P(y | x) with the unsupervised or generative one (such as −log P(x) or −log P(x, y)).
- E.g., PCA.
- Way to improve generalization by pooling examples arising out of several tasks.
- The diagram below shows a multi-task learning example where different supervised tasks share the same input x and an intermediate-level representation h.
- Optimizes more than one cost function.
- Improves generalization by leveraging the domain-specific information contained in the training data of related tasks.
- The model can generally be divided into two kinds of parts with associated parameters: task-specific parameters and generic parameters shared across all tasks.
- Hard parameter sharing: the generic hidden layers are shared between all tasks, and each task adds its own output layers (as in the sketch below).
- Soft parameter sharing: each task has its own model and parameters, and the distance between the parameters of the different models is regularized to keep them similar.
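A minimal NumPy sketch of hard parameter sharing (layer sizes are arbitrary): a shared representation h feeds two task-specific output heads.

```python
import numpy as np

rng = np.random.default_rng(0)
W_shared = rng.normal(size=(10, 32))   # generic parameters, shared by all tasks
W_task1 = rng.normal(size=(32, 3))     # task-specific parameters (task 1)
W_task2 = rng.normal(size=(32, 1))     # task-specific parameters (task 2)

def forward(x):
    h = np.tanh(x @ W_shared)          # shared intermediate-level representation h
    return h @ W_task1, h @ W_task2    # two task-specific outputs

y1, y2 = forward(rng.normal(size=(4, 10)))
```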
- Motivation: when training large models with sufficient representational capacity, training error decreases steadily over time but validation set error eventually begins to rise again.
- Therefore, instead of returning the latest parameters, we keep a copy of the model parameters every time the error on the validation set improves (i.e., the model hits a new lowest validation set error).
- The algorithm terminates when no improvement over the best recorded validation error has been seen for some pre-specified number of iterations. This is called early stopping (an effective form of hyperparameter selection).
- Controls effective capacity of model.
- Excessive training can cause over-fitting.
- Early stopping requires no change in the training procedure, objective function, or set of allowable parameter values (the learning dynamics).
- Early stopping can be used alone or in conjunction with other regularization strategies.
- Early stopping requires a validation data set (extra data not included in the training data). To exploit this data, an extra round of training can be performed after the initial training. There are two strategies for this second round of training:
- Initialize the model again and retrain on all of the data, training for the same number of steps that the early stopping procedure determined in the first round.
- There is no good way of knowing whether to retrain for the same number of parameter updates or the same number of passes through the dataset.
- Keep the parameters obtained from the first round of training and then continue training using all of the data.
- Monitor the average loss on the validation set and continue training until it falls below the value of the training set objective at which the early stopping procedure halted.
- Prevents high cost of re-training model from scratch.
- May not ever terminate, if objective on validation set never reaches the target value.
- Cost: the validation set must be evaluated periodically during training in order to select the effective training-time hyperparameter.
- Additional cost of maintaining a copy of the best model parameters (a sketch of the loop follows below).
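A sketch of the early stopping loop described above (illustrative only; `train_one_epoch` and `validation_loss` are hypothetical callables supplied by the user):

```python
import copy

def train_with_early_stopping(model, train_one_epoch, validation_loss,
                              patience=10, max_epochs=1000):
    best_loss, best_params, best_epoch = float("inf"), None, 0
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        train_one_epoch(model)                      # one pass of ordinary training
        val_loss = validation_loss(model)           # periodic validation evaluation
        if val_loss < best_loss:
            best_loss, best_epoch = val_loss, epoch
            best_params = copy.deepcopy(model)      # keep a copy of the best parameters
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                               # no improvement for `patience` epochs
    return best_params, best_epoch  # best_epoch can guide a second round of training
```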