Overfitting in Regression and Neural Nets – Visually Understanding (Overview)

As a first step into machine learning, this post shows some primitive overfitting examples and explains what you should watch for and how to avoid it. To build your intuition, I show several samples with many visual images.
First we look at simple overfitting examples in traditional statistical regression, and in the latter part we discuss the case of neural networks.

First look at Overfitting

Let me explain overfitting in machine learning with a brief example dataset. (See the following plot of the sample data.)

sampledat

The following R script fits a linear model to the above dataset (sampledat).
To fit the given data precisely, we use a polynomial formula built with the poly() function as follows. If the result is , then  might be zero.

fit 

Here I show you the result.

As you can see, we get the following equation (1) as the fitted equation for the given data.

Now we plot this equation together with the given dataset. Equation (1) fits the given data very closely, as follows.

Is that really a good result?

In fact, this dataset (sampledat) was generated by the following R script. As you can see below, the dataset is given by a fixed formula plus some noise.
That is, the result (equation (1)) is overfitting!

n_sample 

Let's look at equation (1) from a bird's-eye view. (See the following plot of equation (1).)
Equation (1) fits only the given 20 data points (the above "sampledat") and does not generalize. If new data points are generated by the previous R script, they won't fit equation (1).
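The effect can be reproduced in a few lines. The following is a minimal sketch in plain Python, not the R script used above: the data generator (a straight line plus Gaussian noise), the polynomial degrees, and all function names are assumptions made for illustration. It fits a low-degree and a high-degree polynomial to 20 training points and reports the mean squared error on the training points and on fresh points from the same generator.

```python
import random

def fit_polynomial(xs, ys, degree):
    """Least-squares polynomial fit via the normal equations (A^T A) c = A^T y."""
    m = degree + 1
    ata = [[sum(x ** (i + j) for x in xs) for j in range(m)] for i in range(m)]
    aty = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(m)]
    # Gaussian elimination with partial pivoting.
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(ata[r][col]))
        ata[col], ata[pivot] = ata[pivot], ata[col]
        aty[col], aty[pivot] = aty[pivot], aty[col]
        for row in range(col + 1, m):
            factor = ata[row][col] / ata[col][col]
            for k in range(col, m):
                ata[row][k] -= factor * ata[col][k]
            aty[row] -= factor * aty[col]
    # Back substitution.
    coeffs = [0.0] * m
    for row in range(m - 1, -1, -1):
        tail = sum(ata[row][k] * coeffs[k] for k in range(row + 1, m))
        coeffs[row] = (aty[row] - tail) / ata[row][row]
    return coeffs  # coeffs[i] multiplies x**i

def predict(coeffs, x):
    return sum(c * x ** i for i, c in enumerate(coeffs))

def mean_squared_error(coeffs, xs, ys):
    return sum((predict(coeffs, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def make_data(n, rng):
    # Hypothetical generator: a straight line plus Gaussian noise.
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [2.0 * x + 1.0 + rng.gauss(0.0, 0.3) for x in xs]
    return xs, ys

rng = random.Random(0)
train_x, train_y = make_data(20, rng)
test_x, test_y = make_data(200, rng)

# Print degree, training error, and error on fresh data.
for degree in (1, 6):
    c = fit_polynomial(train_x, train_y, degree)
    print(degree,
          round(mean_squared_error(c, train_x, train_y), 4),
          round(mean_squared_error(c, test_x, test_y), 4))
```

The higher-degree fit can never have a larger training error than the straight line, because it contains the line as a special case. What matters is the error on the fresh points, which is where an overfitted model typically does worse.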

Here we showed a trivial overfitting example for a first understanding, but in real practical cases it's difficult to tell whether a model is overfitting or not.
Our next questions are: how do we detect it, and how do we avoid it?

Information Criterion

Now let's say you add an extra parameter to your regression formula, but the likelihood improves only slightly or stays almost the same. If so, you may conclude that this new parameter is not needed in the regression formula.

In the statistical approach, there is a criterion (called an "information criterion") for judging your model fit on a mathematical basis.
A famous one is the Akaike Information Criterion (AIC), defined as follows. A smaller value means a better fit.

$$\mathrm{AIC} = -2 \ln L + 2k$$

where $k$ is the number of estimated parameters and $L$ is the maximum likelihood.

Note : For both AIC and BIC (Bayesian information criterion), the value is given by $-2 \ln L + c\,k$. In AIC, $c$ equals 2.
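As a worked illustration of this criterion, here is a minimal Python sketch (R's logLik() and AIC() do the same job on a fitted model). It assumes Gaussian noise, for which the maximized log likelihood can be written in terms of the residual sum of squares. The function names and the two residual values are hypothetical, chosen to mimic the case where an extra parameter barely improves the fit.

```python
import math

def gaussian_log_likelihood(rss, n):
    """Maximized log likelihood of a Gaussian regression with n points and
    residual sum of squares rss (variance estimated as rss / n)."""
    sigma2 = rss / n
    return -0.5 * n * (math.log(2.0 * math.pi * sigma2) + 1.0)

def information_criterion(log_lik, k, penalty):
    """-2 ln L + penalty * k; penalty = 2 gives AIC, penalty = ln(n) gives BIC."""
    return -2.0 * log_lik + penalty * k

n = 20  # number of data points
# Hypothetical residual sums of squares: a 4th parameter barely improves the fit.
rss_small_model, rss_large_model = 10.0, 9.9

# k counts every estimated parameter of the model.
aic_small = information_criterion(gaussian_log_likelihood(rss_small_model, n), k=3, penalty=2.0)
aic_large = information_criterion(gaussian_log_likelihood(rss_large_model, n), k=4, penalty=2.0)
bic_small = information_criterion(gaussian_log_likelihood(rss_small_model, n), k=3, penalty=math.log(n))

print(aic_small, aic_large)
print(aic_small < aic_large)  # the smaller model is preferred here
```

The slight gain in likelihood from the extra parameter is smaller than the penalty of 2 it costs, so the AIC goes up and the smaller model wins.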

The following is a plot of the log likelihood ($\ln L$) and AIC values for the previous dataset (sampledat). The red line is the log likelihood and the blue line is the AIC. (You can easily get these values with the logLik() and AIC() functions in R.)

As you can see, the appropriate number of estimated parameters is 3. That is, a formula with two coefficients plus the variance is a good fit. (See the following note.)

Note : Hidden estimated parameters (such as the variance of a Gaussian or the shape of a Gamma) must be counted among the estimated parameters for AIC. In this case we are using a Gaussian, so we must add the variance to the estimated parameters. (See my previous post "Understanding the basis of GLM Regression" for details.)
For instance, if the formula has two coefficients, the estimated parameters are those two coefficients and the variance (3 parameters).

For instance, if we use the following dataset, the number of parameters should be 4. The formula then has three coefficients, plus the variance.

Here we use only a single input variable, but if you have several input variables, you must also consider the interactions between them.

Overfitting in neural networks

Let's move on to neural networks.

First, remember that a large number of layers and neurons often causes overfitting. The number of layers in particular strongly affects model complexity.

To simplify our example, consider a brief feed-forward neural network with sigmoid activations, two input variables, and one binary output (an output between 0 and 1).
With 1 hidden layer, it can represent models like the ones illustrated below. (The model can have several linear boundaries and combinations of them.)

With 2 hidden layers, it can represent more complex models, as illustrated below. (These are combinations of the 1-hidden-layer models.)
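To make the two architectures concrete, here is a minimal forward-pass sketch in plain Python. All weights and biases are hypothetical values picked only for illustration. The point is the structure: each extra hidden layer feeds the previous layer's sigmoid outputs into new sigmoid neurons, which is what lets the network combine simpler boundaries into more complex ones.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def dense(inputs, weights, biases):
    """One fully connected sigmoid layer: weights[j] holds neuron j's input weights."""
    return [sigmoid(sum(w * v for w, v in zip(ws, inputs)) + b)
            for ws, b in zip(weights, biases)]

def forward(x, layers):
    """Feed x through each (weights, biases) layer in turn."""
    for weights, biases in layers:
        x = dense(x, weights, biases)
    return x

# Hypothetical weights, chosen only for illustration.
one_hidden = [
    ([[4.0, 0.0], [0.0, 4.0]], [-2.0, -2.0]),  # hidden layer: 2 neurons
    ([[3.0, 3.0]], [-4.5]),                    # output layer: 1 neuron
]
two_hidden = [
    ([[4.0, 0.0], [0.0, 4.0], [-4.0, 4.0]], [-2.0, -2.0, 0.0]),  # hidden layer 1
    ([[3.0, 3.0, -3.0], [-3.0, 3.0, 3.0]], [-1.5, -1.5]),        # hidden layer 2
    ([[3.0, -3.0]], [0.0]),                                      # output layer
]

print(forward([0.2, 0.9], one_hidden))  # a single value between 0 and 1
print(forward([0.2, 0.9], two_hidden))
```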

Given some noisy data, a network with 2 hidden layers might overfit as follows.
As you can see here, a large number of layers can cause overfitting.

Note : Unlike the statistical approach, there is no concrete criterion for deciding the best number of layers or neurons, because there is no common evaluation measure based on a mathematical model.
You must examine and evaluate the trained model with test data or validation data.

Model complexity is also caused by large coefficients. Let's see the next example.

As you know, the sigmoid has a near-linear part and a binary (saturated) part, as shown in the following. The linear part can fit smoothly, but the binary part cannot (it fits in a step-like, binary way).
As weights increase, the binary part becomes stronger than the linear part.

For example, let's see the network illustrated below.

This network results in the following plot (wire frame), where z is the output plotted against the two inputs. As you can see, it transitions smoothly.

Let's see the next example.
This network has exactly the same boundary as the previous one, but the coefficients (weights and bias) are much larger.

When we plot the inputs and the output (z), the surface becomes sharper than before.

As weights increase, and given enough layers and neurons, the network can easily produce more complex models. As a result it causes overfitting and a lack of generalization.
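The same comparison can be checked numerically. The following is a minimal sketch with a single sigmoid neuron and hypothetical coefficients. The second set is the first multiplied by 20, so both neurons share the boundary x1 + x2 = 1 and differ only in how sharply the output changes across it.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def neuron(x1, x2, w1, w2, b):
    return sigmoid(w1 * x1 + w2 * x2 + b)

# Hypothetical coefficients (w1, w2, b); `large` is `small` scaled by 20.
small = (1.0, 1.0, -1.0)
large = (20.0, 20.0, -20.0)

# Print the point, then the output of the small and the large neuron.
for x1, x2 in [(0.5, 0.5), (0.6, 0.6), (1.0, 1.0)]:
    print((x1, x2), round(neuron(x1, x2, *small), 3), round(neuron(x1, x2, *large), 3))
```

On the boundary both neurons output 0.5. Just past it, at (0.6, 0.6), the small-weight neuron gives about 0.55 while the large-weight one is already at about 0.98, which is the binary behaviour described above.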

Large coefficients are easily generated.
You just train with too many iterations (inputs, epochs, etc). Train! Train! Train! The coefficients' growth is driven by gradient descent.
For instance, the following is a simple feed-forward network for recognizing handwritten digits with MXNetR. This script outputs the variance of each layer's weights.

require(mxnet)
...

# configure network
data num.round = 10,
  learning.rate=0.07)

# dump weights and biases
params 

When we set num.round = 100 in this script (instead of 10), we get larger, more widely distributed coefficients, as follows.

epoch = 10

epoch = 100
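The same growth shows up in a model far smaller than the MXNet network. The following is a minimal sketch, assuming a one-weight logistic model and a hypothetical, linearly separable 1-D dataset. Because the data can always be classified a little more confidently, plain gradient descent keeps pushing the weight upward for as long as training continues.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

# Hypothetical, linearly separable 1-D data: negative x -> class 0, positive x -> class 1.
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0.0, 0.0, 1.0, 1.0]

def train(epochs, learning_rate=0.5):
    """Plain gradient descent on the mean log-loss of sigmoid(w * x)."""
    w = 0.0
    for _ in range(epochs):
        grad = sum((sigmoid(w * x) - y) * x for x, y in zip(xs, ys)) / len(xs)
        w -= learning_rate * grad
    return w

# Print the number of epochs and the weight reached.
for epochs in (10, 100, 1000):
    print(epochs, round(train(epochs), 3))
```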

There are several regularization techniques to mitigate this overfitting, as follows.

  • Early Stopping - A method that stops training when some condition occurs (e.g., when the error is higher than at the last check)
  • Penalty - A method that adds a penalty term to the objective evaluated in gradient descent to keep the weights from growing (weight decay penalty)
  • Dropout - A method that randomly drops neurons in each training phase. This avoids overfitting by co-adaptation when the network has a complex structure with many layers and neurons. As a result, it achieves model combination (similar to ensemble learning) in an inexpensive way.
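The first two techniques can be sketched on a one-weight logistic model. This is an illustrative sketch in plain Python, not the API of any framework: the parameter names weight_decay and patience and both datasets are assumptions. Here weight_decay adds an L2 penalty to the gradient, and patience stops training once the validation loss has not improved for that many consecutive epochs.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def log_loss(w, data):
    eps = 1e-12  # guards against log(0)
    return -sum(y * math.log(sigmoid(w * x) + eps)
                + (1 - y) * math.log(1 - sigmoid(w * x) + eps)
                for x, y in data) / len(data)

def train(train_data, val_data, max_epochs=500, learning_rate=0.5,
          weight_decay=0.0, patience=None):
    """Gradient descent on a one-weight logistic model.

    weight_decay adds the gradient of (weight_decay / 2) * w**2.
    patience, if set, stops once validation loss has failed to improve
    for that many consecutive epochs and returns the best weight seen.
    Returns (weight, number of epochs run).
    """
    w, best_w, best_val, bad_epochs = 0.0, 0.0, float("inf"), 0
    for epoch in range(max_epochs):
        grad = sum((sigmoid(w * x) - y) * x for x, y in train_data) / len(train_data)
        w -= learning_rate * (grad + weight_decay * w)
        val = log_loss(w, val_data)
        if val < best_val:
            best_w, best_val, bad_epochs = w, val, 0
        else:
            bad_epochs += 1
            if patience is not None and bad_epochs >= patience:
                return best_w, epoch + 1
    return (best_w if patience is not None else w), max_epochs

# Hypothetical data: the training set is separable, the validation set has one noisy label.
train_data = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]
val_data = [(-1.5, 0), (-0.5, 1), (0.5, 1), (1.5, 1)]

print(train(train_data, val_data))                    # no regularization
print(train(train_data, val_data, weight_decay=0.1))  # L2 penalty (weight decay)
print(train(train_data, val_data, patience=5))        # early stopping
```

Both regularized runs end with a smaller weight than the unregularized one: the penalty balances the pull of the training data, and early stopping halts near the point where the validation loss starts to rise.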

The supported regularization methods differ by framework.
Among the Microsoft libraries, you can implement early stopping and dropout with CNTK (see below), but rxNeuralNet (in MicrosoftML) and NET# do not support them.

# Dropout with CNTK (Python)
...

with default_options(activation=relu, pad=True):
  model = Sequential([
    LayerStack(2, lambda : [
      Convolution((3,3), 64),
      Convolution((3,3), 64),
      MaxPooling((3,3), strides=2)
    ]),
    LayerStack(2, lambda i: [
      Dense([256,128][i]), 
      Dropout(0.5)
    ]),
    Dense(4, activation=None)
  ])
...
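To show what a dropout layer does to its inputs, here is a minimal plain-Python sketch of the common "inverted dropout" variant. It illustrates the idea only and is not a description of CNTK's internal implementation; the function name and the sample activations are hypothetical.

```python
import random

def dropout(activations, rate, rng, training=True):
    """Inverted dropout: zero each activation with probability `rate` during
    training and scale the survivors by 1 / (1 - rate); identity at inference."""
    if not training or rate == 0.0:
        return list(activations)
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)
hidden = [0.2, 0.9, 0.5, 0.7, 0.1, 0.4]  # hypothetical hidden-layer outputs

print(dropout(hidden, 0.5, rng))                  # a random subset is zeroed on each call
print(dropout(hidden, 0.5, rng, training=False))  # unchanged at inference time
```

Because a different subset of neurons is dropped at every training step, no neuron can rely on a specific partner being present, which is how dropout discourages co-adaptation.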
