
Monday, September 19, 2016

Understanding regularization for image classification and machine learning


In previous tutorials, I’ve discussed two important loss functions: Multi-class SVM loss and cross-entropy loss (which we usually refer to in conjunction with Softmax classifiers).

In order to keep our discussion of these loss functions straightforward, I purposely left out an important component: regularization.

While our loss function allows us to determine how well (or poorly) our set of parameters (i.e., weight matrix and bias vector) is performing on a given classification task, the loss function itself does not take into account how the weight matrix “looks”.

What do I mean by “looks”?

Well, keep in mind that there may be an infinite set of parameters that obtain reasonable classification accuracy on our dataset — how do we go about choosing a set of parameters that will help ensure our model generalizes well? Or, at the very least, lessen the effects of overfitting?

The answer is regularization.

There are various types of regularization techniques, such as L1 regularization, L2 regularization, and Elastic Net — and in the context of Deep Learning, we also have dropout (although dropout is more of a technique than an actual function).

Inside today’s tutorial, we’ll mainly be focusing on the former rather than the latter. Once we get to more advanced deep learning tutorials, I’ll dedicate time to discussing dropout as well.

In the remainder of this blog post, I’ll be discussing regularization further. I’ll also demonstrate how to update our Multi-class SVM loss and cross-entropy loss functions to include regularization. Finally, we’ll write some Python code to construct a classifier that applies regularization to an image classification problem.

Looking for the source code to this post?
Jump right to the downloads section.

Understanding regularization for image classification and machine learning

The remainder of this blog post is broken into four parts. First, we discuss what regularization is. I then detail how to update our loss function to include the regularization term.

From there, I list out three common types of regularization you’ll likely see when performing image classification and machine learning, especially in the context of neural networks and deep learning.

Finally, I’ll provide a Python + scikit-learn example that demonstrates how to apply regularization to an image classification dataset.

What is regularization and why do we need it?

Regularization helps us tune and control our model complexity, ensuring that our models are better at making (correct) classifications — or more simply, ensuring that they are able to generalize.

If we don’t apply regularization, our classifiers can easily become too complex and overfit to our training data, in which case we lose the ability to generalize to our testing data (and data points outside the testing set as well).

Similarly, without applying regularization we also run the risk of underfitting. In this case, our model performs poorly on the training data — our classifier is not able to model the relationship between the input data and the output class labels.

Underfitting is relatively easy to catch — you examine the classification accuracy on your training data and take a look at your model.

If your training accuracy is very low and your model is excessively simple, then you are likely a victim of underfitting. The normal remedy to underfitting is to essentially increase the number of parameters in your model, thereby increasing complexity.

Overfitting is a different beast entirely though.

While you can certainly monitor your training and testing accuracy and recognize when your classifier is performing too well on the training data and not well enough on the testing data, overfitting is harder to correct.

There is also the problem that you walk a very fine line with model complexity — if you simplify your model too much, then you’ll be back to underfitting.

A better approach is to apply regularization, which will help our model generalize and lead to less overfitting.

The best way to understand regularization is to see the implications it has on our loss function, which I discuss in the next section.

Updating our loss function to include regularization

Let’s start with our Multi-class SVM loss function:

L_{i} = \sum_{j \neq y_{i}} max(0, s_{j} - s_{y_{i}} + 1)

The loss for the entire training set can be written as:

L = \frac{1}{N} \sum^{N}_{i=1} L_{i}

Now, let’s say that we have obtained a weight matrix W such that every data point in our training set is classified 100% correctly — this implies that our loss L = 0 for all L_{i}.

Awesome, we’re getting 100% accuracy — but let me ask you a question about this weight matrix W — is this matrix unique?

Or in other words, are there BETTER choices of W that will improve our model’s ability to generalize and reduce overfitting?

If there is such a W, how do we know? And how can we incorporate this type of penalty into our loss function?

The answer is to define a regularization penalty, a function that operates on our weight matrix W.

The regularization penalty is commonly written as the function R(W).

Below is the most common regularization penalty, L2 regularization:

R(W) = \sum_{i}\sum_{j} W_{i,j}^{2}

What is this function doing exactly?

To answer this question, if I were to write this function in Python code, it would look something like this using two for loops:
penalty = 0

for i in np.arange(0, W.shape[0]):
        for j in np.arange(0, W.shape[1]):
                penalty += (W[i][j] ** 2)

What we are doing here is looping over all entries in the matrix and taking the sum of squares. There are more efficient ways to compute this, of course — I’m just simplifying the code for the sake of explanation.
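For completeness, the same computation can be written in one vectorized line — a quick sketch, where W below is just a randomly initialized stand-in for the weight matrix:

import numpy as np

# a stand-in weight matrix (e.g., 3 classes x 4 feature dimensions)
W = np.random.randn(3, 4)

# vectorized L2 penalty: square every entry and sum the result
penalty = np.sum(W ** 2)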

The sum of squares in the L2 regularization penalty discourages large weights in our weight matrix W, preferring smaller ones.

Why might we want to discourage large weight values?

In short, by penalizing large weights we can improve our ability to generalize, and thereby reduce overfitting.

Think of it this way — the larger a weight value is, the more influence it has on the output prediction. This implies that dimensions with larger weight values can almost singlehandedly control the output prediction of the classifier (provided the weight value is large enough, of course), which will almost certainly lead to overfitting.
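As a quick, hypothetical example, consider an input of all ones scored by two different weight vectors — both produce exactly the same prediction, but the L2 penalty strongly prefers the one that spreads its influence across every dimension:

import numpy as np

x = np.array([1.0, 1.0, 1.0, 1.0])
w1 = np.array([1.0, 0.0, 0.0, 0.0])      # relies on a single dimension
w2 = np.array([0.25, 0.25, 0.25, 0.25])  # uses every dimension equally

# identical scores...
print(np.dot(w1, x), np.dot(w2, x))      # 1.0 1.0

# ...but very different L2 penalties
print(np.sum(w1 ** 2), np.sum(w2 ** 2))  # 1.0 0.25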

To mitigate the effect various dimensions have on our output classifications, we apply regularization, thereby seeking W values that take into account all of the dimensions rather than the few with large values.

In practice, you may find that regularization hurts your training accuracy slightly, but actually increases your testing accuracy (your ability to generalize).

Again, our loss function has the following basic form, but now we just add in regularization:

L = \frac{1}{N} \sum^{N}_{i=1} L_{i} + \lambda R(W)

The first term we have already seen before — this is the average loss over all samples in our training set.

The second term is new — this is our regularization term.

The \lambda variable is a hyperparameter that controls the amount or strength of the regularization we are applying. In practice, both the learning rate \alpha and regularization term \lambda are hyperparameters that you’ll spend most of your time tuning.

Expanding the Multi-class SVM loss to include regularization yields the final equation:

L =\frac{1}{N} \sum^{N}_{i=1} \sum_{j \neq y_{i}} [max(0, s_{j} - s_{y_{i}} + 1)] + \lambda \sum_{i} \sum_{j} W_{i, j}^{2}
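As a quick illustration (not the exact implementation from any library), the regularized Multi-class SVM loss above could be computed in NumPy as in the sketch below — the function name, scores (an N x C matrix of class scores), y (the ground-truth labels), and lamb (playing the role of \lambda) are my own names for this example:

import numpy as np

def regularized_svm_loss(scores, y, W, lamb=0.01):
        # scores: N x C matrix of class scores, y: N ground-truth labels
        N = scores.shape[0]
        correct = scores[np.arange(N), y].reshape(-1, 1)

        # hinge loss with a margin of 1; the correct class contributes nothing
        margins = np.maximum(0, scores - correct + 1)
        margins[np.arange(N), y] = 0
        data_loss = np.sum(margins) / N

        # add the weighted L2 regularization penalty on the weight matrix
        return data_loss + lamb * np.sum(W ** 2)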

We can also expand cross-entropy loss in a similar fashion:

L =\frac{1}{N} \sum^{N}_{i=1} [-log(e^{s_{y_{i}}} / \sum_{j} e^{s_{j}})] +\lambda \sum_{i} \sum_{j} W_{i, j}^{2}
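And a similar sketch for the regularized cross-entropy loss (again, purely illustrative; subtracting the row maximum is only there for numerical stability and does not change the result):

import numpy as np

def regularized_cross_entropy_loss(scores, y, W, lamb=0.01):
        # scores: N x C matrix of class scores, y: N ground-truth labels
        N = scores.shape[0]

        # softmax probabilities (subtract the row max for numerical stability)
        exp_scores = np.exp(scores - scores.max(axis=1, keepdims=True))
        probs = exp_scores / exp_scores.sum(axis=1, keepdims=True)

        # average negative log-probability of the correct classes
        data_loss = -np.mean(np.log(probs[np.arange(N), y]))

        # add the weighted L2 regularization penalty on the weight matrix
        return data_loss + lamb * np.sum(W ** 2)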

For a more mathematically motivated discussion of regularization, take a look at Karpathy’s excellent slides from the CS231n course.

Types of regularization techniques

In general, you’ll see three common types of regularization.

The first, which we reviewed earlier in this blog post, is L2 regularization:

R(W) = \sum_{i}\sum_{j} W_{i,j}^{2}

We also have L1 regularization which takes the absolute value rather than the square:

R(W) = \sum_{i}\sum_{j} |W_{i,j}|

Elastic Net regularization seeks to combine both L1 and L2 regularization:

R(W) = \sum_{i}\sum_{j} \beta W_{i,j}^{2} + |W_{i,j}|
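Written as (hypothetical) NumPy code, the three penalties above might look like this, where W is a stand-in weight matrix and beta is the same mixing parameter from the Elastic Net equation:

import numpy as np

# a stand-in weight matrix and Elastic Net mixing parameter
W = np.random.randn(3, 4)
beta = 0.5

l2_penalty = np.sum(W ** 2)
l1_penalty = np.sum(np.abs(W))
elastic_net_penalty = np.sum(beta * (W ** 2) + np.abs(W))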

In terms of which regularization method you should be using (including none at all), treat this choice as a hyperparameter: perform experiments to determine whether regularization should be applied at all and, if so, which method of regularization to use.

Finally, I’ll note that there is another very common type of regularization that we’ll see in a future tutorial — dropout.

Dropout is frequently used in Deep Learning, especially with Convolutional Neural Networks.

Unlike L1, L2, and Elastic Net regularization, which boil down to functions defined in the form R(W), dropout is an actual technique we apply to the connections between nodes in a Neural Network.

As the name implies, connections “dropout” and randomly disconnect during training time, ensuring that no one node in the network becomes fully responsible for “learning” to classify a particular label. I’ll save a more thorough discussion of dropout for a future blog post.
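While I’ll save the details for that future post, the core idea behind dropout can be sketched in just a few lines of NumPy — during training, a random binary mask zeroes out a fraction of a layer’s activations (the probability p, the layer sizes, and the variable names below are purely illustrative):

import numpy as np

p = 0.5  # probability of keeping a given node active

# hypothetical activations from a hidden layer (batch of 32, 128 nodes)
activations = np.random.randn(32, 128)

# randomly "drop" nodes during training, dividing by p ("inverted
# dropout") so the expected activation is unchanged at test time
mask = (np.random.rand(*activations.shape) < p) / p
activations_train = activations * mask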

Image classification using regularization with Python and scikit-learn

Now that we’ve discussed regularization in the context of machine learning, let’s look at some code that actually performs various types of regularization.

All of the code associated with this blog post, except for the final code block, has already been reviewed extensively in previous blog posts in this series.

Therefore, for a thorough review of the actual process used to extract features and construct the training and testing split for the Kaggle Dogs vs. Cats dataset, I’ll refer you to the introduction to linear classification tutorial.

You can download the full code to this blog post by using the “Downloads” section at the bottom of this tutorial.

The code block below demonstrates how to apply the Stochastic Gradient Descent (SGD) classifier with log-loss (i.e., Softmax) and various types of regularization methods to our dataset:

# import the necessary packages
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import classification_report
from sklearn.cross_validation import train_test_split
from imutils import paths
import numpy as np
import argparse
import imutils
import cv2
import os

def extract_color_histogram(image, bins=(8, 8, 8)):
        # extract a 3D color histogram from the HSV color space using
        # the supplied number of `bins` per channel
        hsv = cv2.cvtColor(image, cv2.COLOR_BGR2HSV)
        hist = cv2.calcHist([hsv], [0, 1, 2], None, bins,
                [0, 180, 0, 256, 0, 256])

        # handle normalizing the histogram if we are using OpenCV 2.4.X
        if imutils.is_cv2():
                hist = cv2.normalize(hist)

        # otherwise, perform "in place" normalization in OpenCV 3 (I
        # personally hate the way this is done)
        else:
                cv2.normalize(hist, hist)

        # return the flattened histogram as the feature vector
        return hist.flatten()

# construct the argument parse and parse the arguments
ap = argparse.ArgumentParser()
ap.add_argument("-d", "--dataset", required=True,
        help="path to input dataset")
args = vars(ap.parse_args())

# grab the list of images that we'll be describing
print("[INFO] describing images...")
imagePaths = list(paths.list_images(args["dataset"]))

# initialize the data matrix and labels list
data = []
labels = []

# loop over the input images
for (i, imagePath) in enumerate(imagePaths):
        # load the image and extract the class label (assuming that our
        # path has the format: /path/to/dataset/{class}.{image_num}.jpg)
        image = cv2.imread(imagePath)
        label = imagePath.split(os.path.sep)[-1].split(".")[0]

        # extract a color histogram from the image, then update the
        # data matrix and labels list
        hist = extract_color_histogram(image)
        data.append(hist)
        labels.append(label)

        # show an update every 1,000 images
        if i > 0 and i % 1000 == 0:
                print("[INFO] processed {}/{}".format(i, len(imagePaths)))

# encode the labels, converting them from strings to integers
le = LabelEncoder()
labels = le.fit_transform(labels)

# partition the data into training and testing splits, using 75%
# of the data for training and the remaining 25% for testing
print("[INFO] constructing training/testing split...")
(trainData, testData, trainLabels, testLabels) = train_test_split(
        np.array(data), labels, test_size=0.25, random_state=42)

# loop over our set of regularizers
for r in (None, "l1", "l2", "elasticnet"):
        # train a Stochastic Gradient Descent classifier using a softmax
        # loss function, the specified regularizer, and 10 epochs
        print("[INFO] training model with `{}` penalty".format(r))
        model = SGDClassifier(loss="log", penalty=r, random_state=967,
                n_iter=10)
        model.fit(trainData, trainLabels)

        # evaluate the classifier
        acc = model.score(testData, testLabels)
        print("[INFO] `{}` penalty accuracy: {:.2f}%".format(r, acc * 100))

On Line 74 we start looping over our regularization methods, including None for no regularization.

We then train our SGDClassifier on Lines 78-80 using the specified regularization method.

Lines 83 and 84 evaluate our trained classifier on the testing data and display the accuracy.
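To run the script yourself, supply the path to your copy of the Kaggle Dogs vs. Cats images via the --dataset switch — for example, something like python regularization.py --dataset kaggle_dogs_vs_cats, where the script and dataset directory names are just placeholders for whatever the download and your local setup use.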

Below I have included a screenshot from executing the script on my machine:

Figure 1: Applying no regularization, L1 regularization, L2 regularization, and Elastic Net regularization to our classification project.

As we can see, classification accuracy on the testing set improves as regularization is introduced.

We obtain 63.58% accuracy with no regularization. Applying L1 regularization increases our accuracy to 64.02%, and L2 regularization improves it again to 64.38%. Finally, Elastic Net, which combines both L1 and L2 regularization, obtains the highest accuracy of 64.40%.

Does this mean that we should always apply Elastic Net regularization?

Of course not — this is entirely dependent on your dataset and features. You should treat regularization, and any parameters associated with your regularization method, as hyperparameters that need to be searched over.
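As a rough sketch of what such a search might look like, the loop below reuses the SGDClassifier setup from the script above and also varies scikit-learn’s alpha parameter, which plays the role of the \lambda regularization strength (the grid values are arbitrary, and in practice you would tune against a dedicated validation set rather than the testing data):

# a simple grid search over the regularization method and strength,
# assuming trainData/trainLabels and testData/testLabels from above
best = (None, None, -1)

for penalty in (None, "l1", "l2", "elasticnet"):
        for alpha in (0.0001, 0.001, 0.01):
                model = SGDClassifier(loss="log", penalty=penalty,
                        alpha=alpha, random_state=967, n_iter=10)
                model.fit(trainData, trainLabels)
                acc = model.score(testData, testLabels)

                # keep track of the best performing combination
                if acc > best[2]:
                        best = (penalty, alpha, acc)

print("[INFO] best penalty: {}, alpha: {}, accuracy: {:.2f}%".format(
        best[0], best[1], best[2] * 100))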

Summary

In today’s blog post, I discussed the concept of regularization and the impact it has on machine learning classifiers. Specifically, we use regularization to control overfitting and underfitting.

Regularization works by examining our weight matrix W and penalizing it if it does not conform to the specified penalty function.

Applying this penalty helps ensure we learn a weight matrix W that generalizes better and thereby helps lessen the negative effects of overfitting.

In practice, you should apply hyperparameter tuning to determine:

  1. If regularization should be applied, and if so, which regularization method should be used.
  2. The strength of the regularization (i.e., the \lambda variable).

You may notice that applying regularization may actually decrease your training set classification accuracy — this is acceptable provided that your testing set accuracy increases, which would be a demonstration of regularization in action (i.e., avoiding/lessening the impact of overfitting).

In next week’s blog post, I’ll be discussing how to build a simple feedforward neural network using Python and Keras. Be sure to enter your email address in the form below to be notified when this blog post goes live!

Downloads:

If you would like to download the code and images used in this post, please enter your email address in the form below. Not only will you get a .zip of the code, I’ll also send you a FREE 11-page Resource Guide on Computer Vision and Image Search Engines, including exclusive techniques that I don’t post on this blog! Sound good? If so, enter your email address and I’ll send you the code immediately!



