In previous tutorials, I’ve discussed two important loss functions: Multi-class SVM loss and cross-entropy loss (which we usually refer to in conjunction with Softmax classifiers).
To keep our discussion of these loss functions straightforward, I purposely left out an important component: regularization.
While our loss function allows us to determine how well (or poorly) our set of parameters (i.e., weight matrix and bias vector) is performing on a given classification task, the loss function itself does not take into account how the weight matrix “looks”.
What do I mean by “looks”?
Well, keep in mind that there may be infinitely many sets of parameters that obtain reasonable classification accuracy on our dataset — how do we go about choosing a set of parameters that will help ensure our model generalizes well? Or, at the very least, lessen the effects of overfitting?
The answer is regularization.
There are various types of regularization techniques, such as L1 regularization, L2 regularization, and Elastic Net — and in the context of Deep Learning, we also have dropout (although dropout is more-so a technique rather than an actual function).
Inside today’s tutorial, we’ll mainly be focusing on the former techniques rather than the latter. Once we get to more advanced deep learning tutorials, I’ll dedicate time to discussing dropout as well.
In the remainder of this blog post, I’ll be discussing regularization further. I’ll also demonstrate how to update our Multi-class SVM loss and cross-entropy loss functions to include regularization. Finally, we’ll write some Python code to construct a classifier that applies regularization to an image classification problem.
Looking for the source code to this post?
Jump right to the downloads section.
Understanding regularization for image classification and machine learning
The remainder of this blog post is broken into four parts. First, we discuss what regularization is. I then detail how to update our loss function to include the regularization term.
From there, I list out three common types of regularization you’ll likely see when performing image classification and machine learning, especially in the context of neural networks and deep learning.
Finally, I’ll provide a Python + scikit-learn example that demonstrates how to apply regularization to an image classification dataset.
What is regularization and why do we need it?
Regularization helps us tune and control our model complexity, ensuring that our models are better at making (correct) classifications — or, more simply, better at generalizing.
If we don’t apply regularization, our classifiers can easily become too complex and overfit to our training data, in which case we lose the ability to generalize to our testing data (and data points outside the testing set as well).
Similarly, if we apply too much regularization, we run the risk of underfitting. In this case, our model performs poorly on the training data — our classifier is not able to model the relationship between the input data and the output class labels.
Underfitting is relatively easy to catch — you examine the classification accuracy on your training data and take a look at your model.
If your training accuracy is very low and your model is excessively simple, then you are likely a victim of underfitting. The normal remedy to underfitting is to essentially increase the number of parameters in your model, thereby increasing complexity.
Overfitting is a different beast entirely though.
While you can certainly monitor your training accuracy and recognize when your classifier is performing too well on the training data and not well enough on the testing data, overfitting is harder to correct.
There is also the problem that you walk a very fine line with model complexity — if you simplify your model too much, you’ll be back to underfitting.
A better approach is to apply regularization which will help our model generalize and lead to less overfitting.
The best way to understand regularization is to see the implications it has on our loss function, which I discuss in the next section.
Updating our loss function to include regularization
Let’s start with our Multi-class SVM loss function, where s_j = f(x_i, W)_j denotes the predicted score of the i-th data point for the j-th class label:

L_i = \sum_{j \neq y_i} \max(0, s_j - s_{y_i} + 1)
The loss for the entire training set of N data points can then be written as the average of the individual losses:

L = \frac{1}{N} \sum_{i=1}^{N} L_i
Now, let’s say that we have obtained a weight matrix W such that every data point in our training set is classified 100% correctly — this implies that our loss L_i = 0 for all i.
Awesome, we’re getting 100% accuracy — but let me ask you a question about this weight matrix W — is this matrix unique?
Or in other words, are there BETTER choices of W that will improve our model’s ability to generalize and reduce overfitting?
If there is such a W, how do we know? And how can we incorporate this type of penalty into our loss function?
The answer is to define a regularization penalty, a function that operates on our weight matrix W.
The regularization penalty is commonly written as the function R(W).
Below is the most common regularization penalty, L2 regularization:

R(W) = \sum_{i} \sum_{j} W_{i,j}^2
What is this function doing exactly?
To answer this question, if I were to write this function in Python code, it would look something like this using two for loops:
penalty = 0

for i in np.arange(0, W.shape[0]):
    for j in np.arange(0, W.shape[1]):
        penalty += (W[i][j] ** 2)
What we are doing here is looping over all entries in the matrix and taking the sum of squares. There are more efficient ways to compute this, of course; I’m just simplifying the code for the sake of explanation.
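As an aside, here is what the more efficient, vectorized version looks like in NumPy (a minimal sketch; the weight matrix W below is a hypothetical placeholder):

import numpy as np

W = np.random.randn(3, 3073)  # hypothetical weight matrix

# square every entry of W and sum the results, all in one call
penalty = np.sum(W ** 2)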
The sum of squares in the L2 regularization penalty discourages large weights in our weight matrix W, preferring smaller ones.
Why might we want to discourage large weight values?
In short, by penalizing large weights we can improve our ability to generalize, and thereby reduce overfitting.
Think of it this way — the larger a weight value is, the more influence it has on the output prediction. This implies that dimensions with larger weight values can almost singlehandedly control the output prediction of the classifier (provided the weight value is large enough, of course), which will almost certainly lead to overfitting.
To mitigate the effect various dimensions have on our output classifications, we apply regularization, thereby seeking W values that take into account all of the dimensions rather than the few with large values.
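To make this concrete, consider a toy example (the values below are my own, purely for illustration): two weight vectors that produce the exact same score on an input, yet receive very different L2 penalties.

import numpy as np

# a toy feature vector
x = np.array([1.0, 1.0, 1.0, 1.0])

# two candidate weight vectors that yield the same prediction on x
w1 = np.array([0.25, 0.25, 0.25, 0.25])
w2 = np.array([1.0, 0.0, 0.0, 0.0])

print(w1.dot(x), w2.dot(x))              # both scores are 1.0
print(np.sum(w1 ** 2), np.sum(w2 ** 2))  # L2 penalties: 0.25 vs. 1.0

The L2 penalty prefers w1, which spreads its influence across all four dimensions, over w2, which lets a single dimension dominate the prediction.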
In practice, you may find that regularization hurts your training accuracy slightly, but actually increases your testing accuracy (your ability to generalize).
Again, our loss function has the same basic form, only now we add in the regularization term:

L = \frac{1}{N} \sum_{i=1}^{N} L_i + \lambda R(W)
The first term we have already seen before — this is the average loss over all samples in our training set.
The second term is new — this is our regularization term.
The variable λ is a hyperparameter that controls the amount or strength of the regularization we are applying. In practice, both the learning rate and the regularization strength λ are hyperparameters that you’ll spend most of your time tuning.
Expanding the Multi-class SVM loss to include L2 regularization yields the final equation:

L = \frac{1}{N} \sum_{i=1}^{N} \sum_{j \neq y_i} \max(0, s_j - s_{y_i} + 1) + \lambda \sum_{k} \sum_{l} W_{k,l}^2
We can also expand cross-entropy loss in a similar fashion:

L = \frac{1}{N} \sum_{i=1}^{N} -\log\left(\frac{e^{s_{y_i}}}{\sum_{j} e^{s_j}}\right) + \lambda \sum_{k} \sum_{l} W_{k,l}^2
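To see how the pieces fit together in code, here is a minimal NumPy sketch (the weight matrix W, the data loss value dataLoss, and the regularization strength lam below are all hypothetical placeholders; in a real training loop they come from your data and loss function):

import numpy as np

W = np.random.randn(3, 3073) * 0.01  # small, randomly initialized weight matrix
dataLoss = 1.25                      # average data loss over the training set (placeholder)
lam = 0.01                           # regularization strength (our lambda hyperparameter)

# the final loss is the data loss plus the weighted L2 regularization penalty
regLoss = lam * np.sum(W ** 2)
loss = dataLoss + regLoss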
For a more mathematically motivated discussion of regularization, take a look at Karpathy’s excellent slides from the CS231n course.
Types of regularization techniques
In general, you’ll see three common types of regularization.
The first, L2 regularization, we reviewed earlier in this blog post:

R(W) = \sum_{i} \sum_{j} W_{i,j}^2
We also have L1 regularization, which takes the absolute value rather than the square:

R(W) = \sum_{i} \sum_{j} |W_{i,j}|
Elastic Net regularization seeks to combine both L1 and L2 regularization:

R(W) = \sum_{i} \sum_{j} \beta W_{i,j}^2 + |W_{i,j}|
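To keep the three penalties straight, here is a minimal NumPy sketch (the weight matrix W and the Elastic Net mixing parameter beta are hypothetical placeholders):

import numpy as np

W = np.random.randn(3, 3072)  # hypothetical weight matrix
beta = 0.5                    # hypothetical Elastic Net mixing parameter

l2Penalty = np.sum(W ** 2)                        # sum of squared entries
l1Penalty = np.sum(np.abs(W))                     # sum of absolute entries
elasticNet = np.sum(beta * (W ** 2) + np.abs(W))  # weighted combination of both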
In terms of which regularization method you should be using (including none at all), treat this choice as a hyperparameter to optimize over: perform experiments to determine whether regularization should be applied at all and, if so, which method of regularization to use.
Finally, I’ll note that there is another very common type of regularization that we’ll see in a future tutorial — dropout.
Dropout is frequently used in Deep Learning, especially with Convolutional Neural Networks.
Unlike L1, L2, and Elastic Net regularization, which boil down to functions defined in the form R(W), dropout is an actual technique we apply to the connections between nodes in a Neural Network.
As the name implies, connections “dropout” and randomly disconnect during training time, ensuring that no one node in the network becomes fully responsible for “learning” to classify a particular label. I’ll save a more thorough discussion of dropout for a future blog post.
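As a very small preview, the mechanism can be sketched in a few lines of NumPy (this is the common “inverted dropout” formulation, shown purely for illustration; the activations array below is hypothetical):

import numpy as np

# probability of keeping a given node active during training
p = 0.5

# hypothetical activations from a hidden layer of 256 nodes
activations = np.random.randn(256)

# randomly zero out activations, scaling the survivors by 1/p so the
# expected activation stays the same and no rescaling is needed at test time
mask = (np.random.rand(*activations.shape) < p) / p
activations = activations * mask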
Image classification using regularization with Python and scikit-learn
Now that we’ve discussed regularization in the context of machine learning, let’s look at some code that actually performs various types of regularization.
All of the code associated with this blog post, except for the final code block, has already been reviewed extensively in previous blog posts in this series.
Therefore, for a thorough review of the actual process used to extract features and construct the training and testing split for the Kaggle Dogs vs. Cats dataset, I’ll refer you to the introduction to linear classification tutorial.
You can download the full code to this blog post by using the “Downloads” section at the bottom of this tutorial.
The code block below demonstrates how to apply the Stochastic Gradient Descent (SGD) classifier with log-loss (i.e., Softmax) and various types of regularization methods to our dataset:
# import the necessary packages
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import classification_report
from sklearn.cross_validation import train_test_split
from imutils import paths
import numpy as np
import argparse
import imutils
import cv2
import os

def extract_color_histogram(image, bins=(8, 8, 8)):
    # extract a 3D color histogram from the HSV color space using
    # the supplied number of `bins` per channel
    hsv = cv2.cvtColor(image, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [0, 1, 2], None, bins,
        [0, 180, 0, 256, 0, 256])

    # handle normalizing the histogram if we are using OpenCV 2.4.X
    if imutils.is_cv2():
        hist = cv2.normalize(hist)

    # otherwise, perform "in place" normalization in OpenCV 3 (I
    # personally hate the way this is done
    else:
        cv2.normalize(hist, hist)

    # return the flattened histogram as the feature vector
    return hist.flatten()

# construct the argument parse and parse the arguments
ap = argparse.ArgumentParser()
ap.add_argument("-d", "--dataset", required=True,
    help="path to input dataset")
args = vars(ap.parse_args())

# grab the list of images that we'll be describing
print("[INFO] describing images...")
imagePaths = list(paths.list_images(args["dataset"]))

# initialize the data matrix and labels list
data = []
labels = []

# loop over the input images
for (i, imagePath) in enumerate(imagePaths):
    # load the image and extract the class label (assuming that our
    # path as the format: /path/to/dataset/{class}.{image_num}.jpg
    image = cv2.imread(imagePath)
    label = imagePath.split(os.path.sep)[-1].split(".")[0]

    # extract a color histogram from the image, then update the
    # data matrix and labels list
    hist = extract_color_histogram(image)
    data.append(hist)
    labels.append(label)

    # show an update every 1,000 images
    if i > 0 and i % 1000 == 0:
        print("[INFO] processed {}/{}".format(i, len(imagePaths)))

# encode the labels, converting them from strings to integers
le = LabelEncoder()
labels = le.fit_transform(labels)

# partition the data into training and testing splits, using 75%
# of the data for training and the remaining 25% for testing
print("[INFO] constructing training/testing split...")
(trainData, testData, trainLabels, testLabels) = train_test_split(
    np.array(data), labels, test_size=0.25, random_state=42)

# loop over our set of regularizers
for r in (None, "l1", "l2", "elasticnet"):
    # train a Stochastic Gradient Descent classifier using a softmax
    # loss function, the specified regularizer, and 10 epochs
    print("[INFO] training model with `{}` penalty".format(r))
    model = SGDClassifier(loss="log", penalty=r, random_state=967,
        n_iter=10)
    model.fit(trainData, trainLabels)

    # evaluate the classifier
    acc = model.score(testData, testLabels)
    print("[INFO] `{}` penalty accuracy: {:.2f}%".format(r, acc * 100))
On Line 74 we start looping over our regularization methods, including None for no regularization.

We then train our SGDClassifier on Lines 78-80 using the specified regularization method.
Lines 83 and 84 evaluate our trained classifier on the testing data and display the accuracy.
Below are the results of executing the script on my machine:
As we can see, classification accuracy on the testing set improves as regularization is introduced.
We obtain 63.58% accuracy with no regularization. Applying L1 regularization increases our accuracy to 64.02%. L2 regularization improves the accuracy again to 64.38%. Finally, Elastic Net, which combines both L1 and L2 regularization, obtains the highest accuracy of 64.40%.
Does this mean that we should always apply Elastic Net regularization?
Of course not — this is entirely dependent on your dataset and features. You should treat regularization, and any parameters associated with your regularization method, as hyperparameters that need to be searched over.
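For example, here is a minimal sketch of such a search using scikit-learn’s GridSearchCV (the parameter grid is purely illustrative, and trainData/trainLabels are assumed to come from the feature extraction code above; in newer scikit-learn releases, GridSearchCV lives in sklearn.model_selection):

# perform a grid search over the regularization method and strength,
# treating both as hyperparameters to be tuned
from sklearn.grid_search import GridSearchCV
from sklearn.linear_model import SGDClassifier

params = {
    "penalty": [None, "l1", "l2", "elasticnet"],   # the regularization method
    "alpha": [0.00001, 0.0001, 0.001, 0.01],       # the regularization strength
}

model = GridSearchCV(SGDClassifier(loss="log", random_state=967, n_iter=10),
    params, cv=3)
model.fit(trainData, trainLabels)
print("[INFO] best hyperparameters: {}".format(model.best_params_))

Whichever combination of penalty and strength scores best under cross-validation is the one worth carrying forward to the testing set.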
Summary
In today’s blog post, I discussed the concept of regularization and the impact it has on machine learning classifiers. Specifically, we use regularization to control overfitting and underfitting.
Regularization works by examining our weight matrix W and penalizing it if it does not conform to the specified penalty function.
Applying this penalty helps ensure we learn a weight matrix W that generalizes better and thereby helps lessen the negative effects of overfitting.
In practice, you should apply hyperparameter tuning to determine:
- If regularization should be applied, and if so, which regularization method should be used.
- The strength of the regularization (i.e., the λ variable).
You may notice that applying regularization may actually decrease your training set classification accuracy — this is acceptable provided that your testing set accuracy increases, which would be a demonstration of regularization in action (i.e., avoiding/lessening the impact of overfitting).
In next week’s blog post, I’ll be discussing how to build a simple feedforward neural network using Python and Keras. Be sure to enter your email address in the form below to be notified when this blog post goes live!
Downloads: