25. Machine Learning II: Categorization Algorithm¶
In this chapter we cover a classification algorithm that can recognize handwritten digits. This is an example of a supervised, batch-learning, model-based algorithm.
This chapter builds heavily on chapter 3 of [Geron2019] Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow. You can find a link to the GitHub page of this textbook at Geron GitHub.
25.1. A Simple Classification Model¶
25.1.1. Downloading the Data¶
We first import the machine learning libraries that we use throughout this chapter.
import numpy as np
from sklearn.datasets import fetch_openml
import pickle
import matplotlib as mpl
import matplotlib.pyplot as plt
# These are all the functions we use in this chapter.
# We import each of them at the appropriate place further down.
#from sklearn.linear_model import SGDClassifier
#from sklearn.model_selection import cross_val_score
#from sklearn.model_selection import cross_val_predict
#from sklearn.metrics import confusion_matrix
#from sklearn.metrics import precision_score, recall_score
#from sklearn.metrics import f1_score
#from sklearn.metrics import precision_recall_curve
#from sklearn.metrics import roc_curve
#from sklearn.metrics import roc_auc_score
#from sklearn.ensemble import RandomForestClassifier
We next define two functions that will help us save and load data in Python’s pickle format, a binary serialization format that lets us write Python objects to disk and read them back later.
def f_save_obj(obj, name):
    """Save a Python object to disk as <name>.pkl."""
    with open(name + '.pkl', 'wb') as f:
        pickle.dump(obj, f, pickle.HIGHEST_PROTOCOL)

def f_load_obj(name):
    """Load a Python object from the file <name>.pkl."""
    with open(name + '.pkl', 'rb') as f:
        return pickle.load(f)
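As a quick illustration of how these two helpers work together (the file name test_obj here is just a hypothetical placeholder), a round trip looks like this:
# Save a small dictionary to test_obj.pkl and load it back
f_save_obj({'a': 1, 'b': 2}, 'test_obj')
print(f_load_obj('test_obj'))  # {'a': 1, 'b': 2}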
We next download the data from an online repository for machine learning data. The following code sample uses a built-in scikit-learn function to download the dataset. It will take a while, so be patient:
# Download data: takes a while, so only do it once.
# Note: on newer scikit-learn versions you may need to pass as_frame=False
# to get numpy arrays instead of a pandas DataFrame.
mnist = fetch_openml('mnist_784', version=1, cache=True)
f_save_obj(mnist, 'mnist_dict_raw') # save dictionary
The last line saves the data in pickle format so that we only have to download it once.
If you don’t want to download the data from the online repository, you can also download the pickled data directly from my website: Machine_Learning_Number_Data.pkl.
In my example below, the data is stored in a subfolder called Lecture_MachineLearning_2 (note the folder name in the load command). After loading the data, we extract the numpy arrays for our labels (y values, stored in a vector) and our attributes (X values, stored in a matrix, i.e., a 2-dimensional array).
# Load data
mydata = f_load_obj('Lecture_MachineLearning_2/mnist_dict_raw') # load dictionary
X, y = mydata["data"], mydata["target"]
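A quick sanity check of the dimensions: MNIST contains 70,000 images of 28 × 28 pixels, so X should have 70,000 rows and 784 (= 28 × 28) columns, and y should contain 70,000 labels.
print(X.shape)  # (70000, 784): one row of 784 pixel values per image
print(y.shape)  # (70000,): one label per image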
25.1.2. Training the Model¶
We can now inspect one of the 70,000 digits, say the element at index 36,000 (i.e., the 36,001st digit), and plot it using the imshow command from the matplotlib library.
some_digit = X[36000]
some_digit_image = some_digit.reshape(28, 28)
plt.imshow(some_digit_image, cmap=mpl.cm.binary, interpolation="nearest")
plt.axis("off")
plt.show()
As you can see, the 36,001st element is the handwritten number 9.
If you inspect the y-vector, you will see that the labels are stored as strings. However, the machine learning commands from the sklearn library require numerical inputs. We therefore have to transform the strings into integer numbers.
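A minimal check with the digit we plotted above illustrates the problem:
print(y[36000])       # prints 9 ...
print(y[36000] == 9)  # ... but this is False: the label is the string '9', not a number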
y = y.astype(np.uint8) # label is string, transform into number
You can now check the type of the vector y and convince yourself that it is still a numpy array, but that its values are now stored as integers.
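For example, after the conversion:
print(type(y))        # <class 'numpy.ndarray'>
print(y.dtype)        # uint8, i.e., 8-bit unsigned integers
print(y[36000] == 9)  # True now that the labels are numbers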
We next split the sample into a training sample that we use for training (i.e., estimating) our model and a test sample that we use to evaluate how good our predictions are. We also randomly shuffle the entries in the training data.
X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]
shuffle_index = np.random.permutation(60000)
X_train, y_train = X_train[shuffle_index], y_train[shuffle_index]
For our first exercise, we will train an algorithm that tries to identify the number 5. We therefore generate a dummy (Boolean) vector that is True whenever the handwritten digit is a 5 and False whenever it is not.
# Binary classifier
y_train_5 = (y_train == 5)
y_test_5 = (y_test == 5)
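Only a small fraction of the training digits are actually 5s. We can verify this by counting the True entries; the total of 5,421 matches the second row of the confusion matrix we compute further below.
print(y_train_5.sum())   # 5421 fives among the 60,000 training digits
print(y_train_5.mean())  # about 0.09, i.e., roughly 9 percent of the sample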
We are now ready to define and subsequently train (i.e., estimate) our model using the training sample, that is, the first 60,000 observations of the full dataset that we downloaded earlier.
from sklearn.linear_model import SGDClassifier

# tol=-np.inf disables early stopping, so the classifier runs for all 5 epochs
sgd_clf = SGDClassifier(max_iter=5, tol=-np.inf, random_state=42)
sgd_clf.fit(X_train, y_train_5)
SGDClassifier(alpha=0.0001, average=False, class_weight=None,
              early_stopping=False, epsilon=0.1, eta0=0.0, fit_intercept=True,
              l1_ratio=0.15, learning_rate='optimal', loss='hinge', max_iter=5,
              n_iter_no_change=5, n_jobs=None, penalty='l2', power_t=0.5,
              random_state=42, shuffle=True, tol=-inf, validation_fraction=0.1,
              verbose=0, warm_start=False)
25.1.3. Making Predictions¶
We can now use our trained (or estimated) classification model to predict whether a handwritten digit is the number 5 or not. Let’s try this with the handwritten digit at index 36,000 from our training set. We already know that this digit is a 9, but let’s see whether our machine learning algorithm can correctly identify it as not a 5. Remember, all that our algorithm can recognize at the moment (with some error, of course) is the number 5. So if we hand in the pixel data for a handwritten 9, our prediction should say False because it is NOT the number 5.
print(sgd_clf.predict(some_digit.reshape(1,-1)))
[False]
Et voilà, our trained model has correctly identified the digit as NOT a 5.
We next run through a battery of checks that determine how well our model can identify handwritten 5s. First, we use 3-fold cross-validation to measure the model’s accuracy, i.e., the share of correct predictions:
from sklearn.model_selection import cross_val_score
print(cross_val_score(sgd_clf, X_train, y_train_5, cv=3, scoring="accuracy"))
[0.96285 0.9648 0.96285]
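An accuracy above 96% looks impressive, but accuracy is a misleading metric on a skewed dataset like this one: since only about 9% of the images are 5s, a "model" that always predicts not-5 would already be right about 91% of the time. A minimal sketch:
# Baseline: always predict "not 5" and compute the resulting accuracy
y_never_5 = np.zeros(len(y_train_5), dtype=bool)
print((y_never_5 == y_train_5).mean())  # about 0.91 (= 54,579 / 60,000)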
The cross_val_predict() function also performs 3-fold cross-validation, but instead of the evaluation scores it returns the predictions themselves, where each prediction is made on the fold that the model did not see during training:
from sklearn.model_selection import cross_val_predict
y_train_pred = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3)
With these predictions we can compute the confusion matrix. Each row corresponds to an actual class and each column to a predicted class: the first row shows that 53,689 non-5s were correctly classified (true negatives) while 890 were wrongly classified as 5s (false positives); the second row shows that 1,300 actual 5s were missed (false negatives) while 4,121 were correctly identified (true positives).
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_train_5, y_train_pred))
[[53689   890]
 [ 1300  4121]]
A perfect classifier would make no false predictions at all, so its confusion matrix would have nonzero entries only on the main diagonal:
y_train_perfect_predictions = y_train_5
print(confusion_matrix(y_train_5, y_train_perfect_predictions))
[[54579     0]
 [    0  5421]]
Two metrics summarize the confusion matrix: the precision (the share of predicted 5s that really are 5s) and the recall (the share of actual 5s that the model detects):
from sklearn.metrics import precision_score, recall_score
print(precision_score(y_train_5, y_train_pred))
print(recall_score(y_train_5, y_train_pred))
0.8223907403711834
0.7601918465227818
The F1 score combines precision and recall into a single number by taking their harmonic mean:
from sklearn.metrics import f1_score
print(f1_score(y_train_5, y_train_pred))
0.790069018404908
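These numbers follow directly from the confusion matrix above. With TP = 4121 true positives, FP = 890 false positives, and FN = 1300 false negatives, precision is TP / (TP + FP), recall is TP / (TP + FN), and F1 is their harmonic mean. A quick check by hand:
# Reproduce the three scores from the confusion matrix entries
TP, FP, FN = 4121, 890, 1300
precision = TP / (TP + FP)  # 0.8224...
recall = TP / (TP + FN)     # 0.7602...
f1 = 2 * precision * recall / (precision + recall)  # 0.7901...
print(precision, recall, f1)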
25.1.4. Precision Recall Trade-off¶
Where does the trade-off between precision and recall come from? The classifier bases its decisions on a score computed by its decision_function() method: instances whose score exceeds a threshold are classified as positive. Instead of calling predict(), we can access this score directly:
y_scores = sgd_clf.decision_function([some_digit])
print(y_scores)
[-408634.50109637]
With the default threshold of 0 the prediction is False, and raising the threshold to 200,000 does not change this, because the score of our digit (a 9, not a 5) is far below zero:
threshold = 0
y_some_digit_pred = (y_scores > threshold)
print(y_some_digit_pred)
[False]
threshold = 200000
y_some_digit_pred = (y_scores > threshold)
print(y_some_digit_pred)
[False]
To choose a suitable threshold, we need the decision scores of all instances in the training sample, which we again obtain with cross_val_predict():
y_scores = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3,
                             method="decision_function")

# hack to work around issue #9589 in Scikit-Learn 0.19.0
if y_scores.ndim == 2:
    y_scores = y_scores[:, 1]
With these scores we can now use the precision_recall_curve() function from the sklearn.metrics library, which computes precision and recall for every possible threshold:
from sklearn.metrics import precision_recall_curve
precisions, recalls, thresholds = precision_recall_curve(y_train_5, y_scores)
def plot_precision_recall_vs_threshold(precisions, recalls, thresholds):
    plt.plot(thresholds, precisions[:-1], "b--", label="Precision", linewidth=2)
    plt.plot(thresholds, recalls[:-1], "g-", label="Recall", linewidth=2)
    plt.xlabel("Threshold", fontsize=16)
    plt.legend(loc="upper left", fontsize=16)
    plt.ylim([0, 1])
plt.figure(figsize=(8, 4))
plot_precision_recall_vs_threshold(precisions, recalls, thresholds)
plt.xlim([-700000, 700000])
# save_fig("precision_recall_vs_threshold_plot")
plt.show()
# Sanity check: the default predictions correspond to a threshold of 0
print((y_train_pred == (y_scores > 0)).all())
Suppose we aim for roughly 90% precision. The plot suggests a threshold of about 70,000, so we simply classify as a 5 every digit whose score exceeds that value:
y_train_pred_90 = (y_scores > 70000)
print(precision_score(y_train_5, y_train_pred_90))
print(recall_score(y_train_5, y_train_pred_90))
0.896551724137931
0.6378896882494005
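Reading the threshold off the plot by eye is imprecise. With the arrays returned by precision_recall_curve() we can locate the lowest threshold that achieves at least 90% precision directly; the following sketch uses np.argmax, which returns the first index at which the condition is True:
# Lowest threshold whose cross-validated precision is at least 90%
threshold_90 = thresholds[np.argmax(precisions >= 0.90)]
y_pred_90 = (y_scores >= threshold_90)
print(threshold_90)
print(precision_score(y_train_5, y_pred_90))  # at least 0.90 by construction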
We can also plot precision directly against recall:
def plot_precision_vs_recall(precisions, recalls):
    plt.plot(recalls, precisions, "b-", linewidth=2)
    plt.xlabel("Recall", fontsize=16)
    plt.ylabel("Precision", fontsize=16)
    plt.axis([0, 1, 0, 1])
plt.figure(figsize=(8, 6))
plot_precision_vs_recall(precisions, recalls)
# save_fig("precision_vs_recall_plot")
plt.show()
25.1.5. ROC curves¶
The receiver operating characteristic (ROC) curve plots the true positive rate (another name for recall) against the false positive rate, i.e., the share of non-5s that are wrongly classified as 5s, for every possible threshold:
from sklearn.metrics import roc_curve
fpr, tpr, thresholds = roc_curve(y_train_5, y_scores)
def plot_roc_curve(fpr, tpr, label=None):
    plt.plot(fpr, tpr, linewidth=2, label=label)
    plt.plot([0, 1], [0, 1], 'k--')  # dashed diagonal = purely random classifier
    plt.axis([0, 1, 0, 1])
    plt.xlabel('False Positive Rate', fontsize=16)
    plt.ylabel('True Positive Rate', fontsize=16)
plt.figure(figsize=(8, 6))
plot_roc_curve(fpr, tpr)
# save_fig("roc_curve_plot")
plt.show()
One way to compare classifiers is to measure the area under the curve (AUC). A perfect classifier has an AUC of 1, while a purely random one has an AUC of 0.5:
from sklearn.metrics import roc_auc_score
print(roc_auc_score(y_train_5, y_scores))
0.9608926315517949
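As a plausibility check, the AUC is just the area under the ROC curve we plotted above, so the generic auc() function from sklearn.metrics applied to our fpr and tpr arrays should reproduce the same value:
from sklearn.metrics import auc
print(auc(fpr, tpr))  # same value as roc_auc_score above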
Note
We set n_estimators=10 to avoid a warning about the fact that its default value will be set to 100 in Scikit-Learn 0.22.
Let us now train a RandomForestClassifier and compare its ROC curve and AUC to those of the SGDClassifier. The RandomForestClassifier class does not have a decision_function() method; instead, its predict_proba() method returns the class probabilities, and we use the probability of the positive class as the score:
from sklearn.ensemble import RandomForestClassifier
forest_clf = RandomForestClassifier(n_estimators=10, random_state=42)
y_probas_forest = cross_val_predict(forest_clf, X_train, y_train_5, cv=3,
                                    method="predict_proba")
y_scores_forest = y_probas_forest[:, 1] # score = proba of positive class
fpr_forest, tpr_forest, thresholds_forest = roc_curve(y_train_5, y_scores_forest)
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, "b:", linewidth=2, label="SGD")
plot_roc_curve(fpr_forest, tpr_forest, "Random Forest")
plt.legend(loc="lower right", fontsize=16)
# save_fig("roc_curve_comparison_plot")
plt.show()
print(roc_auc_score(y_train_5, y_scores_forest))
0.9932680166070984
y_train_pred_forest = cross_val_predict(forest_clf, X_train, y_train_5, cv=3)
precision_score(y_train_5, y_train_pred_forest)  # not printed; wrap in print() to display it
print(recall_score(y_train_5, y_train_pred_forest))
0.8271536616860358
25.2. Key Concepts and Summary¶
Note
Machine learning models for classification are trained on a training sample and evaluated on a separate test sample.
The basic workflow is: download and prepare the data, train (i.e., estimate) a classifier, and assess its predictions with cross-validation.
Some central evaluation tools are the confusion matrix, precision, recall, the F1 score, the precision-recall trade-off, and the ROC curve with its AUC.
25.3. Self-Check Questions¶
Todo
Why is regression analysis a form of machine learning? What type of machine learning is it?
- Geron2019
Geron, Aurelien (2019), Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow, O’Reilly, 2nd edition.