We build a logistic regression classifier with ridge (L2) regularization. We test it on the MNIST data set by training ten one-versus-all classifiers (0 versus all, 1 versus all, ..., 9 versus all) in a loop over the digits. For each test image we then score it with every classifier and declare the class with the highest predicted probability the winner.
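As a toy sketch of that decision rule (the probabilities below are made up purely for illustration; the real scores are computed later from the trained weights):
import numpy as np

#Hypothetical one-vs-all probabilities for a single test image, one per digit 0-9.
probs = np.array([0.02, 0.01, 0.05, 0.10, 0.03, 0.04, 0.01, 0.85, 0.20, 0.02])
print(np.argmax(probs))   #the winning class: 7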
Approach for this method:
Importing Packages and Data (Note: data is already 0-1 normalized)
import os
import gzip
import math
import operator
import sklearn.model_selection
import random
import warnings
warnings.filterwarnings('ignore')
import matplotlib.pyplot as matplot
import matplotlib
%matplotlib inline
import pandas as pd
import numpy as np
import pickle as cPickle
import seaborn as sb
from time import time
from itertools import chain
from subprocess import check_output
from collections import Counter
from PIL import Image
from math import ceil
from scipy.stats import mode
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from pandas.plotting import parallel_coordinates
from scipy import optimize as op
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets('MNIST_data/')
train = mnist.train.images
validation = mnist.validation.images
test = mnist.test.images
trlab = mnist.train.labels
vallab = mnist.validation.labels
tslab = mnist.test.labels
train = np.concatenate((train, validation), axis=0)
trlab = np.concatenate((trlab, vallab), axis=0)
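Note that tensorflow.examples.tutorials.mnist ships only with TensorFlow 1.x. If that loader is unavailable, one possible substitute (an assumption on my part, not what was used here) is scikit-learn's fetch_openml, with an explicit 0-1 rescale to match the note above:
from sklearn.datasets import fetch_openml

#OpenML's mnist_784 returns pixel values in 0-255 and string labels;
#the conventional split puts the first 60000 rows in train and the rest in test.
mnist_raw = fetch_openml('mnist_784', as_frame = False)
images = mnist_raw.data.astype(np.float32) / 255.0
labels = mnist_raw.target.astype(int)
train, test = images[:60000], images[60000:]
trlab, tslab = labels[:60000], labels[60000:]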
Preparing data for the Logistic regression function
#Number of examples
m = train.shape[0]
#Features
n = train.shape[1]
#Number of classes
k = 10
lab = [0,1,2,3,4,5,6,7,8,9]
intercept = np.ones((train.shape[0], 1))
X = np.hstack((intercept, train))
intercept = np.ones((test.shape[0], 1))
test = np.hstack((intercept, test))
We prepend a column of 1s to each 784-dimensional feature vector (the pixel values) so that the first weight acts as the bias term of the classifier.
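In other words, with the constant 1 prepended, the bias is absorbed into the weight vector:
$$[1, x_{1}, \dots, x_{784}] \cdot [w_{0}, w_{1}, \dots, w_{784}]^{T} = w_{0} + \sum_{j=1}^{784} w_{j} x_{j}$$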
X.shape
def error_rate(confusion_matrix):
    #Per-class error rate: 1 - (correct predictions / total true examples of that class)
    a = confusion_matrix
    b = a.sum(axis=1)               #row sums = number of true examples per digit
    df = []
    for i in range(0, 10):
        temp = 1 - a[i][i] / b[i]   #fraction of digit i that was misclassified
        df.append(temp)
    df = pd.DataFrame(df)
    df.columns = ['% Error rate']
    return df * 100
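A quick illustration of error_rate on a synthetic confusion matrix (the counts are made up: 91 correct and 9 misclassified examples per digit, so every class should report a 9% error rate):
toy_cm = np.eye(10, dtype = int) * 90 + np.ones((10, 10), dtype = int)
error_rate(toy_cm)   #each row sums to 100 with 91 on the diagonal -> 9% error everywhere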
The objective, its gradient, and the resulting update rule for regularized logistic regression are given below (labels $y_{i} \in \{-1, +1\}$):
$$L(W) = \sum_{i=1}^{N} \ln\left\{1 + \exp(-y_{i}W^{T}X_{i})\right\} + \frac{\lambda}{2}\|W\|_{2}^{2}$$
$$\nabla L(W) = \sum_{i=1}^{N} \frac{-y_{i}X_{i}\,\exp(-y_{i}W^{T}X_{i})}{1 + \exp(-y_{i}W^{T}X_{i})} + \lambda W$$
$$W_{t+1} = W_{t} - \eta_{t}\,\nabla L(W_{t})$$
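For intuition, here is a minimal sketch of applying this update rule directly with plain, fixed-step gradient descent on labels in {-1, +1}; the step size eta and number of steps are assumptions for illustration only, and the notebook itself instead minimizes the equivalent 0/1 cross-entropy form with scipy's TNC optimizer below.
def gradient_descent_lr(X, y_pm1, lam = 100, eta = 1e-4, n_steps = 500):
    #X: (m, d) design matrix with intercept column, y_pm1: labels in {-1, +1}
    W = np.zeros(X.shape[1])
    for _ in range(n_steps):
        margins = y_pm1 * X.dot(W)                            #y_i * W^T X_i
        #sigma(-margin) = exp(-margin) / (1 + exp(-margin)) = 1 / (1 + exp(margin))
        grad = -X.T.dot(y_pm1 / (1.0 + np.exp(margins))) + lam * W
        W = W - eta * grad                                    #W_{t+1} = W_t - eta * grad
    return W

#Example: labels for the "digit 0 versus all" problem would be 2*(trlab == 0) - 1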
#Logistic Regression
def sigmoid(z):
    return 1.0 / (1 + np.exp(-z))

#Regularized cost function (0/1 labels: cross-entropy plus ridge penalty)
def regCostFunction(theta, X, y, L = 100): # L is the lambda
    theta = theta.reshape(-1, 1)
    y = y.reshape(-1, 1)
    m = len(y)
    h = sigmoid(X.dot(theta))
    tmp = np.copy(theta)
    tmp[0] = 0                                   #do not regularize the bias term
    reg = (L / (2 * m)) * np.sum(tmp**2)
    cost = (1 / m) * (-y.T.dot(np.log(h)) - (1 - y).T.dot(np.log(1 - h)))
    return cost.item() + reg

#Regularized gradient function (must match regCostFunction for the optimizer)
def regGradient(theta, X, y, L = 100):
    m, n = X.shape
    theta = theta.reshape((n, 1))
    y = y.reshape((m, 1))
    h = sigmoid(X.dot(theta))
    tmp = np.copy(theta)
    tmp[0] = 0                                   #bias term is not penalized
    reg = L * tmp / m
    return (((1 / m) * X.T.dot(h - y)) + reg).flatten()

#Optimal theta via scipy's truncated Newton (TNC) optimizer
def logisticRegression(X, y, theta):
    result = op.minimize(fun = regCostFunction, x0 = theta, args = (X, y),
                         method = 'TNC', jac = regGradient, options={'maxiter': 4000})
    return result.x
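Before training, the analytic gradient can optionally be checked against a finite-difference approximation on a small subset; this is a sanity check I am adding here, not part of the original pipeline. scipy.optimize.check_grad returns the norm of the difference between the two gradients, which should be close to zero.
from scipy.optimize import check_grad

#Compare regGradient with a numerical gradient on 50 training examples of the
#"is this digit a zero?" problem; a value near 0 means the two agree.
y_check = np.array(trlab == 0, dtype = int)
print(check_grad(regCostFunction, regGradient, np.zeros(n + 1), X[:50], y_check[:50]))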
#Training
all_theta = np.zeros((k, n + 1))
#One vs all: train one binary classifier per digit
i = 0
for temp in lab:
    #set the labels to 1 for the current digit and 0 for everything else
    tmp_y = np.array(trlab == temp, dtype = int)
    optTheta = logisticRegression(X, tmp_y, (np.ones((n + 1, 1)) * -1))
    all_theta[i] = optTheta
    i += 1
With the chosen regularization strength $\lambda = 100$, the objective and update become:
$$L(W) = \sum_{i=1}^{N} \ln\left\{1 + \exp(-y_{i}W^{T}X_{i})\right\} + \frac{100}{2}\|W\|_{2}^{2}$$
$$\nabla L(W) = \sum_{i=1}^{N} \frac{-y_{i}X_{i}\,\exp(-y_{i}W^{T}X_{i})}{1 + \exp(-y_{i}W^{T}X_{i})} + 100\,W$$
$$W_{t+1} = W_{t} - \eta_{t}\,\nabla L(W_{t})$$
The accuracy graph below is plotted on a log scale, which has no representation for zero, so I plot the lambda = 0 point at 0.0001. I also capped the optimizer at 4000 iterations: after playing with different values of lambda, 4000 iterations seemed like a sweet spot between good accuracy and running time.
To avoid repeating near-identical code, I graphed the test accuracy for several values of lambda and then ran the final model only with the best value; a sketch of that sweep is shown just below, before the plotting code.
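A hedged sketch of the sweep that produced the accuracy numbers below; it assumes a small helper that re-runs the full training loop and test evaluation for a given lambda (the exact sweep code was not kept, only the resulting accuracies):
def accuracy_for_lambda(lam):
    #Re-train all ten one-vs-all classifiers with regularization strength lam
    #and return the resulting test accuracy in percent.
    thetas = np.zeros((k, n + 1))
    for idx, digit in enumerate(lab):
        y01 = np.array(trlab == digit, dtype = int)
        res = op.minimize(fun = regCostFunction, x0 = np.ones(n + 1) * -1,
                          args = (X, y01, lam), method = 'TNC',
                          jac = regGradient, options = {'maxiter': 4000})
        thetas[idx] = res.x
    probs = sigmoid(test.dot(thetas.T))
    preds = [lab[np.argmax(probs[row, :])] for row in range(test.shape[0])]
    return accuracy_score(tslab, preds) * 100

#Example usage (slow, since it retrains all ten classifiers per value):
#acc = [accuracy_for_lambda(val) for val in L]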
L = [0.0001,1,10,100,1000,10000]
acc = [85.62,83.91,83.91,87.77,74.45,73.96]
matplot.subplots(figsize=(10, 5))
matplot.semilogx(L, acc,'-gD' , label="Accuracy")
#matplot.xticks(L,L)
matplot.grid(True)
matplot.xlabel("Lambda")
matplot.ylabel("Test Accuracy")
matplot.legend()
matplot.title('Test accuracy versus the regularization value (log scale)')
matplot.show()
We end up with:
#Predictions
P = sigmoid(test.dot(all_theta.T))   #probability of each class for every test image
p = [lab[np.argmax(P[i, :])] for i in range(test.shape[0])]   #winning digit per image
print("Test Accuracy:", accuracy_score(tslab, p) * 100 , '%')
I believe the test accuracy could climb toward 90% given more optimizer iterations; I was running into time constraints.
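For a per-class view beyond overall accuracy, scikit-learn's classification_report can be used (a quick optional check on my part, not part of the original pipeline):
from sklearn.metrics import classification_report

#Precision, recall and F1 score for each digit class on the test set.
print(classification_report(tslab, p, digits = 3))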
# X-axis Predicted vs Y-axis Actual Values
matplot.subplots(figsize=(10, 6))
sb.heatmap(confusion_matrix(tslab, p), annot = True, fmt = 'g')
matplot.xlabel("Predicted")
matplot.ylabel("Actual")
matplot.title("Confusion Matrix")
matplot.show()
error_rate(confusion_matrix(tslab, p))
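As an optional follow-up (my own addition, not in the original analysis), the most frequent confusions can be listed programmatically by ranking the off-diagonal entries of the confusion matrix:
#Zero out the diagonal and list the largest remaining entries, i.e. the
#(actual, predicted) pairs that are confused most often.
cm = confusion_matrix(tslab, p)
off_diag = cm.copy()
np.fill_diagonal(off_diag, 0)
flat_order = np.argsort(off_diag, axis = None)[::-1][:5]
for actual, predicted in zip(*np.unravel_index(flat_order, off_diag.shape)):
    print('actual %d predicted as %d: %d times' % (actual, predicted, off_diag[actual, predicted]))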
We get better accuracy than Naive Bayes, but not as good as kNN. The confusion matrix and error rates above show where the likely misclassifications occur. Below, we also visualize the weights learned by each of the ten classifiers:
matplot.subplots(2,5, figsize=(24,10))
for i in range(10):
    l1 = matplot.subplot(2, 5, i + 1)
    #skip the bias weight at index 0 and reshape the 784 pixel weights to 28x28
    l1.imshow(all_theta[i][1:785].reshape(28, 28), interpolation='nearest', cmap=matplot.cm.RdBu)
    l1.set_xticks(())
    l1.set_yticks(())
    l1.set_xlabel('Class %i' % i)
matplot.suptitle('Images of the 784 weights for each of the 10 trained classifiers')
matplot.show()
Logistic regression performs better than Naive Bayes, but it is still not far ahead of the curve. Vectorising the functions makes classification a lot faster, yet the model is still not as reliable as I would like. Unlike Naive Bayes, it does not make premature independence assumptions about the features, but the choice of regularization and cost function has a large effect on how the model performs. Coding logistic regression was particularly challenging for me compared with the previous models.