Dhairya Kothari

MNIST with Logistic Regression

We build a logistic regression classifier with ridge (L2) regularization. We test it on the MNIST data set by training ten one-vs-all classifiers (0 versus all, 1 versus all, 2 versus all, ..., 9 versus all) in a loop over the digits. At prediction time, we score an image with each classifier's weights and declare the digit with the highest score the winner.
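As a quick illustration, here is a minimal sketch of that decision rule (predict_digit, x, and all_theta are illustrative names; all_theta mirrors the weight matrix built later in the notebook):

import numpy as np

def sigmoid(z):
    return 1.0 / (1 + np.exp(-z))

# One-vs-all decision rule: score one bias-augmented image (shape (785,))
# against all 10 per-digit weight vectors (shape (10, 785)) and pick the
# digit whose classifier produces the highest probability.
def predict_digit(x, all_theta):
    scores = sigmoid(all_theta.dot(x))
    return int(np.argmax(scores))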

Approach for this method:

  • Preprocess the data and prepare it for the functions
    • Preprocessing includes 0-1 normalization (this data set is already normalized)
    • Add one column of 1s so that we capture the bias term
  • Define the sigmoid function
  • Define the cost function
  • Define the gradient descent
  • Define the logistic regression function
  • Train on the training data set (all 10 one-vs-all classifiers)
  • Try various values of Lambda, Eta, and Max-iterations to find near-optimal settings
  • Get the best test accuracy
  • Get the accuracy for each individual digit
  • Plot the weight images

Importing Packages and Data (Note: data is already 0-1 normalized)

In [20]:
import os
import gzip
import math
import operator
import sklearn.model_selection
import random
import warnings 
warnings.filterwarnings('ignore')

import matplotlib.pyplot as matplot
import matplotlib
%matplotlib inline
import pandas as pd
import numpy as np
import pickle as cPickle
import seaborn as sb

from time import time
from itertools import chain
from subprocess import check_output
from collections import Counter
from PIL import Image
from math import ceil
from scipy.stats import mode
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from pandas.tools.plotting import parallel_coordinates

from scipy import optimize as op
In [21]:
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets('MNIST_data/')
Extracting MNIST_data/train-images-idx3-ubyte.gz
Extracting MNIST_data/train-labels-idx1-ubyte.gz
Extracting MNIST_data/t10k-images-idx3-ubyte.gz
Extracting MNIST_data/t10k-labels-idx1-ubyte.gz
In [22]:
train = mnist.train.images
validation = mnist.validation.images
test = mnist.test.images

trlab = mnist.train.labels
vallab = mnist.validation.labels
tslab = mnist.test.labels

train = np.concatenate((train, validation), axis=0)
trlab = np.concatenate((trlab, vallab), axis=0)

Preparing data for the Logistic regression function

In [24]:
#Number of examples
m = train.shape[0]
#Features
n = train.shape[1]
#Number of classes
k = 10
lab = [0,1,2,3,4,5,6,7,8,9]

intercept = np.ones((train.shape[0], 1))
X = np.hstack((intercept, train))
intercept = np.ones((test.shape[0], 1))
test = np.hstack((intercept, test))

Adding a column of 1s to the 784-length feature vector (the pixel values) so that the classifier learns a bias term

In [104]:
X.shape
Out[104]:
(60000, 785)
In [60]:
def error_rate(confusion_matrix):
    # Per-class error rate: 1 - (correct predictions / total actual examples) for each digit
    a = confusion_matrix
    b = a.sum(axis=1)  # row sums = number of test examples per true class
    df = []
    for i in range(0,10):
        temp = 1-a[i][i]/b[i]
        df.append(temp)
    
    df = pd.DataFrame(df)
    df.columns = ['% Error rate']
    return df*100

Defining the sigmoid, cost, and gradient descent functions, then using them with op.minimize to fit the logistic regression

The loss, gradient, and update rule for the regularized logistic regression are given below

$$L(w) = \sum_{i=1}^{N} ln \left \{ 1 \ + exp(-y_{i}W^{T}X_{i})\right \} + \frac{\lambda}{2}|W|_{2}^{2}$$

$$\nabla L(w) = \sum_{i=1}^{N} \frac{-y_{i}X_{i} \ exp(-y_{i}W^{T}X_{i})}{1 \ + exp(-y_{i}W^{T}X_{i})} \ + \ \lambda W$$

$$W_{t+1} = W_{t} \ - \ \eta_{t} \nabla L(w)$$
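For intuition, here is a minimal sketch of this update rule as plain batch gradient descent, using the ±1 labels of the derivation (the notebook itself uses 0/1 labels and scipy's TNC optimizer instead; gradient_descent and its default eta, lam, and n_iters are illustrative):

import numpy as np

def sigmoid(z):
    return 1.0 / (1 + np.exp(-z))

def gradient_descent(X, y, eta=0.05, lam=100, n_iters=4000):
    # X: (m, d) with a leading bias column, y: (m,) labels in {-1, +1}.
    # With this summed (un-averaged) data term, a much smaller eta than 0.05
    # is usually needed in practice for the loop to stay stable.
    W = np.zeros(X.shape[1])
    for _ in range(n_iters):
        margins = y * X.dot(W)            # y_i * W^T X_i
        s = sigmoid(-margins)             # exp(-m_i) / (1 + exp(-m_i))
        grad = -X.T.dot(y * s) + lam * W  # gradient of L(w) above
        W = W - eta * grad                # W_{t+1} = W_t - eta_t * grad
    return W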

In [99]:
#Logistic Regression

def sigmoid(z):
    return 1.0 / (1 + np.exp(-z))

#Regularized cost function
def regCostFunction(theta, X, y, L = 100): # L is the lambda
    theta = theta.reshape(-1, 1)
    m = len(y)
    h = sigmoid(X.dot(theta))
    tmp = np.copy(theta)
    tmp[0] = 0  # do not regularize the bias term
    reg = (L/(2*m)) * np.sum(tmp**2)

    return (1 / m) * (-y.T.dot(np.log(h)) - (1 - y).T.dot(np.log(1 - h))) + reg

#Regularized gradient function
def regGradient(theta, X, y, L = 100):
    m, n = X.shape
    theta = theta.reshape((n, 1))
    y = y.reshape((m, 1))
    h = sigmoid(X.dot(theta))
    tmp = np.copy(theta)
    tmp[0] = 0  # do not regularize the bias term
    reg = L*tmp/m

    return ((0.05 / m) * X.T.dot(h - y)) + reg  # the 0.05 factor is the eta value applied to the data term

#Optimal theta 
def logisticRegression(X, y, theta):
    result = op.minimize(fun = regCostFunction, x0 = theta, args = (X, y),
                         method = 'TNC', jac = regGradient, options={'maxiter': 4000})
    # options={'gtol': 1e-3, 'disp': True, 'maxiter': 1000}
    return result.x

#Training

all_theta = np.zeros((k, n + 1))

#One vs all
i = 0
for temp in lab:
    #set the label to 1 for the current digit and 0 for all others
    tmp_y = np.array(trlab == temp, dtype = int)
    optTheta = logisticRegression(X, tmp_y, (np.ones((n + 1,1)) * -1))
    all_theta[i] = optTheta
    i += 1

The update equations with the chosen lambda value (λ = 100) substituted in

$$L(w) = \sum_{i=1}^{N} ln \left \{ 1 \ + exp(-y_{i}W^{T}X_{i})\right \} + \frac{100}{2}|W|_{2}^{2}$$

$$\nabla L(w) = \sum_{i=1}^{N} \frac{-y_{i}X_{i} \ exp(-y_{i}W^{T}X_{i})}{1 \ + exp(-y_{i}W^{T}X_{i})} \ + \ 100 W$$

$$W_{t+1} = W_{t} \ - \ \eta_{t} \nabla L(w)$$

Test accuracy versus the Lambda value (regularization)

The graph is plotted on a log scale, which has no representation for zero, so I use 0.0001 in place of 0 and plot accordingly. I also settled on 4000 max-iterations: after experimenting with different lambda values, 4000 iterations seemed like a sweet spot between accuracy and running time.

To avoid repeating near-identical code for similar values, I have plotted test accuracy against the various lambda values and re-ran only the optimal lambda; a sketch of how the full sweep could be looped is given below.
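For reference, here is a sketch of how that sweep could be looped (not run here to save time; it reuses the notebook's functions and passes the regularization strength through the args tuple of op.minimize):

# Sweep over lambda values and record the test accuracy for each;
# every run retrains all 10 one-vs-all classifiers.
lambdas = [0.0001, 1, 10, 100, 1000, 10000]
accuracies = []
for lam in lambdas:
    theta_lam = np.zeros((k, n + 1))
    for i, digit in enumerate(lab):
        tmp_y = np.array(trlab == digit, dtype = int)
        result = op.minimize(fun = regCostFunction, x0 = np.ones((n + 1, 1)) * -1,
                             args = (X, tmp_y, lam), method = 'TNC',
                             jac = regGradient, options={'maxiter': 4000})
        theta_lam[i] = result.x
    P_lam = sigmoid(test.dot(theta_lam.T))
    p_lam = [lab[np.argmax(P_lam[j, :])] for j in range(test.shape[0])]
    accuracies.append(accuracy_score(tslab, p_lam) * 100)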

In [106]:
L = [0.0001,1,10,100,1000,10000]
acc = [85.62,83.91,83.91,87.77,74.45,73.96]

matplot.subplots(figsize=(10, 5))
matplot.semilogx(L, acc,'-gD' , label="Accuracy")
#matplot.xticks(L,L)
matplot.grid(True)
matplot.xlabel("Lambda")
matplot.ylabel("Test Accuracy")
matplot.legend()
matplot.title('Test accuracy versus the regularization value(log-scale)')
matplot.show()

We end up with:

  • Lambda = 100
  • Max-iter = 4000
  • Eta = 0.05
This gives us a test accuracy of:
In [100]:
#Predictions
P = sigmoid(test.dot(all_theta.T)) #probability from each one-vs-all classifier
p = [lab[np.argmax(P[i, :])] for i in range(test.shape[0])] #pick the digit with the highest probability

print("Test Accuracy:", accuracy_score(tslab, p) * 100 , '%')
Test Accuracy: 87.77 %

I believe the test accuracy could rise to around 90% given enough iterations (I was running into time constraints).
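A minimal variant of the training call that would test this claim (the larger maxiter value is only illustrative):

# Same optimizer call as logisticRegression above, with a larger iteration budget.
def logisticRegressionLong(X, y, theta, max_iter = 20000):
    result = op.minimize(fun = regCostFunction, x0 = theta, args = (X, y),
                         method = 'TNC', jac = regGradient,
                         options={'maxiter': max_iter})
    return result.x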

In [107]:
# X-axis Predicted vs Y-axis Actual Values
matplot.subplots(figsize=(10, 6))
sb.heatmap(confusion_matrix(tslab, p), annot = True, fmt = 'g')
matplot.xlabel("Predicted")
matplot.ylabel("Actual")
matplot.title("Confusion Matrix")
matplot.show()
In [56]:
error_rate(confusion_matrix(tslab, p))
Out[56]:
Digit   % Error rate
0           3.571429
1           3.171806
2          18.023256
3          12.871287
4          10.896130
5          23.094170
6           7.098121
7          11.186770
8          18.275154
9          16.055500

We get better accuracy than Naive Bayes, but not as good as kNN. From the confusion matrix we can see the likely misclassifications (a short sketch after this list shows how to pull the most-confused pairs directly from the matrix):

  • 4 is misclassified as 9 a lot
  • The misclassification pattern is similar to Naive Bayes
  • Even so, 5 is misclassified the most often
  • Followed by 2
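Here is a small sketch of how the most-confused (actual, predicted) pairs can be read directly from the confusion matrix (the top-5 cutoff is arbitrary):

# Zero out the diagonal (correct predictions) and list the largest remaining cells.
cm = confusion_matrix(tslab, p)
off_diag = cm.copy()
np.fill_diagonal(off_diag, 0)
order = np.argsort(off_diag, axis=None)[::-1]  # largest counts first
for idx in order[:5]:
    actual, predicted = np.unravel_index(idx, off_diag.shape)
    print("%d misclassified as %d: %d times" % (actual, predicted, off_diag[actual, predicted]))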

Images of the 784 weights for each of the 10 trained classifiers

In [59]:
matplot.subplots(2,5, figsize=(24,10))
for i in range(10):
    l1 = matplot.subplot(2, 5, i + 1)
    l1.imshow(all_theta[i][1:785].reshape(28, 28), interpolation='nearest',cmap=matplot.cm.RdBu)
    l1.set_xticks(())
    l1.set_yticks(())
    l1.set_xlabel('Class %i' % i)
matplot.suptitle('Image of the 784 weights for each 10 trained classifiers')
matplot.show()

It is not as easy to read something meaningful out of these weights as it was with the Naive Bayes parameters.

Conclusion

Logistic regression does better than Naive Bayes, but it is still not far ahead of the curve. Although vectorizing the functions makes classification a lot faster, the model is still not that reliable. Unlike Naive Bayes, it does not make premature independence assumptions about the features, but the choice of regularization and cost function has a large effect on how the model performs. Coding logistic regression has been particularly challenging for me compared to the previous models.

---------------- End ----------------