We build a logistic regression classifier with ridge (L2) regularization. We test it on the MNIST data set by training ten one-versus-all classifiers (0 versus all, 1 versus all, ..., 9 versus all) in a loop over the digits. For each test image we then score it with every classifier and declare the class with the highest predicted probability the winner.
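As a toy sketch of that decision rule (the probabilities below are made up purely for illustration; the real scores are computed later from the trained weights):
import numpy as np

#Hypothetical one-vs-all probabilities for a single test image, one per digit 0-9.
probs = np.array([0.02, 0.01, 0.05, 0.10, 0.03, 0.04, 0.01, 0.85, 0.20, 0.02])
print(np.argmax(probs))   #the winning class: 7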
Approach for this method:
Importing Packages and Data (Note: data is already 0-1 normalized)
import os
import gzip
import math
import operator
import sklearn.model_selection
import random
import warnings
warnings.filterwarnings('ignore')
import matplotlib.pyplot as matplot
import matplotlib
%matplotlib inline
import pandas as pd
import numpy as np
import pickle as cPickle
import seaborn as sb
from time import time
from itertools import chain
from subprocess import check_output
from collections import Counter
from PIL import Image
from math import ceil
from scipy.stats import mode
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from pandas.plotting import parallel_coordinates
from scipy import optimize as op
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets('MNIST_data/')
train = mnist.train.images
validation = mnist.validation.images
test = mnist.test.images
trlab = mnist.train.labels
vallab = mnist.validation.labels
tslab = mnist.test.labels
train = np.concatenate((train, validation), axis=0)
trlab = np.concatenate((trlab, vallab), axis=0)
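Note that tensorflow.examples.tutorials.mnist ships only with TensorFlow 1.x. If that loader is unavailable, one possible substitute (an assumption on my part, not what was used here) is scikit-learn's fetch_openml, with an explicit 0-1 rescale to match the note above:
from sklearn.datasets import fetch_openml

#OpenML's mnist_784 returns pixel values in 0-255 and string labels;
#the conventional split puts the first 60000 rows in train and the rest in test.
mnist_raw = fetch_openml('mnist_784', as_frame = False)
images = mnist_raw.data.astype(np.float32) / 255.0
labels = mnist_raw.target.astype(int)
train, test = images[:60000], images[60000:]
trlab, tslab = labels[:60000], labels[60000:]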
Preparing data for the Logistic regression function
#Number of examples
m = train.shape[0]
#Features
n = train.shape[1]
#Number of classes
k = 10
lab = [0,1,2,3,4,5,6,7,8,9]
intercept = np.ones((train.shape[0], 1))
X = np.hstack((intercept, train))
intercept = np.ones((test.shape[0], 1))
test = np.hstack((intercept, test))
We prepend a column of 1s to each 784-dimensional feature vector (the pixel values) so that the first weight acts as the bias term of the classifier.
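In other words, with the constant 1 prepended, the bias is absorbed into the weight vector:
$$[1, x_{1}, \dots, x_{784}] \cdot [w_{0}, w_{1}, \dots, w_{784}]^{T} = w_{0} + \sum_{j=1}^{784} w_{j} x_{j}$$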
X.shape
def error_rate(confusion_matrix):
    #Per-class error rate: 1 - (correct predictions / total true examples of that class)
    a = confusion_matrix
    b = a.sum(axis=1)               #row sums = number of true examples per digit
    df = []
    for i in range(0, 10):
        temp = 1 - a[i][i] / b[i]   #fraction of digit i that was misclassified
        df.append(temp)
    df = pd.DataFrame(df)
    df.columns = ['% Error rate']
    return df * 100
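A quick illustration of error_rate on a synthetic confusion matrix (the counts are made up: 91 correct and 9 misclassified examples per digit, so every class should report a 9% error rate):
toy_cm = np.eye(10, dtype = int) * 90 + np.ones((10, 10), dtype = int)
error_rate(toy_cm)   #each row sums to 100 with 91 on the diagonal -> 9% error everywhere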
The objective, its gradient, and the resulting update rule for regularized logistic regression are given below (labels $y_{i} \in \{-1, +1\}$):
$$L(W) = \sum_{i=1}^{N} \ln\left\{1 + \exp(-y_{i}W^{T}X_{i})\right\} + \frac{\lambda}{2}\|W\|_{2}^{2}$$
$$\nabla L(W) = \sum_{i=1}^{N} \frac{-y_{i}X_{i}\,\exp(-y_{i}W^{T}X_{i})}{1 + \exp(-y_{i}W^{T}X_{i})} + \lambda W$$
$$W_{t+1} = W_{t} - \eta_{t}\,\nabla L(W_{t})$$
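For intuition, here is a minimal sketch of applying this update rule directly with plain, fixed-step gradient descent on labels in {-1, +1}; the step size eta and number of steps are assumptions for illustration only, and the notebook itself instead minimizes the equivalent 0/1 cross-entropy form with scipy's TNC optimizer below.
def gradient_descent_lr(X, y_pm1, lam = 100, eta = 1e-4, n_steps = 500):
    #X: (m, d) design matrix with intercept column, y_pm1: labels in {-1, +1}
    W = np.zeros(X.shape[1])
    for _ in range(n_steps):
        margins = y_pm1 * X.dot(W)                            #y_i * W^T X_i
        #sigma(-margin) = exp(-margin) / (1 + exp(-margin)) = 1 / (1 + exp(margin))
        grad = -X.T.dot(y_pm1 / (1.0 + np.exp(margins))) + lam * W
        W = W - eta * grad                                    #W_{t+1} = W_t - eta * grad
    return W

#Example: labels for the "digit 0 versus all" problem would be 2*(trlab == 0) - 1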
#Logistic Regression
def sigmoid(z):
    return 1.0 / (1 + np.exp(-z))

#Regularized cost function (0/1 labels: cross-entropy plus ridge penalty)
def regCostFunction(theta, X, y, L = 100): # L is the lambda
    theta = theta.reshape(-1, 1)
    y = y.reshape(-1, 1)
    m = len(y)
    h = sigmoid(X.dot(theta))
    tmp = np.copy(theta)
    tmp[0] = 0                                   #do not regularize the bias term
    reg = (L / (2 * m)) * np.sum(tmp**2)
    cost = (1 / m) * (-y.T.dot(np.log(h)) - (1 - y).T.dot(np.log(1 - h)))
    return cost.item() + reg

#Regularized gradient function (must match regCostFunction for the optimizer)
def regGradient(theta, X, y, L = 100):
    m, n = X.shape
    theta = theta.reshape((n, 1))
    y = y.reshape((m, 1))
    h = sigmoid(X.dot(theta))
    tmp = np.copy(theta)
    tmp[0] = 0                                   #bias term is not penalized
    reg = L * tmp / m
    return (((1 / m) * X.T.dot(h - y)) + reg).flatten()

#Optimal theta via scipy's truncated Newton (TNC) optimizer
def logisticRegression(X, y, theta):
    result = op.minimize(fun = regCostFunction, x0 = theta, args = (X, y),
                         method = 'TNC', jac = regGradient, options={'maxiter': 4000})
    return result.x
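Before training, the analytic gradient can optionally be checked against a finite-difference approximation on a small subset; this is a sanity check I am adding here, not part of the original pipeline. scipy.optimize.check_grad returns the norm of the difference between the two gradients, which should be close to zero.
from scipy.optimize import check_grad

#Compare regGradient with a numerical gradient on 50 training examples of the
#"is this digit a zero?" problem; a value near 0 means the two agree.
y_check = np.array(trlab == 0, dtype = int)
print(check_grad(regCostFunction, regGradient, np.zeros(n + 1), X[:50], y_check[:50]))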
#Training
all_theta = np.zeros((k, n + 1))
#One vs all: train one binary classifier per digit
i = 0
for temp in lab:
    #set the labels to 1 for the current digit and 0 for everything else
    tmp_y = np.array(trlab == temp, dtype = int)
    optTheta = logisticRegression(X, tmp_y, (np.ones((n + 1, 1)) * -1))
    all_theta[i] = optTheta
    i += 1
With the chosen regularization strength $\lambda = 100$, the objective and update become:
$$L(W) = \sum_{i=1}^{N} \ln\left\{1 + \exp(-y_{i}W^{T}X_{i})\right\} + \frac{100}{2}\|W\|_{2}^{2}$$
$$\nabla L(W) = \sum_{i=1}^{N} \frac{-y_{i}X_{i}\,\exp(-y_{i}W^{T}X_{i})}{1 + \exp(-y_{i}W^{T}X_{i})} + 100\,W$$
$$W_{t+1} = W_{t} - \eta_{t}\,\nabla L(W_{t})$$
The accuracy graph below is plotted on a log scale, which has no representation for zero, so I plot the lambda = 0 point at 0.0001. I also capped the optimizer at 4000 iterations: after playing with different values of lambda, 4000 iterations seemed like a sweet spot between good accuracy and running time.
To avoid repeating near-identical code, I graphed the test accuracy for several values of lambda and then ran the final model only with the best value; a sketch of that sweep is shown just below, before the plotting code.
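A hedged sketch of the sweep that produced the accuracy numbers below; it assumes a small helper that re-runs the full training loop and test evaluation for a given lambda (the exact sweep code was not kept, only the resulting accuracies):
def accuracy_for_lambda(lam):
    #Re-train all ten one-vs-all classifiers with regularization strength lam
    #and return the resulting test accuracy in percent.
    thetas = np.zeros((k, n + 1))
    for idx, digit in enumerate(lab):
        y01 = np.array(trlab == digit, dtype = int)
        res = op.minimize(fun = regCostFunction, x0 = np.ones(n + 1) * -1,
                          args = (X, y01, lam), method = 'TNC',
                          jac = regGradient, options = {'maxiter': 4000})
        thetas[idx] = res.x
    probs = sigmoid(test.dot(thetas.T))
    preds = [lab[np.argmax(probs[row, :])] for row in range(test.shape[0])]
    return accuracy_score(tslab, preds) * 100

#Example usage (slow, since it retrains all ten classifiers per value):
#acc = [accuracy_for_lambda(val) for val in L]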
L = [0.0001,1,10,100,1000,10000]
acc = [85.62,83.91,83.91,87.77,74.45,73.96]
matplot.subplots(figsize=(10, 5))
matplot.semilogx(L, acc,'-gD' , label="Accuracy")
#matplot.xticks(L,L)
matplot.grid(True)
matplot.xlabel("Lambda")
matplot.ylabel("Test Accuracy")
matplot.legend()
matplot.title('Test accuracy versus the regularization value (log scale)')
matplot.show()
We end up with:
#Predictions
P = sigmoid(test.dot(all_theta.T))   #probability of each class for every test image
p = [lab[np.argmax(P[i, :])] for i in range(test.shape[0])]   #winning digit per image
print("Test Accuracy:", accuracy_score(tslab, p) * 100 , '%')
I believe the test accuracy could climb toward 90% given more optimizer iterations; I was running into time constraints.
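For a per-class view beyond overall accuracy, scikit-learn's classification_report can be used (a quick optional check on my part, not part of the original pipeline):
from sklearn.metrics import classification_report

#Precision, recall and F1 score for each digit class on the test set.
print(classification_report(tslab, p, digits = 3))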
# X-axis Predicted vs Y-axis Actual Values
matplot.subplots(figsize=(10, 6))
sb.heatmap(confusion_matrix(tslab, p), annot = True, fmt = 'g')
matplot.xlabel("Predicted")
matplot.ylabel("Actual")
matplot.title("Confusion Matrix")
matplot.show()
error_rate(confusion_matrix(tslab, p))
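As an optional follow-up (my own addition, not in the original analysis), the most frequent confusions can be listed programmatically by ranking the off-diagonal entries of the confusion matrix:
#Zero out the diagonal and list the largest remaining entries, i.e. the
#(actual, predicted) pairs that are confused most often.
cm = confusion_matrix(tslab, p)
off_diag = cm.copy()
np.fill_diagonal(off_diag, 0)
flat_order = np.argsort(off_diag, axis = None)[::-1][:5]
for actual, predicted in zip(*np.unravel_index(flat_order, off_diag.shape)):
    print('actual %d predicted as %d: %d times' % (actual, predicted, off_diag[actual, predicted]))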
We get better accuracy than Naive Bayes, but not as good as kNN. The confusion matrix and error rates above show where the likely misclassifications occur. Below, we also visualize the weights learned by each of the ten classifiers:
matplot.subplots(2,5, figsize=(24,10))
for i in range(10):
    l1 = matplot.subplot(2, 5, i + 1)
    #skip the bias weight at index 0 and reshape the 784 pixel weights to 28x28
    l1.imshow(all_theta[i][1:785].reshape(28, 28), interpolation='nearest', cmap=matplot.cm.RdBu)
    l1.set_xticks(())
    l1.set_yticks(())
    l1.set_xlabel('Class %i' % i)
matplot.suptitle('Images of the 784 weights for each of the 10 trained classifiers')
matplot.show()
Logistic regression performs better than Naive Bayes, but it is still not far ahead of the curve. Vectorising the functions makes classification a lot faster, yet the model is still not as reliable as I would like. Unlike Naive Bayes, it does not make premature independence assumptions about the features, but the choice of regularization and cost function has a large effect on how the model performs. Coding logistic regression was particularly challenging for me compared with the previous models.