Describe how the multi-class classification is different for SVC and LinearSVC. Be explicit, don't just describe what's in the documentation. For example, what does 'one-against-one' and 'one-vs-the-rest' mean?
In the one-against-one (OvO) scheme, we train one binary classifier for every pair of classes: each classifier sees only the training samples of its two classes and learns to separate those two labels, so an N-class problem needs N(N-1)/2 classifiers, and at prediction time the class that wins the most pairwise votes gives the final label. In the one-vs-the-rest (OvR) scheme, we train one classifier per class, with the samples of that class labelled positive and all remaining samples labelled negative; repeating this N times gives an N-class classifier, and the final label is the class whose classifier returns the highest decision score. Either way, multi-class classification is reduced to a set of binary classification problems. The main multi-class difference between SVC and LinearSVC is that SVC uses the one-vs-one approach while LinearSVC uses one-vs-the-rest. Two other clear differences: SVC offers different kernels (rbf, poly, etc.) while LinearSVC only produces a linear margin of separation, and SVC's default iteration limit is unbounded while LinearSVC caps it at 1000.
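To make the classifier counts concrete, here is a small sketch on toy data (the blob data and variable names are mine, not part of the assignment): with N = 4 classes, SVC's one-vs-one scheme produces N(N-1)/2 = 6 pairwise decision scores, while LinearSVC's one-vs-the-rest scheme fits N = 4 classifiers, one coefficient row per class.

```python
import numpy as np
from sklearn.svm import SVC, LinearSVC

# toy 4-class data: four Gaussian blobs spread along a line
rng = np.random.RandomState(0)
X = rng.randn(200, 2) + 3 * np.repeat(np.arange(4), 50)[:, None]
y = np.repeat(np.arange(4), 50)

# one-vs-one: one column per pair of classes -> 4*3/2 = 6 pairwise scores
ovo = SVC(kernel='linear', decision_function_shape='ovo').fit(X, y)
print(ovo.decision_function(X[:1]).shape)   # (1, 6)

# one-vs-the-rest: one binary classifier (one coefficient row) per class
ovr = LinearSVC(dual=False).fit(X, y)
print(ovr.coef_.shape)                      # (4, 2)
```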
from scipy.stats import mode
import numpy as np
#from mnist import MNIST
from time import time
import pandas as pd
import os
import matplotlib.pyplot as matplot
import matplotlib
%matplotlib inline
import random
matplot.rcdefaults()
from IPython.display import display, HTML
from itertools import chain
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
import seaborn as sb
from sklearn.model_selection import ParameterGrid
from sklearn.svm import SVC, LinearSVC
import warnings
warnings.filterwarnings('ignore')
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets('MNIST_data/')
train = mnist.train.images
validation = mnist.validation.images
test = mnist.test.images
trlab = mnist.train.labels
vallab = mnist.validation.labels
tslab = mnist.test.labels
train = np.concatenate((train, validation), axis=0)
trlab = np.concatenate((trlab, vallab), axis=0)
We save a lot of compute time by folding the validation set into the training set instead of tuning on it separately, and we don't lose any significant amount of accuracy.
Running a sample LinearSVC classifier with default values to see how the model does on the MNIST data.
svm = LinearSVC(dual=False)
svm.fit(train, trlab)
svm.coef_
svm.intercept_
pred = svm.predict(test)
accuracy_score(tslab, pred) # Accuracy
cm = confusion_matrix(tslab, pred)
matplot.subplots(figsize=(10, 6))
sb.heatmap(cm, annot = True, fmt = 'g')
matplot.xlabel("Predicted")
matplot.ylabel("Actual")
matplot.title("Confusion Matrix")
matplot.show()
As we can see, the SVM does a pretty decent job of classifying. We still get the usual misclassifications between 5-8, 2-8, 5-3, and 4-9, but an accuracy of 91.82% is good.
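Those confused pairs can also be pulled out of the confusion matrix programmatically. A small sketch (the helper name and the tiny example matrix are mine): zero the diagonal, then take the largest remaining (actual, predicted) cells.

```python
import numpy as np

def most_confused_pairs(cm, k=3):
    """Return the k (actual, predicted) off-diagonal cells with the most errors."""
    off = cm.copy().astype(int)
    np.fill_diagonal(off, 0)                      # ignore correct predictions
    flat = np.argsort(off, axis=None)[::-1][:k]   # largest error counts first
    return [tuple(np.unravel_index(f, off.shape)) for f in flat]

# tiny 3-class example: class 1 is mistaken for class 2 most often
cm = np.array([[50, 1, 2],
               [0, 45, 8],
               [3, 1, 48]])
print(most_confused_pairs(cm, k=2))   # [(1, 2), (2, 0)]
```

Running it on the MNIST confusion matrix above would surface the 5-8 and 4-9 style confusions directly.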
acc = []
acc_tr = []
coefficient = []
for c in [0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000, 10000]:
    svm = LinearSVC(dual=False, C=c)
    svm.fit(train, trlab)
    coef = svm.coef_
    p_tr = svm.predict(train)
    a_tr = accuracy_score(trlab, p_tr)
    pred = svm.predict(test)
    a = accuracy_score(tslab, pred)
    coefficient.append(coef)
    acc_tr.append(a_tr)
    acc.append(a)
c = [0.0001,0.001,0.01,0.1,1,10,100,1000,10000]
matplot.subplots(figsize=(10, 5))
matplot.semilogx(c, acc,'-gD' ,color='red' , label="Testing Accuracy")
matplot.semilogx(c, acc_tr,'-gD' , label="Training Accuracy")
#matplot.xticks(L,L)
matplot.grid(True)
matplot.xlabel("Cost Parameter C")
matplot.ylabel("Accuracy")
matplot.legend()
matplot.title('Accuracy versus the Cost Parameter C (log-scale)')
matplot.show()
We clearly see a bias-variance trade-off in the graph. As the cost increases, training accuracy increases and so does test accuracy, but only until C=1; beyond that we see overfitting. From C=10 to 1000 the model overfits, giving low bias and high variance.
So as we go from left to right: bias decreases and variance increases.
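To see why a larger C pushes toward low bias and high variance: LinearSVC minimizes 0.5*||w||^2 + C * sum(max(0, 1 - y_i * f(x_i))), so a large C makes margin violations expensive relative to the regularizing margin term. A tiny sketch of that objective on made-up numbers (the weights, bias, and data points below are mine, chosen for illustration):

```python
import numpy as np

def primal_objective(w, b, X, y, C):
    """0.5*||w||^2 + C * sum of hinge losses: the binary soft-margin SVM objective."""
    margins = y * (X @ w + b)           # y in {-1, +1}
    hinge = np.maximum(0, 1 - margins)  # zero for points beyond the margin
    return 0.5 * w @ w + C * hinge.sum()

w = np.array([1.0, -2.0]); b = 0.5
X = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 1.0]])
y = np.array([1, -1, 1])

# only the third point violates the margin (margin 0.5 -> hinge 0.5), so the
# data term grows linearly with C while the margin term 0.5*||w||^2 stays fixed
print(primal_objective(w, b, X, y, C=1.0))    # 3.0
print(primal_objective(w, b, X, y, C=10.0))   # 7.5
```

With C large, the optimizer bends the boundary to fit every training point (low bias, high variance); with C small, it keeps a wide margin at the cost of training errors.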
svm_coef = coefficient[4]
svm_coef.shape
matplot.subplots(2,5, figsize=(24,10))
for i in range(10):
    l1 = matplot.subplot(2, 5, i + 1)
    l1.imshow(svm_coef[i].reshape(28, 28), cmap=matplot.cm.RdBu)
    l1.set_xticks(())
    l1.set_yticks(())
    l1.set_xlabel('Class %i' % i)
matplot.suptitle('Class Coefficients')
matplot.show()
These images look nothing like the ones we saw for Logistic Regression or Naive Bayes. In Naive Bayes the underlying digit was clearly visible, and in Logistic Regression the pattern was quite distinct between classes. Here, however, there are no apparent patterns or distinctness, and the classes are really hard to differentiate.
acc = []
acc_tr = []
coefficient = []
for c in [0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000, 10000]:
    svm = LinearSVC(dual=False, C=c, penalty='l1')
    svm.fit(train, trlab)
    coef = svm.coef_
    p_tr = svm.predict(train)
    a_tr = accuracy_score(trlab, p_tr)
    pred = svm.predict(test)
    a = accuracy_score(tslab, pred)
    coefficient.append(coef)
    acc_tr.append(a_tr)
    acc.append(a)
c = [0.0001,0.001,0.01,0.1,1,10,100,1000,10000]
matplot.subplots(figsize=(10, 5))
matplot.semilogx(c, acc,'-gD' ,color='red' , label="Testing Accuracy")
matplot.semilogx(c, acc_tr,'-gD' , label="Training Accuracy")
#matplot.xticks(L,L)
matplot.grid(True)
matplot.xlabel("Cost Parameter C")
matplot.ylabel("Accuracy")
matplot.legend()
matplot.title('Accuracy versus the Cost Parameter C (log-scale)')
matplot.show()
Almost exactly the same picture, with one slight difference, is observed here as well. We see a bias-variance trade-off in the graph: as the cost increases, training accuracy increases and so does test accuracy, but only until C=1; from C=10 to 1000 the model overfits, giving low bias and high variance. The only difference is that with the L1 penalty the overfitting is milder, while the model performs really poorly at small cost values.
Again, as we go from left to right: bias decreases and variance increases.
svm_coef = coefficient[4]
svm_coef.shape
matplot.subplots(2,5, figsize=(24,10))
for i in range(10):
    l1 = matplot.subplot(2, 5, i + 1)
    l1.imshow(svm_coef[i].reshape(28, 28), cmap=matplot.cm.RdBu)
    l1.set_xticks(())
    l1.set_yticks(())
    l1.set_xlabel('Class %i' % i)
matplot.suptitle('Class Coefficients')
matplot.show()
This mirrors my view of LinearSVC with the default L2 penalty: these images only vaguely resemble the original digits, and they also differ from what we saw in Logistic Regression or Naive Bayes. In Naive Bayes the underlying digit was clearly visible, while in Logistic Regression the pattern was quite distinct between classes. Here there are few apparent patterns and the classes are hard to differentiate, although some digits like 0, 5, 6, and 8 are still interpretable.
Another important observation: these images also look different from their L2 siblings, not by a large margin, but different nonetheless.
We choose sampling because fitting this many models on the full data would take far too long. So, keeping the time constraint in mind, I sample 10% of the data.
# sample 6,000 of the 60,000 training indices without replacement
seq = np.random.choice(60000, 6000, replace=False)
train_samp = train[seq]
trlab_samp = trlab[seq]
train_samp.shape
trlab_samp.shape
seq = np.random.choice(10000, 1000, replace=False)
test_samp = test[seq]
tslab_samp = tslab[seq]
test_samp.shape
tslab_samp.shape
# compare the label distribution of the sample against the full training set
fig, ax = matplot.subplots(1, 2, figsize=(10, 4))
ax[0].hist(trlab_samp)
ax[1].hist(trlab)
matplot.show()
coefficient = []
n_supp = []
sup_vec = []
i = 0
df = pd.DataFrame(columns = ['c','gamma','train_acc','test_acc'])
for c in [0.01, 0.1, 1, 10, 100]:
    for g in [0.01, 0.1, 1, 10, 100]:
        svm = SVC(kernel='rbf', C=c, gamma=g)
        model = svm.fit(train_samp, trlab_samp)
        globals()['model%s' % i] = model  # keep each fitted model around as model0, model1, ...
        d_coef = svm.dual_coef_
        support = svm.n_support_
        sv = svm.support_
        p_tr = svm.predict(train_samp)
        a_tr = accuracy_score(trlab_samp, p_tr)
        pred = svm.predict(test_samp)
        a = accuracy_score(tslab_samp, pred)
        coefficient.append(d_coef)
        n_supp.append(support)
        sup_vec.append(sv)
        df.loc[i] = [c, g, a_tr, a]
        i = i + 1
df
Comment on the bias and variance of the SVC classifier with respect to C and gamma. Comment on the results overall in comparison to LinearSVC. What values would you choose?
We see a bias-variance trade-off in the table. As the cost and gamma increase, training accuracy rises while test accuracy falls, i.e. the model overfits: low bias and high variance. Interestingly, holding the cost constant and increasing gamma causes immediate overfitting. The cost behaves the same as it did for LinearSVC; the only difference is that the best model performance here is at C=10 and gamma=0.01.
So as we increase the cost: bias decreases and variance increases.
So as we increase the gamma: bias decreases and variance increases.
pd.DataFrame(coefficient[15]) # dual_coef_
The support vectors identified by SVC each belong to one class (0 to 9), and in dual_coef_ they are grouped and ordered by that class. Since each support vector belongs to exactly one class, it can be involved in at most n_classes - 1 one-vs-one problems, namely the comparisons of its own class against each of the others, though a given support vector need not be active in all of them. dual_coef_ stores the weight of each support vector in those one-vs-one problems: it has n_classes - 1 = 9 rows, ordered following the unique classes shown above, and one column per support vector (2477 here).
pd.DataFrame(n_supp[15]) # n_support_
"nsupport" divides the number of support vestors by the class. So we can say that when class 0 has 180 support vectors, it means 180 are the positive support vectors and rest all are the negative support vectors for 0-versus-rest classifier.
ind = 0
matplot.subplots(2,5, figsize=(24,10))
for i in range(len(n_supp[15])):
    l1 = matplot.subplot(2, 5, i + 1)
    # first support vector belonging to class i
    sv_image = train_samp[sup_vec[15][ind:ind + n_supp[15][i]]][0]
    l1.imshow(sv_image.reshape(28, 28), cmap=matplot.cm.RdBu)
    l1.set_xticks(())
    l1.set_yticks(())
    l1.set_xlabel('Class %i vs All' % i)
    ind = ind + n_supp[15][i]
matplot.suptitle('Support Vectors for Positive Classes')
matplot.show()
ind = n_supp[15][0]
matplot.subplots(2,5, figsize=(24,10))
for i in range(len(n_supp[15]) - 1):
    l1 = matplot.subplot(2, 5, i + 1)
    sv_image = train_samp[sup_vec[15][ind:ind + n_supp[15][i + 1]]][100]
    l1.imshow(sv_image.reshape(28, 28), cmap=matplot.cm.RdBu)
    l1.set_xticks(())
    l1.set_yticks(())
    l1.set_xlabel('Class %i vs All' % i)
    ind = ind + n_supp[15][i + 1]
ind = 0
l1 = matplot.subplot(2, 5, 10)
sv_image = train_samp[sup_vec[15][ind:ind+n_supp[15][0]]][100]
l1.imshow(sv_image.reshape(28, 28), cmap=matplot.cm.RdBu)
l1.set_xticks(())
l1.set_yticks(())
l1.set_xlabel('Class 9 vs All')
matplot.suptitle('Support Vectors for Negative Classes')
matplot.show()
coefficient = []
n_supp = []
sup_vec = []
i = 0
df = pd.DataFrame(columns = ['c','degree','train_acc','test_acc'])
for c in [0.01, 0.1, 1, 10, 100]:
    for d in [2, 3, 4, 5, 6]:
        svm = SVC(kernel='poly', C=c, degree=d)
        model = svm.fit(train_samp, trlab_samp)
        globals()['model%s' % i] = model  # keep each fitted model around as model0, model1, ...
        d_coef = svm.dual_coef_
        support = svm.n_support_
        sv = svm.support_
        p_tr = svm.predict(train_samp)
        a_tr = accuracy_score(trlab_samp, p_tr)
        pred = svm.predict(test_samp)
        a = accuracy_score(tslab_samp, pred)
        coefficient.append(d_coef)
        n_supp.append(support)
        sup_vec.append(sv)
        df.loc[i] = [c, d, a_tr, a]
        i = i + 1
df
Comment on the bias and variance of the SVC classifier with respect to C and gamma. Comment on the results overall in comparison to LinearSVC. What values would you choose?
We also see that the polynomial kernel behaves very erratically. The reason is that a change in cost or degree affects the entire polynomial decision surface, rather than having the localised effect that a change in gamma has with the rbf kernel; hence the poly kernel is less stable than rbf.
We see a bias-variance trade-off in the table. As the cost and degree increase, training accuracy rises while test accuracy falls, i.e. the model overfits: low bias and high variance. Interestingly, holding the cost constant and increasing the degree causes immediate overfitting. The cost behaves the same as it did for LinearSVC; the only difference is that the best model performance here is at C=100 and degree=2.
So as we increase the cost: bias decreases and variance increases.
So as we increase the degree: bias decreases and variance increases.
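For reference, the two kernels being compared can be checked against scikit-learn's pairwise helpers (a small sketch; the points X, Y and the parameter values are mine). The polynomial kernel (gamma * <x, y> + coef0)^degree is a global function of the dot product, while the rbf kernel exp(-gamma * ||x - y||^2) decays with distance, which is the "localised" behaviour noted above.

```python
import numpy as np
from sklearn.metrics.pairwise import polynomial_kernel, rbf_kernel

X = np.array([[1.0, 2.0], [0.0, 1.0]])
Y = np.array([[2.0, 0.5]])
gamma, coef0, degree = 0.5, 1.0, 3

# polynomial: (gamma * <x, y> + coef0) ** degree
manual_poly = (gamma * X @ Y.T + coef0) ** degree
assert np.allclose(manual_poly,
                   polynomial_kernel(X, Y, degree=degree, gamma=gamma, coef0=coef0))

# rbf: exp(-gamma * ||x - y||^2)
sqdist = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
manual_rbf = np.exp(-gamma * sqdist)
assert np.allclose(manual_rbf, rbf_kernel(X, Y, gamma=gamma))

print(manual_poly.ravel(), manual_rbf.ravel())
```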
pd.DataFrame(coefficient[20]) # dual_coef_
pd.DataFrame(n_supp[20]) # n_support_
ind = 0
matplot.subplots(2,5, figsize=(24,10))
for i in range(len(n_supp[20])):
l1 = matplot.subplot(2, 5, i + 1)
sv_image = train_samp[sup_vec[20][ind:ind+n_supp[20][i]]][0]
l1.imshow(sv_image.reshape(28, 28), cmap=matplot.cm.RdBu)
l1.set_xticks(())
l1.set_yticks(())
l1.set_xlabel('Class %i vs All' % i)
ind = ind + n_supp[20][i]
matplot.suptitle('Support Vectors for Positive Classes')
matplot.show()
ind = n_supp[20][0]
matplot.subplots(2,5, figsize=(24,10))
for i in range(len(n_supp[20])-1):
l1 = matplot.subplot(2, 5, i + 1)
sv_image = train_samp[sup_vec[20][ind:ind+n_supp[20][i+1]]][100]
l1.imshow(sv_image.reshape(28, 28), cmap=matplot.cm.RdBu)
l1.set_xticks(())
l1.set_yticks(())
l1.set_xlabel('Class %i vs All' % i)
ind = ind + n_supp[20][i+1]
ind = 0
l1 = matplot.subplot(2, 5, 10)
sv_image = train_samp[sup_vec[20][ind:ind+n_supp[20][0]]][100]
l1.imshow(sv_image.reshape(28, 28), cmap=matplot.cm.RdBu)
l1.set_xticks(())
l1.set_yticks(())
l1.set_xlabel('Class 9 vs All')
matplot.suptitle('Support Vectors for Negative Classes')
matplot.show()
Linear SVC (best performance): 92 %
SVC rbf (best performance): 96.4 %
SVC poly (best performance): 94.5 %
Logistic regression (prev assignment): 89 %
Naive Bayes (prev assignment): 81 %
It is clear that the SVM with the 'rbf' kernel gives the best result among all these models.