Instruction
Please remove this section when submitting your homework.
Students are encouraged to work together on homework and/or utilize advanced AI tools. However, sharing, copying, or providing any part of a homework solution or code to others is an infraction of the University’s rules on Academic Integrity. Any violation will be punished as severely as possible. Final submissions must be uploaded to Gradescope. No email or hard copy will be accepted. For the late submission policy and grading rubrics, please refer to the course website.
- You are required to submit the rendered file HWx_yourNetID.pdf. For example, HW01_rqzhu.pdf. Please note that this must be a .pdf file; .html format cannot be accepted. Make all of your R code chunks visible for grading.
- Include your Name and NetID in the report.
- If you use this file or the example homework .Rmd file as a template, be sure to remove this instruction section.
- Make sure that you set the seed properly so that the results can be replicated if needed.
- For some questions, there will be restrictions on what
packages/functions you can use. Please read the requirements carefully.
As long as the question does not specify such restrictions, you can use
anything.
- When using AI tools, you are encouraged to document your experience with them, especially when they have difficulty grasping the idea of a question.
- On random seeds and reproducibility: Make sure your version of R is \(\geq 4.0.0\). This ensures that your random seed generation is the same as everyone else’s. Please note that updating the R version may require you to reinstall all of your packages.
Question 1: Linear SVM on Hand Written Digit Data
Load the MNIST dataset, the same way as HW5.
# read in the data
mnist <- read.csv("https://pjreddie.com/media/files/mnist_train.csv", nrows = 2000)
colnames(mnist) = c("Digit", paste0("Pixel", 1:784))
save(mnist, file = "mnist_first2000.RData")
# you can load the data with the following code
# load("mnist_first2000.RData")
dim(mnist)
## [1] 2000 785
- [15 pts] Since a standard SVM can only be used for binary classification problems, let’s fit an SVM on digits 1 and 2. Complete the following tasks.
- Use the 1 and 2 digits in the first 1000 observations as training
data and those in the remaining part as testing data.
- Fit a linear SVM on the training data using the
e1071
package. Set the cost parameter \(C =
1\).
- You will possibly encounter two issues: first, this is very slow (unless your computer is very powerful); second, the package will complain about some pixels being problematic (zero variance). Hence, reducing the number of variables by removing pixels with low variance is probably a good idea. Perform a marginal variance screening on the pixels and select the top 300 pixels with the highest variance.
- Redo your SVM model with the pixels you have selected. Report the
training and testing errors.
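One possible sketch of part a) is below. This is not the required solution; it assumes the mnist object loaded above, and the variable names (train, test, top300, svm.fit) are illustrative.

```r
library(e1071)

# training: digits 1 and 2 within the first 1000 observations; testing: the rest
train <- mnist[1:1000, ]
train <- train[train$Digit %in% c(1, 2), ]
test <- mnist[1001:2000, ]
test <- test[test$Digit %in% c(1, 2), ]

# marginal variance screening: top 300 pixels by training variance
pixvar <- apply(train[, -1], 2, var)
top300 <- names(sort(pixvar, decreasing = TRUE))[1:300]

# linear SVM with C = 1 on the screened pixels
svm.fit <- svm(x = train[, top300], y = as.factor(train$Digit),
               kernel = "linear", cost = 1, scale = FALSE)

# training and testing errors
mean(predict(svm.fit, train[, top300]) != train$Digit)
mean(predict(svm.fit, test[, top300]) != test$Digit)
```

Setting scale = FALSE here keeps the coefficients on the raw pixel scale, which simplifies part b); with the default scale = TRUE the coefficients would refer to standardized pixels.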
- [15 pts] Some researchers might be interested in knowing which pixels are more important in distinguishing the two digits. One way to do this is to calculate the coefficients of the linear SVM model. Complete the following tasks.
- Extract the coefficients of the linear SVM model you have fitted in
part a).
- Find the top 10 pixels with the largest (absolute)
coefficients.
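A possible way to recover the coefficients, assuming an e1071 model object named svm.fit from part a). For a linear kernel, e1071 stores \(\alpha_i y_i\) in `coefs` and the support vectors in `SV`, so the weight vector is their product.

```r
# weight vector beta and intercept beta_0 of the linear SVM
# note: if the model was fitted with scale = TRUE (the default), SV is on
# the scaled pixel scale; refit with scale = FALSE for raw-pixel weights
beta <- t(svm.fit$coefs) %*% svm.fit$SV  # 1 x 300
beta0 <- -svm.fit$rho                    # intercept

# top 10 pixels by absolute coefficient
colnames(svm.fit$SV)[order(abs(beta), decreasing = TRUE)[1:10]]
```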
- [10 pts] Perform Principal Component Analysis (PCA) on the training
data.
- Plot the data on the first two principal components. Color the
points by their digits.
- Plot the first principal component against the linear separation rule \(x^T \beta + \beta_0\) of the linear SVM model. Are they similar? Note, however, that these two methods are completely different: one is supervised learning, the other is unsupervised. Can you explain why?
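A sketch of part c), assuming the training data train, selected pixel names top300, and fitted model svm.fit from part a) (all illustrative names):

```r
# PCA on the screened training pixels
pc <- prcomp(train[, top300])

# first two principal components, colored by digit
plot(pc$x[, 1], pc$x[, 2],
     col = ifelse(train$Digit == 1, "darkorange", "deepskyblue"),
     pch = 19, xlab = "PC1", ylab = "PC2")

# PC1 against the SVM separation rule x^T beta + beta_0
svm.dec <- attr(predict(svm.fit, train[, top300], decision.values = TRUE),
                "decision.values")
plot(pc$x[, 1], svm.dec, xlab = "PC1", ylab = "SVM decision value")
```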
- [10 pts] Perform a logistic regression with an elastic net penalty (\(\alpha = 0.5\)) on the training data.
- Use the same 300 pixels you have selected in part a). Tune the
penalty parameter \(\lambda\) using
10-fold cross validation.
- Plot the linear link function of the logistic regression model
against the linear separation rule \(x^T \beta
+ \beta_0\) of the linear SVM model. Are they similar?
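One way to carry this out with glmnet, again assuming train, top300, and svm.fit from part a) (illustrative names):

```r
library(glmnet)
set.seed(1)

# 10-fold CV for the elastic-net penalized logistic regression
x.train <- as.matrix(train[, top300])
cv.fit <- cv.glmnet(x = x.train, y = as.factor(train$Digit),
                    family = "binomial", alpha = 0.5, nfolds = 10)

# linear link eta = x^T beta + beta_0 at the selected lambda
eta <- predict(cv.fit, newx = x.train, s = "lambda.min", type = "link")

# compare with the SVM linear separation rule
svm.dec <- attr(predict(svm.fit, train[, top300], decision.values = TRUE),
                "decision.values")
plot(svm.dec, eta, xlab = "SVM decision value", ylab = "Logistic link")
```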
Question 2: Multi-class SVM
[25 pts] Our current SVM is only applicable to binary classification problems. In this question, we will extend it to multi-class classification problems. A simple idea is called one-vs-one (OVO) classification. For example, if we have 3 classes, we can fit 3 SVMs, each trained on a different pair of classes. For a new observation, we feed it to each of the 3 SVMs to obtain three predictions, and then use the majority vote to determine its class. Carry out this approach using digits 1, 6, and 7 in our MNIST data. You still need to select the top pixels with the highest variance to avoid unnecessary warnings, but in this question, use only 100 pixels. For all models, keep the cost parameter \(C = 1\).
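The OVO scheme can be sketched as follows. This assumes the mnist object from Question 1 and the same first-1000/rest train/test split; all variable names are illustrative.

```r
library(e1071)

digits <- c(1, 6, 7)
ovo.train <- mnist[1:1000, ]
ovo.train <- ovo.train[ovo.train$Digit %in% digits, ]
ovo.test <- mnist[1001:2000, ]
ovo.test <- ovo.test[ovo.test$Digit %in% digits, ]

# top 100 pixels by training variance
top100 <- names(sort(apply(ovo.train[, -1], 2, var), decreasing = TRUE))[1:100]

# one linear SVM (C = 1) per pair of classes
class.pairs <- combn(digits, 2)
votes <- matrix("", nrow(ovo.test), ncol(class.pairs))
for (k in 1:ncol(class.pairs)) {
  sub <- ovo.train[ovo.train$Digit %in% class.pairs[, k], ]
  fit <- svm(x = sub[, top100], y = as.factor(sub$Digit),
             kernel = "linear", cost = 1)
  votes[, k] <- as.character(predict(fit, ovo.test[, top100]))
}

# majority vote across the three pairwise predictions
ovo.pred <- apply(votes, 1, function(v) names(which.max(table(v))))
mean(ovo.pred != ovo.test$Digit)  # test error
```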
Question 3: Nonlinear SVM
[25 pts] Load the spam dataset from the kernlab package. This is a classification example consisting of 4,601 instances and 57 features. The response variable is whether an email is spam or not. Use a nonlinear SVM with the Radial Basis Function (RBF) kernel. Evaluate the performance of the trained model. Complete the following tasks.
- Load the spam dataset from the kernlab package and split it into training (70%) and testing (30%) sets. Set a seed so the split is reproducible.
- Fit a nonlinear SVM with the RBF kernel on the training data. Tune the cost parameter \(C\) and the kernel parameter \(\sigma\) using 10-fold cross validation. You should consider the caret package for this task. You may need to experiment with a few different values of \(C\) and \(\sigma\) to get a good model, but do not use more than 9 different combinations overall since this can be very slow.
- Evaluate the performance of your final trained model on the testing data. Report the accuracy and the ROC curve. For using the ROC curve, see our lecture note in Week 5. You can consider using the ROCR package for this task. For the scalar predictor, use the fitted values \(x^T \beta + \beta_0\) from the SVM model.
- Compare your results with a penalized logistic regression model
using both the accuracy and the ROC curve.
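The whole pipeline might look like the sketch below. The grid values for \(C\) and \(\sigma\) are placeholders to experiment with, and the decision-value extraction from the caret finalModel (a kernlab ksvm object) may need adjusting; in particular, flip the sign of the decision values if the resulting ROC curve falls below the diagonal.

```r
library(kernlab)
library(caret)
library(ROCR)
data(spam)

set.seed(1)
idx <- sample(nrow(spam), floor(0.7 * nrow(spam)))
spam.train <- spam[idx, ]
spam.test <- spam[-idx, ]

# tune C and sigma over a small grid (at most 9 combinations)
grid <- expand.grid(C = c(0.5, 1, 2), sigma = c(0.005, 0.01, 0.05))
rbf.fit <- train(type ~ ., data = spam.train, method = "svmRadial",
                 tuneGrid = grid,
                 trControl = trainControl(method = "cv", number = 10))

# test accuracy
mean(predict(rbf.fit, spam.test) == spam.test$type)

# ROC curve using the decision values x^T beta + beta_0
dec <- predict(rbf.fit$finalModel, as.matrix(spam.test[, -58]),
               type = "decision")
roc <- performance(prediction(as.numeric(dec), spam.test$type),
                   "tpr", "fpr")
plot(roc)
```

A penalized logistic regression for the comparison can be fitted with cv.glmnet(family = "binomial") on the same split, using type = "link" predictions as its scalar score for ROCR.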