Students are encouraged to work together on homework. However, sharing, copying, or providing any part of a homework solution or code is an infraction of the University’s rules on Academic Integrity. Any violation will be punished as severely as possible. Final submissions must be uploaded to compass2g. No email or hardcopy will be accepted. For late submission policy and grading rubrics, please refer to the course website.
What is expected for the submission to Gradescope

- Name your file HWx_yourNetID.pdf. For example, HW01_rqzhu.pdf. Please note that this must be a .pdf file generated by a .Rmd file; .html format cannot be accepted.
- Please note that your homework file is a PDF report instead of a messy collection of R code. This report should include:
  - Your name and NetID in the title (replace Ruoqing Zhu (rqzhu) by your name and NetID if you are using this template).
  - All of your R code chunks visible for grading.
  - Answers to the questions with the necessary R code chunks that support them, e.g., Answer: I fit SVM with the following choice of tuning parameters ...
Requirements regarding the .Rmd file

- Your PDF report should be rendered directly from your .Rmd file.

In this HW, we will code both the primal and dual forms of SVM and utilize a general quadratic programming solver (the quadprog package) to help us obtain the solution.
Install the quadprog package. The same package is also available in Python; however, make sure to read its documentation carefully. We will utilize the function solve.QP() to solve SVM. This function performs the minimization problem: \[\begin{align}
\text{minimize} & \quad \frac{1}{2} b^T \mathbf{D} b - d^T b, \nonumber \\
\text{subject to} & \quad \mathbf{A}^T b \geq b_0, \nonumber
\end{align}\] where \(b\) is the unknown parameter. For more details, read the documentation of the package on CRAN. Use the provided training data; this is a linearly separable problem.
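To get familiar with the solve.QP() interface before setting up the SVM, here is a minimal sketch on a toy quadratic program (the numbers are purely illustrative and not part of the assignment):

library(quadprog)
# toy problem: minimize (1/2) b' I b - (1, 2)' b  subject to  b1 + b2 >= 1
toy_D  = diag(2)                    # Dmat: quadratic term
toy_d  = c(1, 2)                    # dvec: linear term
toy_A  = matrix(c(1, 1), nrow = 2)  # constraints are t(Amat) %*% b >= bvec
toy_b0 = 1                          # bvec
solve.QP(Dmat = toy_D, dvec = toy_d, Amat = toy_A, bvec = toy_b0)$solution
# should return (1, 2): the unconstrained minimizer already satisfies the constraint

For the SVM problems below, the task is to map the primal and dual formulations onto these four arguments.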
# this is how I generate the data
# set.seed(3)
# n = 30
# p = 2
# xpos = matrix(rnorm(n*p, mean=0, sd=1), n, p)
# xneg = matrix(rnorm(n*p, mean=3, sd=1), n, p)
# x = rbind(xpos, xneg)
# y = matrix(c(rep(1, n), rep(-1, n)))
#
# train = data.frame(x1 = x[, 1], x2 = x[, 2], y = y)
# write.csv(train, "SVM-Q1.csv", row.names = FALSE)
train = read.csv("SVM-Q1.csv")
x = as.matrix(train[, 1:2])
y = train[, 3]
plot(x, col=ifelse(y>0,"darkorange", "deepskyblue"), pch = 19, xlab = "x1", ylab = "x2")
legend("topleft", c("Positive","Negative"), col=c("darkorange", "deepskyblue"),
pch=c(19, 19), text.col=c("darkorange", "deepskyblue"))
Use the formulation defined on page 13 of the SVM lecture notes. The primal problem is
\[ \begin{align} \quad \underset{\beta_{0}, \boldsymbol{\beta}}{\text{minimize}} &\quad \frac{1}{2} \|\boldsymbol{\beta}\|^{2} \\ \text{subject to} &\quad y_{i}\left(x_{i}^{\top} \boldsymbol{\beta}+\beta_{0}\right) \geq 1, \,\, \text{for} \,\, i=1, \ldots, n \end{align} \]
Perform the following:

- Let \(b = (\beta_0, \boldsymbol\beta^\top)^\top\) in the solve.QP() function. Properly define \(\mathbf{D}\), \(d\), \(\mathbf{A}\), and \(b_0\) corresponding to this \(b\) for the linearly separable SVM primal problem.
- Obtain the solution using the solve.QP() function.

Note: The package requires \(\mathbf{D}\) to be positive definite, which is not true in our case. To address this problem, add \(10^{-10}\) to the top-left element of your \(\mathbf{D}\) matrix, which is the one corresponding to \(\beta_0\). This will make \(\mathbf{D}\) invertible. This may affect your results slightly, so be careful when plotting your support vectors.
Answer:
With \(b = (\beta_0, \boldsymbol\beta^\top)^\top\), define the \((p+1) \times (p+1)\) matrix \(\mathbf{D}\) (identity on the \(\boldsymbol\beta\) block, with a small ridge on the \(\beta_0\) entry), \(d = \mathbf{0}_{p+1}\), the \(n \times (p+1)\) matrix \(\mathbf{A}^\top\) with rows \(y_i (1, x_i^\top)\), and \(b_0 = \mathbf{1}_{n}\).
library(quadprog)
n = length(y)

# calculate D, d, A, b0
# D penalizes only beta (identity block); a tiny ridge on the beta0 entry makes D positive definite
D = diag(3)
D[1,1] = 1e-10
d = rep(0, 3)
# rows of A^T are y_i * (1, x_i^T), so A^T b >= 1 encodes y_i (beta0 + x_i^T beta) >= 1
A = t(sweep(cbind(1, x), 1, y, FUN = "*"))
b0 = rep(1, n)
# solve the quadratic program problem
result = solve.QP(Dmat = D, dvec = d, Amat = A, bvec = b0)$solution
beta = result[2:3]
beta0 = result[1]
# plot the decision line and the support vectors
plot(x,col=ifelse(y>0,"darkorange", "deepskyblue"), pch = 19, xlab = "x1", ylab = "x2")
points(x[sweep(x %*% beta + beta0, 1, y, FUN = "*") <= 1+1e-5, ], pch = 1, cex = 2)
legend("topleft", c("Positive","Negative", "support vector"),
col=c("darkorange", "deepskyblue", "black"), pch=c(19, 19, 1),
text.col=c("darkorange", "deepskyblue", "black"))
abline(-beta0/beta[2], -beta[1]/beta[2])
abline(-(beta0 - 1)/beta[2], -beta[1]/beta[2], lty=3)
abline(-(beta0 + 1)/beta[2], -beta[1]/beta[2], lty=3)
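As an optional sanity check (not required by the question), one can verify that the fitted solution satisfies all primal constraints and compute the resulting margin width \(2 / \lVert\boldsymbol\beta\rVert\):

# optional check: all constraints y_i (x_i' beta + beta0) >= 1 hold (up to numerical tolerance)
all(y * (x %*% beta + beta0) >= 1 - 1e-6)
# margin width 2 / ||beta||
2 / sqrt(sum(beta^2))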
Formulate the SVM dual problem on page 21 of the lecture notes. The dual problem is
\[ \begin{align} \underset{\boldsymbol \alpha}{\text{maximize}} & \quad \sum_{i=1}^{n} \alpha_{i}-\frac{1}{2} \sum_{i, j=1}^{n} y_{i} y_{j} \alpha_{i} \alpha_{j} x_{i}^{\top} x_{j} \\ \text{subject to} & \quad \alpha_{i} \geq 0, \,\, \text{for} \,\, i=1, \ldots, n \\ \text{and} & \quad \sum_{i=1}^{n} \alpha_{i} y_{i}=0 \end{align} \]
Perform the following:

- Properly define \(\mathbf{D}\), \(d\), \(\mathbf{A}\), and \(b_0\) for the dual problem. Pay attention to the equality constraint, which is handled through the meq argument.
- Obtain the solution using the solve.QP() function, and convert the solution into \(\boldsymbol\beta\) and \(\beta_0\).

You need to report:

- The fitted \(\boldsymbol\beta\) and \(\beta_0\), compared with the primal solution from the previous part
- Plot the decision line on top of the scatter plot of data. Include the two margin lines. Clearly mark the support vectors.

Note: Again, \(\mathbf{D}\) may not be positive definite. This time, add \(10^{-10}\) to all diagonal elements of \(\mathbf{D}\). This may affect your results slightly, so be careful when plotting your support vectors.
Answer:
To adapt to the input of solve.QP(), it is equivalent to consider
\[ \underset{\boldsymbol \alpha}{\text{minimize}} \quad \frac{1}{2} \sum_{i, j=1}^{n} y_{i} y_{j} \alpha_{i} \alpha_{j} x_{i}^{\top} x_{j} - \sum_{i=1}^{n} \alpha_{i} \] Now \(d = \mathbf{1}_{n \times 1}\) and \(D_{ij} = y_i y_j x_i^\top x_j\). \(\mathbf{A}^\top\) is an \((n+1) \times n\) matrix whose first row is \(y^\top\) (the equality constraint, passed with meq = 1) and whose remaining rows form the identity (the constraints \(\alpha_i \geq 0\)), with \(b_0 = \mathbf{0}_{n+1}\).
# calculate D, d, A, b0
D = x %*% t(x) * y %*% t(y)     # D_ij = y_i y_j x_i' x_j
D = D + diag(1e-10, n)          # add ridge to make D positive definite
d = rep(1, n)
# first constraint (equality, meq = 1): sum(alpha_i y_i) = 0; the rest: alpha_i >= 0
A = t(rbind(t(y), diag(1, n)))
b0 = rep(0, n+1)
# solve the quadratic program
alpha = solve.QP(D, d, A, b0, meq = 1)$solution
# beta in Q1a
beta_1a = c(beta0, beta)
# get beta for 1b
beta = t(x) %*% (alpha * y)
beta0 = -(max((x %*% beta)[y == -1])+ min((x %*% beta)[y == 1]))/2
beta_1b = c(beta0, beta)
beta_tab = rbind(beta_1a, beta_1b)
rownames(beta_tab) = c("Q1a", "Q1b")
colnames(beta_tab) = paste0("beta", 0:2)
knitr::kable(beta_tab, digits = 3,
caption = "beta obtained from Q1a and Q1b")
|     | beta0 | beta1  | beta2  |
|-----|-------|--------|--------|
| Q1a | 3.419 | -1.046 | -0.999 |
| Q1b | 3.419 | -1.046 | -0.999 |
# plot the decision line and the support vectors
plot(x,col=ifelse(y>0,"darkorange", "deepskyblue"), pch = 19,
xlab = "x1", ylab = "x2")
# find the points with large enough (significantly non-zero) alpha
points(x[alpha > 0.1,], pch = 1, cex = 2)
legend("topleft", c("Positive","Negative", "support vector"),
col=c("darkorange", "deepskyblue", "black"), pch=c(19, 19, 1),
text.col=c("darkorange", "deepskyblue", "black"))
abline(-beta0/beta[2], -beta[1]/beta[2])
abline(-(beta0 - 1)/beta[2], -beta[1]/beta[2], lty=3)
abline(-(beta0 + 1)/beta[2], -beta[1]/beta[2], lty=3)
# L1 norm of the difference between the solutions from (a) and (b)
sum(abs(beta_tab[1, ] - beta_tab[2, ]))
## [1] 6.353886e-05
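As a further optional check, one can verify that the primal (a) and dual (b) solutions classify every training point identically:

# optional: compare the classifications given by the two solutions
pred_a = sign(beta_1a[1] + x %*% beta_1a[2:3])
pred_b = sign(beta_1b[1] + x %*% beta_1b[2:3])
all(pred_a == pred_b)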
In this question, we will follow the formulation on page 30 of the lecture notes to solve a linearly nonseparable SVM. The dual problem is given by
\[ \begin{align} \underset{\boldsymbol \alpha}{\text{maximize}} & \quad \sum_{i=1}^{n} \alpha_{i}-\frac{1}{2} \sum_{i, j=1}^{n} y_{i} y_{j} \alpha_{i} \alpha_{j} x_{i}^{\top} x_{j} \\ \text{subject to} & \quad 0 \leq \alpha_{i} \leq C, \,\, \text{for} \,\, i=1, \ldots, n \\ \text{and} & \quad \sum_{i=1}^{n} \alpha_{i} y_{i}=0 \end{align} \]
Perform the following:

- Properly define \(\mathbf{D}\), \(d\), \(\mathbf{A}\), and \(b_0\) for this dual problem. Pay attention to the equality constraint, which is handled through the meq argument.
- Obtain the solution using the solve.QP() function, and convert the solution into \(\boldsymbol\beta\) and \(\beta_0\).

Note: Again, \(\mathbf{D}\) may not be positive definite; add \(10^{-10}\) to all diagonal elements of \(\mathbf{D}\).

You need to report:

- The training error
- The fitted \(\boldsymbol\beta\) and \(\beta_0\)
- Plot the decision line on top of the scatter plot of data. Include the two margin lines. Clearly mark the support vectors.
# set.seed(20)
# n = 200 # number of data points for each class
# p = 2 # dimension
# Generate the positive and negative examples
# xpos <- matrix(rnorm(n*p,mean=0,sd=1),n,p)
# xneg <- matrix(rnorm(n*p,mean=1.5,sd=1),n,p)
# x <- rbind(xpos,xneg)
# y <- c(rep(-1, n), rep(1, n))
# train = data.frame(x1 = x[, 1], x2 = x[, 2], y = y)
# write.csv(train, "SVM-Q2.csv", row.names = FALSE)
train = read.csv("SVM-Q2.csv")
x = as.matrix(train[, 1:2])
y = train[, 3]
plot(x, col=ifelse(y>0,"darkorange", "deepskyblue"), pch = 19, xlab = "x1", ylab = "x2")
legend("topleft", c("Positive","Negative"), col=c("darkorange", "deepskyblue"),
pch=c(19, 19), text.col=c("darkorange", "deepskyblue"))
Answer:
# calculate D, d, A, b0
n = length(y)
D = x %*% t(x) * y %*% t(y) + diag(1e-10, n)
d = rep(1, n)
A = t(rbind(t(y), diag(1, n), diag(-1, n)))
C = 1
b0 = c(rep(0, n+1), rep(-C, n))
# solve the quadratic program
alpha = solve.QP(D, d, A, b0, meq = 1)$solution
# thresholding alpha for numerical errors
alpha[alpha < 1e-5] = 0
alpha[alpha > 1 - 1e-5] = 1
# index of support vectors on the margin (0 < alpha < C); these are used to compute beta0
support_vec = which(alpha > 1e-2 & alpha < C - 1e-2)
# plot the data
plot(x,col=ifelse(y>0,"darkorange", "deepskyblue"), pch = 19,
xlab = "x1", ylab = "x2")
# plot support vectors
points(x[support_vec, ], pch = 1, cex = 2)
legend("topleft", c("Positive","Negative", "support vector"),
col=c("darkorange", "deepskyblue", "black"), pch=c(19, 19, 1),
text.col=c("darkorange", "deepskyblue", "black"))
# recover beta from the dual solution
beta = t(x) %*% (alpha * y)
# recover beta0 by averaging the fitted values of the on-margin support vectors from each class
beta0 = -(mean((x[support_vec, ] %*% beta)[y[support_vec] == -1]) +
          mean((x[support_vec, ] %*% beta)[y[support_vec] == 1]))/2
abline(-beta0/beta[2], -beta[1]/beta[2])
abline(-(beta0 - 1)/beta[2], -beta[1]/beta[2], lty=3)
abline(-(beta0 + 1)/beta[2], -beta[1]/beta[2], lty=3)
print(c(beta0, beta))
## [1] -1.716482 1.151736 1.111614
# training error
train_err = mean((beta0 + x %*% beta) * y < 0)
print(train_err)
## [1] 0.145
The training error is 0.145.
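As an additional optional check (not required by the question), the error can be broken down by class with a confusion table:

# optional: confusion table of predicted vs. true labels
pred = sign(drop(beta0 + x %*% beta))
table(predicted = pred, truth = y)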
We can also perform linear and nonlinear classification using the penalized loss framework. In this question, we will only use the linear version. Use the same dataset in Question 2. Consider the following logistic loss function:
\[L(y, f(x)) = \log(1 + e^{- y f(x)}).\] The remaining task is to solve this optimization problem given the functional form of \(f(x)\). To do this, we will utilize a general-purpose optimization package/function. For example, in R, you can use the optim() function. Read the documentation of this function (or an equivalent one in Python) and set up the objective function properly to solve for the parameters. If you need an example of how to use the optim() function, read the corresponding part in the example file provided on our course website (Section 10).
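As a minimal illustration of the optim() interface (a toy problem, not the course example referenced above), consider minimizing a simple quadratic with BFGS:

# toy example: minimize f(a) = (a1 - 1)^2 + (a2 + 2)^2 using an analytic gradient
toy_f  = function(a) (a[1] - 1)^2 + (a[2] + 2)^2
toy_gr = function(a) c(2 * (a[1] - 1), 2 * (a[2] + 2))
optim(par = c(0, 0), fn = toy_f, gr = toy_gr, method = "BFGS")$par
# should return values close to (1, -2)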
If we let \(f(x)\) be a linear function, the SVM can be solved by optimizing a penalized loss: \[ \underset{\beta_0, \boldsymbol\beta}{\arg\min} \quad \sum_{i=1}^n L(y_i, \beta_0 + x_i^T \boldsymbol\beta) + \lambda \lVert \boldsymbol\beta \rVert^2\] You should use the data from Question 2, and answer these questions:

- Write a function to calculate the objective function SVMfn(b, x, y, lambda) and its gradient SVMgn(b, x, y, lambda).
- Solve the optimization problem using optim() with your objective and gradient functions, \(\lambda = 1\), and the BFGS method. Use 0 as the initial value.

Report the following:

- The fitted \(\beta_0\) and \(\boldsymbol\beta\), with the decision line plotted on top of the scatter plot of data
- The training error and the training loss
- Try fitting optim() without the gradient function and compare the parameters to your previous ones. Note this will be much slower. You are not required to report this result.

Answer:
By \(L(y, f(x)) = \log(1 + e^{- y f(x)})\) and \(f(x) = x^T \alpha\) (to reduce notation, we combine \(\beta_0\) and \(\boldsymbol\beta\) into a single vector \(\alpha\), with the intercept absorbed into \(x\)), the gradient of \(L\) with respect to \(\alpha\) is \[ \frac{\partial L}{\partial \alpha} = \frac{\partial L}{\partial f} \frac{\partial f}{\partial \alpha} = \frac{-y \exp[-y f(x)]}{1 + \exp[-y f(x)]} x \] where \[ \frac{\partial L}{\partial f} = \frac{-y \exp[-y f(x)]}{1 + \exp[-y f(x)]}, \qquad \frac{\partial f}{\partial \alpha} = x. \]
Therefore, the gradient of the objective function is \(\sum_{i = 1}^n \frac{\partial L_i}{\partial \alpha} + 2 \lambda (0, \boldsymbol\beta^T)^T\). We pick \(\lambda = 1\) (other reasonable values of \(\lambda\) are also fine).
# loss function
svmfn <- function(b, x, y, lambda){
sum(log(1 + exp(- (x %*% b) * y))) + lambda * sum(b[-1]^2)
}
# gradient of loss function
svmgn <- function(b, x, y, lambda){
e1 = exp(-y*(x %*% b))
e2 = y*e1 / (1+e1)
return( - t(x) %*% e2 + c(0, 2*lambda*b[-1]) )
}
mysvmfit <- optim(par = rep(0, 3), fn = svmfn, gr = svmgn,
x = cbind(1, x), y = y,
lambda = 1, method = "BFGS")
beta0 = mysvmfit$par[1]
beta = mysvmfit$par[2:3]
print(c(beta0, beta))
## [1] -2.280224 1.575649 1.509805
plot(x, pch = 19, xlab = "x1", ylab = "x2",
col = ifelse(y>0,"darkorange", "deepskyblue"))
legend("topleft", c("Positive","Negative"),
col=c("darkorange", "deepskyblue"),
pch=c(19, 19), text.col=c("darkorange", "deepskyblue"))
abline(a = -beta0/beta[2], b = -beta[1]/beta[2], col = "black", lty = 1, lwd = 2)
Training error and loss
# training err and loss
train_err = 1 - mean( y * (x %*% beta + beta0) > 0)
train_loss = mysvmfit$value
print(c(train_err, train_loss))
## [1] 0.1475 129.8600
The training error is 0.1475 and the training loss is about 129.86.
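As the optional check mentioned in the question (not required to report), one can also call optim() without supplying the gradient, in which case a finite-difference approximation is used; this is slower but should give nearly identical parameters:

# optional: BFGS with numerical gradients (slower); compare with mysvmfit$par
mysvmfit_nogr <- optim(par = rep(0, 3), fn = svmfn,
                       x = cbind(1, x), y = y,
                       lambda = 1, method = "BFGS")
round(cbind(with_gradient = mysvmfit$par, numerical_gradient = mysvmfit_nogr$par), 4)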