Instruction

Students are encouraged to work together on homework. However, sharing, copying, or providing any part of a homework solution or code is an infraction of the University’s rules on Academic Integrity. Any violation will be punished as severely as possible. Final submissions must be uploaded to compass2g. No email or hardcopy will be accepted. For late submission policy and grading rubrics, please refer to the course website.

About HW8

In this HW, we will code both the primal and dual forms of SVM and utilize a general-purpose quadratic programming solver (the quadprog package) to obtain the solution.

Question 1 [50 Points] Solving SVM using Quadratic Programming

Install the quadprog package. The same package is also available in Python; however, make sure to read its documentation carefully. We will utilize the function solve.QP to solve the SVM. This function performs the minimization problem: \[\begin{align} \text{minimize} & \quad \frac{1}{2} b^T \mathbf{D} b - d^T b, \nonumber \\ \text{subject to} & \quad \mathbf{A}^T b \geq b_0, \nonumber \end{align}\] where \(b\) is the unknown parameter. For more details, read the documentation of the package on CRAN. Use the provided training data. This is a linearly separable problem.
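If you have not used solve.QP() before, here is a minimal toy sketch (unrelated to the SVM data; all object names are made up for illustration) showing how the four inputs map onto the formulation above:

library(quadprog)

# toy problem: minimize 1/2*(b1^2 + b2^2) - (b1 + b2)  subject to  b1 + b2 >= 3
D_toy  = diag(2)                     # Dmat
d_toy  = c(1, 1)                     # dvec
A_toy  = matrix(c(1, 1), nrow = 2)   # one column per constraint: t(Amat) %*% b >= bvec
b0_toy = 3                           # bvec
solve.QP(Dmat = D_toy, dvec = d_toy, Amat = A_toy, bvec = b0_toy)$solution
# the constraint is active, so the solution should be c(1.5, 1.5)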

# this is how I generate the data
# set.seed(3)
# n = 30
# p = 2
# xpos = matrix(rnorm(n*p, mean=0, sd=1), n, p)
# xneg = matrix(rnorm(n*p, mean=3, sd=1), n, p)
# x = rbind(xpos, xneg)
# y = matrix(c(rep(1, n), rep(-1, n)))
# 
# train = data.frame(x1 = x[, 1], x2 = x[, 2], y = y)
# write.csv(train, "SVM-Q1.csv", row.names = FALSE)

train = read.csv("SVM-Q1.csv")
x = as.matrix(train[, 1:2])
y = train[, 3]

plot(x, col=ifelse(y>0,"darkorange", "deepskyblue"), pch = 19, xlab = "x1", ylab = "x2")
legend("topleft", c("Positive","Negative"), col=c("darkorange", "deepskyblue"), 
       pch=c(19, 19), text.col=c("darkorange", "deepskyblue"))

a) [25 points] The Primal Form

Use the formulation defined on page 13 of the SVM lecture note. The primal problem is

\[ \begin{align} \quad \underset{\beta_{0}, \boldsymbol{\beta}}{\text{minimize}} &\quad \frac{1}{2} \|\boldsymbol{\beta}\|^{2} \\ \text{subject to} &\quad y_{i}\left(x_{i}^{\top} \boldsymbol{\beta}+\beta_{0}\right) \geq 1, \,\, \text{for} \,\, i=1, \ldots, n \end{align} \]

Perform the following:

  • Let \(b = (\beta_0, \boldsymbol \beta)\) in the solve.QP() function. Properly define \(\mathbf{D}\), \(d\), \(\mathbf{A}\) and \(b_0\) corresponding to this \(b\) for the linearly separable SVM primal problem.
  • Calculate the decision function by solving this optimization problem with the solve.QP() function.
  • Report your \(\beta_0\) and \(\boldsymbol \beta\).
  • Plot the decision line on top of the previous training data scatter plot. Include the two margin lines. Clearly mark the support vectors.

Note: The package requires \(\mathbf{D}\) to be positive definite, which is not true in our case. To address this problem, add \(10^{-10}\) to the top-left element of your \(\mathbf{D}\) matrix, which is the one corresponding to \(\beta_0\). This will make \(\mathbf{D}\) invertible. It may affect your results slightly, so be careful when plotting your support vectors.

Answer:

Define a \((p+1) \times (p+1)\) matrix \(D\), an \(n \times (p+1)\) matrix \(A^T\), and take \(b_0 = \mathbf{1}_n\) (an \(n\)-vector of ones).

  • \(\beta_0\) is included in the parameter vector \(b\), so it also appears in the objective function.
  • A small number (\(10^{-10}\)) is added to the first diagonal element of \(D\) to make it positive definite.
  • The \(i\)-th row of \(A^T\) is \(y_i (1, x_i^T)\).
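In matrix form, these choices amount to passing solve.QP() the inputs

\[ \mathbf{D} = \begin{pmatrix} 10^{-10} & \mathbf{0}^\top \\ \mathbf{0} & \mathbf{I}_p \end{pmatrix}, \quad d = \mathbf{0}_{p+1}, \quad \mathbf{A}^\top = \begin{pmatrix} y_1 & y_1 x_1^\top \\ \vdots & \vdots \\ y_n & y_n x_n^\top \end{pmatrix}, \quad b_0 = \mathbf{1}_n, \]

so that the \(i\)-th row of \(\mathbf{A}^\top b \geq b_0\) is exactly \(y_i(\beta_0 + x_i^\top \boldsymbol\beta) \geq 1\).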
library(quadprog)
n = length(y)
# calculate D, d, A, b0
D = diag(3)
D[1,1] = 1e-10
d = rep(0, 3)
A = t(sweep(cbind(1, x), 1, y, FUN = "*"))  # column i of A is y_i * (1, x_i)
b0 = rep(1, n)

# solve the quadratic program problem
result = solve.QP(Dmat = D, dvec = d, Amat = A, bvec = b0)$solution
beta = result[2:3]
beta0 = result[1]

# plot the decision line and the support vectors
plot(x,col=ifelse(y>0,"darkorange", "deepskyblue"), pch = 19, xlab = "x1", ylab = "x2")
points(x[sweep(x %*% beta + beta0, 1, y, FUN = "*") <= 1+1e-5, ], pch = 1, cex = 2)
legend("topleft", c("Positive","Negative", "support vector"),
       col=c("darkorange", "deepskyblue", "black"), pch=c(19, 19, 1), 
       text.col=c("darkorange", "deepskyblue", "black"))
abline(-beta0/beta[2], -beta[1]/beta[2])
abline(-(beta0 - 1)/beta[2], -beta[1]/beta[2], lty=3)
abline(-(beta0 + 1)/beta[2], -beta[1]/beta[2], lty=3)

b) [25 points] The Dual Form

Formulate the SVM dual problem on page 21 of the lecture note. The dual problem is

\[ \begin{align} \underset{\boldsymbol \alpha}{\text{maximize}} & \quad \sum_{i=1}^{n} \alpha_{i}-\frac{1}{2} \sum_{i, j=1}^{n} y_{i} y_{j} \alpha_{i} \alpha_{j} x_{i}^{\top} x_{j} \\ \text{subject to} & \quad \alpha_{i} \geq 0, \,\, \text{for} \,\, i=1, \ldots, n \\ \text{and} & \quad \sum_{i=1}^{n} \alpha_{i} y_{i}=0 \end{align} \]

Perform the following:

  • Let \(b = (\alpha_1, \ldots, \alpha_n)^T\). Then properly define \(\mathbf{D}\), \(d\), \(\mathbf{A}\) and \(b_0\) corresponding to this \(b\) for our SVM problem.
  • Note: Equality constraints can be addressed using the meq argument.
  • Obtain the solution using the solve.QP() function, and convert the solution into \(\boldsymbol \beta\) and \(\beta_0\).

You need to report

  • A table including \((\beta_0, \beta_1, \beta_2)\) from both Q1a and Q1b. Keep only the first three digits after the decimal point.
  • Plot the decision line on top of our scatter plot. Include the two margin lines. Clearly mark the support vectors.
  • Report the \(\ell_1\) norm of \(\beta_{Q1a} - \beta_{Q1b}\), where \(\beta_{Q1a}\) and \(\beta_{Q1b}\) are the 3-dimensional solutions obtained in Q1a and Q1b, respectively.

Note: Again, \(\mathbf{D}\) may not be positive definite. This time, add \(10^{-10}\) to all diagonal elements of \(\mathbf{D}\). This may affect your results slightly, so be careful when plotting your support vectors.

Answer:

To match the input form of solve.QP(), which performs minimization, it is equivalent to consider

\[ \underset{\boldsymbol \alpha}{\text{minimize}} \quad \frac{1}{2} \sum_{i, j=1}^{n} y_{i} y_{j} \alpha_{i} \alpha_{j} x_{i}^{\top} x_{j} - \sum_{i=1}^{n} \alpha_{i}. \] Now \(d = \mathbf{1}_{n}\), \(D_{ij} = y_i y_j x_i^{\top} x_j\), and \(\mathbf{A}^{\top}\) is an \((n+1) \times n\) matrix.

  • We put the equality constraint \(\sum_{i=1}^{n} \alpha_i y_i = 0\) in the first row of \(A^{\top}\) and set meq = 1.
  • A small ridge (\(10^{-10} I\)) is added to \(D\) to make it positive definite.
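Explicitly, the constraint inputs are

\[ \mathbf{A}^\top = \begin{pmatrix} y^\top \\ \mathbf{I}_n \end{pmatrix} \in \mathbb{R}^{(n+1) \times n}, \quad b_0 = \mathbf{0}_{n+1}, \]

with meq = 1 so that the first row, \(\sum_{i=1}^{n} \alpha_i y_i = 0\), is treated as an equality and the remaining \(n\) rows enforce \(\alpha_i \geq 0\). After solving, \(\boldsymbol\beta = \sum_{i=1}^{n} \alpha_i y_i x_i\), and \(\beta_0\) is recovered from the support vectors.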
# calculate D,  d, A, b0
D = x %*% t(x) * y %*% t(y)
D = D + diag(1e-10, n) # add ridge
d = rep(1, n)  
A = t(rbind(t(y), diag(1, n)))
b0 = rep(0, n+1)

# solve the quadratic program
alpha = solve.QP(D, d, A, b0, meq = 1)$solution

# beta in Q1a
beta_1a = c(beta0, beta)  

# get beta for 1b: beta = sum_i alpha_i * y_i * x_i
beta = t(x) %*% (alpha * y)
# place beta0 so the boundary lies midway between the closest point of each class
beta0 = -(max((x %*% beta)[y == -1]) + min((x %*% beta)[y == 1]))/2
beta_1b = c(beta0, beta)

beta_tab = rbind(beta_1a, beta_1b)
rownames(beta_tab) = c("Q1a", "Q1b")
colnames(beta_tab) = paste0("beta", 0:2)
knitr::kable(beta_tab, digits = 3, 
             caption = "beta obtained from Q1a and Q1b")
beta obtained from Q1a and Q1b

        beta0    beta1    beta2
Q1a     3.419   -1.046   -0.999
Q1b     3.419   -1.046   -0.999
# plot the decision line and the support vectors
plot(x,col=ifelse(y>0,"darkorange", "deepskyblue"), pch = 19, 
     xlab = "x1", ylab = "x2")

# find the points with large enough (significantly non-zero) alpha
points(x[alpha > 0.1,], pch = 1, cex = 2)
legend("topleft", c("Positive","Negative", "support vector"),
       col=c("darkorange", "deepskyblue", "black"), pch=c(19, 19, 1), 
       text.col=c("darkorange", "deepskyblue", "black"))

abline(-beta0/beta[2], -beta[1]/beta[2])
abline(-(beta0 - 1)/beta[2], -beta[1]/beta[2], lty=3)
abline(-(beta0 + 1)/beta[2], -beta[1]/beta[2], lty=3)

# L1 norm of beta in a and b
sum(abs(beta_tab[1, ] - beta_tab[2, ]))
## [1] 6.353886e-05

Question 2 [20 Points] Linearly nonseparable SVM

In this question, we will follow the formulation on page 30 of the lecture note to solve a linearly nonseparable SVM. The dual problem is given by

\[ \begin{align} \underset{\boldsymbol \alpha}{\text{maximize}} & \quad \sum_{i=1}^{n} \alpha_{i}-\frac{1}{2} \sum_{i, j=1}^{n} y_{i} y_{j} \alpha_{i} \alpha_{j} x_{i}^{\top} x_{j} \\ \text{subject to} & \quad 0 \leq \alpha_{i} \leq C, \,\, \text{for} \,\, i=1, \ldots, n \\ \text{and} & \quad \sum_{i=1}^{n} \alpha_{i} y_{i}=0 \end{align} \]

Perform the following and report:

  • The training error
  • The fitted \(\boldsymbol \beta\) and \(\beta_0\)
  • A plot of the decision line on top of the scatter plot of the data, including the two margin lines, with the support vectors clearly marked

# set.seed(20)
# n = 200 # number of data points for each class
# p = 2 # dimension

# Generate the positive and negative examples
# xpos <- matrix(rnorm(n*p,mean=0,sd=1),n,p)
# xneg <- matrix(rnorm(n*p,mean=1.5,sd=1),n,p)
# x <- rbind(xpos,xneg)
# y <- c(rep(-1, n), rep(1, n))
# train = data.frame(x1 = x[, 1], x2 = x[, 2], y = y)
# write.csv(train, "SVM-Q2.csv", row.names = FALSE)

train = read.csv("SVM-Q2.csv")
x = as.matrix(train[, 1:2])
y = train[, 3]

plot(x, col=ifelse(y>0,"darkorange", "deepskyblue"), pch = 19, xlab = "x1", ylab = "x2")
legend("topleft", c("Positive","Negative"), col=c("darkorange", "deepskyblue"), 
       pch=c(19, 19), text.col=c("darkorange", "deepskyblue"))

Answer:
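The setup mirrors Q1b; the only change is the box constraint. Splitting \(0 \leq \alpha_i \leq C\) into \(\alpha_i \geq 0\) and \(-\alpha_i \geq -C\) gives

\[ \mathbf{A}^\top = \begin{pmatrix} y^\top \\ \mathbf{I}_n \\ -\mathbf{I}_n \end{pmatrix} \in \mathbb{R}^{(2n+1) \times n}, \quad b_0 = \big(0, \; \mathbf{0}_n^\top, \; -C \, \mathbf{1}_n^\top \big)^\top, \]

again with meq = 1 for the equality constraint in the first row. We take \(C = 1\) below.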

# calculate D,  d, A, b0
n = length(y)
D = x %*% t(x) * y %*% t(y) + diag(1e-10, n)
d = rep(1, n)
A = t(rbind(t(y), diag(1, n), diag(-1, n)))
C = 1
b0 = c(rep(0, n+1), rep(-C, n))

# solve the quadratic program
alpha = solve.QP(D, d, A, b0, meq = 1)$solution

# thresholding alpha for numerical errors 
alpha[alpha < 1e-5] = 0
alpha[alpha > 1 - 1e-5] = 1

# get index for support vectors
support_vec = which(alpha > 1e-2 & alpha < C - 1e-2)

# plot the data
plot(x,col=ifelse(y>0,"darkorange", "deepskyblue"), pch = 19, 
     xlab = "x1", ylab = "x2")

# plot support vectors
points(x[support_vec, ], pch = 1, cex = 2)
legend("topleft", c("Positive","Negative", "support vector"),
       col=c("darkorange", "deepskyblue", "black"), pch=c(19, 19, 1), 
       text.col=c("darkorange", "deepskyblue", "black"))

beta = t(x) %*% (alpha * y)
beta0 = -(mean((x[support_vec, ] %*% beta)[y[support_vec] == -1]) + mean((x[support_vec, ] %*% beta)[y[support_vec] == 1]))/2

abline(-beta0/beta[2], -beta[1]/beta[2])
abline(-(beta0 - 1)/beta[2], -beta[1]/beta[2], lty=3)
abline(-(beta0 + 1)/beta[2], -beta[1]/beta[2], lty=3)

print(c(beta0, beta))
## [1] -1.716482  1.151736  1.111614
# training error
train_err = mean((beta0 + x %*% beta) * y < 0)
print(train_err)
## [1] 0.145

The fitted coefficients are \(\beta_0 = -1.716\) and \(\boldsymbol\beta = (1.152, 1.112)\), and the training error is 0.145.

Question 3 [30 Points] Penalized Loss Linear SVM

We can also perform linear and nonlinear classification using the penalized loss framework. In this question, we will only use the linear version. Use the same dataset in Question 2. Consider the following logistic loss function:

\[L(y, f(x)) = \log(1 + e^{- y f(x)}).\] The rest of the job is to solve this optimization problem given the functional form of \(f(x)\). To do this, we will utilize a general-purpose optimization package/function. For example, in R, you can use the optim() function. Read the documentation of this function (or an equivalent one in Python) and set up the objective function properly to solve for the parameters. If you need an example of how to use the optim() function, read the corresponding part in the example file provided on our course website here (Section 10).
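If optim() is new to you, here is a minimal toy sketch (a made-up quadratic objective, not the SVM loss) showing the interface: an initial value, the objective, an optional gradient, and a method.

# toy example: minimize f(b) = (b1 - 1)^2 + (b2 + 2)^2 with BFGS
toy_fn = function(b) (b[1] - 1)^2 + (b[2] + 2)^2
toy_gr = function(b) c(2 * (b[1] - 1), 2 * (b[2] + 2))
optim(par = c(0, 0), fn = toy_fn, gr = toy_gr, method = "BFGS")$par
# should converge to approximately c(1, -2)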

If we let \(f(x)\) be a linear function, the SVM can be solved by optimizing the penalized loss: \[ \underset{\beta_0, \boldsymbol\beta}{\arg\min} \quad \sum_{i=1}^n L(y_i, \beta_0 + x_i^T \boldsymbol\beta) + \lambda \lVert \boldsymbol\beta \rVert^2\] You should use the data from Question 2.

Report the following:

Answer:

With \(L(y, f(x)) = \log(1 + e^{- y f(x)})\) and \(f(x) = x^T \alpha\) (to reduce notation, we combine \(\beta_0\) and \(\boldsymbol\beta\) into a single vector \(\alpha\) and prepend a 1 to each \(x_i\)), the gradient of \(L\) with respect to \(\alpha\) is \[ \frac{\partial L}{\partial \alpha} = \frac{\partial L}{\partial f} \frac{\partial f}{\partial \alpha} = \frac{-y \exp[-y f(x)]}{1 + \exp[-y f(x)]}\, x, \] where \[ \frac{\partial L}{\partial f} = \frac{-y \exp[-y f(x)]}{1 + \exp[-y f(x)]}, \qquad \frac{\partial f}{\partial \alpha} = x. \]

Therefore, the gradient of the penalized objective is \(\left[\sum_{i = 1}^n \frac{\partial L_i}{\partial \alpha}\right] + 2 \lambda (0, \boldsymbol\beta^T)^T\). We pick \(\lambda = 1\) (any other reasonable \(\lambda\) is also fine).

# loss function 
svmfn <- function(b, x, y, lambda){
  sum(log(1 + exp(- (x %*% b) * y))) + lambda * sum(b[-1]^2)
}

# gradient of loss function
svmgn <- function(b, x, y, lambda){
  e1 = exp(-y*(x %*% b))
  e2 = y*e1 / (1+e1)
  return( - t(x) %*% e2 + c(0, 2*lambda*b[-1]) )
}

mysvmfit <- optim(par = rep(0, 3), fn = svmfn, gr = svmgn,
            x = cbind(1, x), y = y,
            lambda = 1, method = "BFGS")

beta0 = mysvmfit$par[1]
beta = mysvmfit$par[2:3]

print(c(beta0, beta))
## [1] -2.280224  1.575649  1.509805
plot(x, pch = 19, xlab = "x1", ylab = "x2",
     col = ifelse(y>0,"darkorange", "deepskyblue"))
legend("topleft", c("Positive","Negative"), 
       col=c("darkorange", "deepskyblue"), 
       pch=c(19, 19), text.col=c("darkorange", "deepskyblue"))
  
abline(a = -beta0/beta[2], b = -beta[1]/beta[2], col = "black", lty = 1, lwd = 2)

Training error and loss

# training err and loss
train_err = 1 - mean( y * (x %*% beta + beta0) > 0)
train_loss = mysvmfit$value
print(c(train_err, train_loss))
## [1]   0.1475 129.8600

The fitted coefficients are \(\beta_0 = -2.280\) and \(\boldsymbol\beta = (1.576, 1.510)\). The training error is 0.1475 and the training loss is 129.86.