Instruction

Students are encouraged to work together on homework. However, sharing, copying, or providing any part of a homework solution or code is an infraction of the University’s rules on Academic Integrity. Any violation will be punished as severely as possible. Final submissions must be uploaded to compass2g. No email or hardcopy will be accepted. For late submission policy and grading rubrics, please refer to the course website.

About HW8

In this HW, we will code both the primal and dual forms of SVM and utilize a general-purpose quadratic programming solver (the quadprog package) to obtain the solution.

Question 1 [50 Points] Solving SVM using Quadratic Programming

Install the quadprog package. The same package is also available in Python; however, make sure to read its documentation carefully. We will utilize the function solve.QP to solve the SVM. This function performs the minimization problem: \[\begin{align} \text{minimize} & \quad \frac{1}{2} b^T \mathbf{D} b - d^T b, \nonumber \\ \text{subject to} & \quad \mathbf{A}^T b \geq b_0, \nonumber \end{align}\] where \(b\) is the unknown parameter. For more details, read the documentation of the package on CRAN. Use the provided training data. This is a linearly separable problem.
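If you have not used solve.QP() before, here is a minimal toy sketch (unrelated to the SVM data; all object names are made up for illustration) showing how the four inputs map onto the formulation above:

library(quadprog)

# toy problem: minimize 1/2*(b1^2 + b2^2) - (b1 + b2)  subject to  b1 + b2 >= 3
D_toy  = diag(2)                     # Dmat
d_toy  = c(1, 1)                     # dvec
A_toy  = matrix(c(1, 1), nrow = 2)   # one column per constraint: t(Amat) %*% b >= bvec
b0_toy = 3                           # bvec
solve.QP(Dmat = D_toy, dvec = d_toy, Amat = A_toy, bvec = b0_toy)$solution
# the constraint is active, so the solution should be c(1.5, 1.5)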

# this is how I generate the data
# set.seed(3)
# n = 30
# p = 2
# xpos = matrix(rnorm(n*p, mean=0, sd=1), n, p)
# xneg = matrix(rnorm(n*p, mean=3, sd=1), n, p)
# x = rbind(xpos, xneg)
# y = matrix(c(rep(1, n), rep(-1, n)))
# 
# train = data.frame(x1 = x[, 1], x2 = x[, 2], y = y)
# write.csv(train, "SVM-Q1.csv", row.names = FALSE)

train = read.csv("SVM-Q1.csv")
x = as.matrix(train[, 1:2])
y = train[, 3]

plot(x, col=ifelse(y>0,"darkorange", "deepskyblue"), pch = 19, xlab = "x1", ylab = "x2")
legend("topleft", c("Positive","Negative"), col=c("darkorange", "deepskyblue"), 
       pch=c(19, 19), text.col=c("darkorange", "deepskyblue"))

a) [25 points] The Primal Form

Use the formulation defined on page 13 of the SVM lecture note. The primal problem is

\[ \begin{align} \quad \underset{\beta_{0}, \boldsymbol{\beta}}{\text{minimize}} &\quad \frac{1}{2} \|\boldsymbol{\beta}\|^{2} \\ \text{subject to} &\quad y_{i}\left(x_{i}^{\top} \boldsymbol{\beta}+\beta_{0}\right) \geq 1, \,\, \text{for} \,\, i=1, \ldots, n \end{align} \]

Perform the following:

  • Let \(b = (\beta_0, \boldsymbol \beta)\) in the solve.QP() function. Properly define \(\mathbf{D}\), \(d\), \(\mathbf{A}\) and \(b_0\) corresponding to this \(b\) for the linearly separable SVM primal problem.
  • Calculate the decision function by solving this optimization problem with the solve.QP() function.
  • Report your \(\beta_0\) and \(\boldsymbol \beta\).
  • Plot the decision line on top of the previous training data scatter plot. Include the two margin lines. Clearly mark the support vectors.

Note: The package requires \(\mathbf{D}\) to be positive definite, which is not true in our case. To address this problem, add \(10^{-10}\) to the top-left element of your \(\mathbf{D}\) matrix, which is the one corresponding to \(\beta_0\). This will make \(\mathbf{D}\) invertible. It may affect your results slightly, so be careful when plotting your support vectors.

Answer:

Define a \((p+1) \times (p+1)\) matrix \(D\), an \(n \times (p+1)\) matrix \(A^T\), and take \(b_0 = \mathbf{1}_n\) (an \(n\)-vector of ones).

  • \(\beta_0\) is included in the parameter vector \(b\), so it also appears in the objective function.
  • A small number (\(10^{-10}\)) is added to the first diagonal element of \(D\) to make it positive definite.
  • The \(i\)-th row of \(A^T\) is \(y_i (1, x_i^T)\).
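In matrix form, these choices amount to passing solve.QP() the inputs

\[ \mathbf{D} = \begin{pmatrix} 10^{-10} & \mathbf{0}^\top \\ \mathbf{0} & \mathbf{I}_p \end{pmatrix}, \quad d = \mathbf{0}_{p+1}, \quad \mathbf{A}^\top = \begin{pmatrix} y_1 & y_1 x_1^\top \\ \vdots & \vdots \\ y_n & y_n x_n^\top \end{pmatrix}, \quad b_0 = \mathbf{1}_n, \]

so that the \(i\)-th row of \(\mathbf{A}^\top b \geq b_0\) is exactly \(y_i(\beta_0 + x_i^\top \boldsymbol\beta) \geq 1\).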
library(quadprog)
n = length(y)
# calculate D, d, A, b0
D = diag(3)
D[1,1] = 1e-10
d = rep(0, 3)
A = t(sweep(cbind(1, x), 1, y, FUN = "*"))  # column i of A is y_i * (1, x_i)
b0 = rep(1, n)

# solve the quadratic program problem
result = solve.QP(Dmat = D, dvec = d, Amat = A, bvec = b0)$solution
beta = result[2:3]
beta0 = result[1]

# plot the decision line and the support vectors
plot(x,col=ifelse(y>0,"darkorange", "deepskyblue"), pch = 19, xlab = "x1", ylab = "x2")
points(x[sweep(x %*% beta + beta0, 1, y, FUN = "*") <= 1+1e-5, ], pch = 1, cex = 2)
legend("topleft", c("Positive","Negative", "support vector"),
       col=c("darkorange", "deepskyblue", "black"), pch=c(19, 19, 1), 
       text.col=c("darkorange", "deepskyblue", "black"))
abline(-beta0/beta[2], -beta[1]/beta[2])
abline(-(beta0 - 1)/beta[2], -beta[1]/beta[2], lty=3)
abline(-(beta0 + 1)/beta[2], -beta[1]/beta[2], lty=3)

b) [25 points] The Dual Form

Formulate the SVM dual problem on page 21 of the lecture note. The dual problem is

\[ \begin{align} \underset{\boldsymbol \alpha}{\text{maximize}} & \quad \sum_{i=1}^{n} \alpha_{i}-\frac{1}{2} \sum_{i, j=1}^{n} y_{i} y_{j} \alpha_{i} \alpha_{j} x_{i}^{\top} x_{j} \\ \text{subject to} & \quad \alpha_{i} \geq 0, \,\, \text{for} \,\, i=1, \ldots, n \\ \text{and} & \quad \sum_{i=1}^{n} \alpha_{i} y_{i}=0 \end{align} \]

Perform the following:

  • Let \(b = (\alpha_1, \ldots, \alpha_n)^T\). Then properly define \(\mathbf{D}\), \(d\), \(\mathbf{A}\) and \(b_0\) corresponding to this \(b\) for our SVM problem.
  • Note: Equality constraints can be addressed using the meq argument.
  • Obtain the solution using the solve.QP() function, and convert the solution into \(\boldsymbol \beta\) and \(\beta_0\).

You need to report

  • A table including \((\beta_0, \beta_1, \beta_2)\) from both Q1a and Q1b. Keep only the first three digits after the decimal point.
  • Plot the decision line on top of our scatter plot. Include the two margin lines. Clearly mark the support vectors.
  • Report the \(\ell_1\) norm of \(\beta_{Q1a} - \beta_{Q1b}\), where \(\beta_{Q1a}\) and \(\beta_{Q1b}\) are the 3-dimensional solutions obtained in Q1a and Q1b, respectively.

Note: Again, \(\mathbf{D}\) may not be positive definite. This time, add \(10^{-10}\) to all diagonal elements of \(\mathbf{D}\). This may affect your results slightly, so be careful when plotting your support vectors.

Answer:

To match the input form of solve.QP(), which performs minimization, it is equivalent to consider

\[ \underset{\boldsymbol \alpha}{\text{minimize}} \quad \frac{1}{2} \sum_{i, j=1}^{n} y_{i} y_{j} \alpha_{i} \alpha_{j} x_{i}^{\top} x_{j} - \sum_{i=1}^{n} \alpha_{i}. \] Now \(d = \mathbf{1}_{n}\), \(D_{ij} = y_i y_j x_i^{\top} x_j\), and \(\mathbf{A}^{\top}\) is an \((n+1) \times n\) matrix.

  • We put the equality constraint \(\sum_{i=1}^{n} \alpha_i y_i = 0\) in the first row of \(A^{\top}\) and set meq = 1.
  • A small ridge (\(10^{-10} I\)) is added to \(D\) to make it positive definite.
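Explicitly, the constraint inputs are

\[ \mathbf{A}^\top = \begin{pmatrix} y^\top \\ \mathbf{I}_n \end{pmatrix} \in \mathbb{R}^{(n+1) \times n}, \quad b_0 = \mathbf{0}_{n+1}, \]

with meq = 1 so that the first row, \(\sum_{i=1}^{n} \alpha_i y_i = 0\), is treated as an equality and the remaining \(n\) rows enforce \(\alpha_i \geq 0\). After solving, \(\boldsymbol\beta = \sum_{i=1}^{n} \alpha_i y_i x_i\), and \(\beta_0\) is recovered from the support vectors.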
# calculate D,  d, A, b0
D = x %*% t(x) * y %*% t(y)
D = D + diag(1e-10, n) # add ridge
d = rep(1, n)  
A = t(rbind(t(y), diag(1, n)))
b0 = rep(0, n+1)

# solve the quadratic program
alpha = solve.QP(D, d, A, b0, meq = 1)$solution

# beta in Q1a
beta_1a = c(beta0, beta)  

# get beta for 1b: beta = sum_i alpha_i * y_i * x_i
beta = t(x) %*% (alpha * y)
# place beta0 so the boundary lies midway between the closest point of each class
beta0 = -(max((x %*% beta)[y == -1]) + min((x %*% beta)[y == 1]))/2
beta_1b = c(beta0, beta)

beta_tab = rbind(beta_1a, beta_1b)
rownames(beta_tab) = c("Q1a", "Q1b")
colnames(beta_tab) = paste0("beta", 0:2)
knitr::kable(beta_tab, digits = 3, 
             caption = "beta obtained from Q1a and Q1b")
beta obtained from Q1a and Q1b

        beta0    beta1    beta2
Q1a     3.419   -1.046   -0.999
Q1b     3.419   -1.046   -0.999
# plot the decision line and the support vectors
plot(x,col=ifelse(y>0,"darkorange", "deepskyblue"), pch = 19, 
     xlab = "x1", ylab = "x2")

# find the points with large enough (significantly non-zero) alpha
points(x[alpha > 0.1,], pch = 1, cex = 2)
legend("topleft", c("Positive","Negative", "support vector"),
       col=c("darkorange", "deepskyblue", "black"), pch=c(19, 19, 1), 
       text.col=c("darkorange", "deepskyblue", "black"))

abline(-beta0/beta[2], -beta[1]/beta[2])
abline(-(beta0 - 1)/beta[2], -beta[1]/beta[2], lty=3)
abline(-(beta0 + 1)/beta[2], -beta[1]/beta[2], lty=3)

# L1 norm of beta in a and b
sum(abs(beta_tab[1, ] - beta_tab[2, ]))
## [1] 6.353886e-05

Question 2 [20 Points] Linearly nonseparable SVM

In this question, we will follow the formulation on page 30 of the lecture note to solve a linearly nonseparable SVM. The dual problem is given by

\[ \begin{align} \underset{\boldsymbol \alpha}{\text{maximize}} & \quad \sum_{i=1}^{n} \alpha_{i}-\frac{1}{2} \sum_{i, j=1}^{n} y_{i} y_{j} \alpha_{i} \alpha_{j} x_{i}^{\top} x_{j} \\ \text{subject to} & \quad 0 \leq \alpha_{i} \leq C, \,\, \text{for} \,\, i=1, \ldots, n \\ \text{and} & \quad \sum_{i=1}^{n} \alpha_{i} y_{i}=0 \end{align} \]

Perform the following and report:

  • The training error
  • The fitted \(\boldsymbol \beta\) and \(\beta_0\)
  • A plot of the decision line on top of the scatter plot of the data, including the two margin lines, with the support vectors clearly marked

# set.seed(20)
# n = 200 # number of data points for each class
# p = 2 # dimension

# Generate the positive and negative examples
# xpos <- matrix(rnorm(n*p,mean=0,sd=1),n,p)
# xneg <- matrix(rnorm(n*p,mean=1.5,sd=1),n,p)
# x <- rbind(xpos,xneg)
# y <- c(rep(-1, n), rep(1, n))
# train = data.frame(x1 = x[, 1], x2 = x[, 2], y = y)
# write.csv(train, "SVM-Q2.csv", row.names = FALSE)

train = read.csv("SVM-Q2.csv")
x = as.matrix(train[, 1:2])
y = train[, 3]

plot(x, col=ifelse(y>0,"darkorange", "deepskyblue"), pch = 19, xlab = "x1", ylab = "x2")
legend("topleft", c("Positive","Negative"), col=c("darkorange", "deepskyblue"), 
       pch=c(19, 19), text.col=c("darkorange", "deepskyblue"))

Answer:
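The setup mirrors Q1b; the only change is the box constraint. Splitting \(0 \leq \alpha_i \leq C\) into \(\alpha_i \geq 0\) and \(-\alpha_i \geq -C\) gives

\[ \mathbf{A}^\top = \begin{pmatrix} y^\top \\ \mathbf{I}_n \\ -\mathbf{I}_n \end{pmatrix} \in \mathbb{R}^{(2n+1) \times n}, \quad b_0 = \big(0, \; \mathbf{0}_n^\top, \; -C \, \mathbf{1}_n^\top \big)^\top, \]

again with meq = 1 for the equality constraint in the first row. We take \(C = 1\) below.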

# calculate D,  d, A, b0
n = length(y)
D = x %*% t(x) * y %*% t(y) + diag(1e-10, n)
d = rep(1, n)
A = t(rbind(t(y), diag(1, n), diag(-1, n)))
C = 1
b0 = c(rep(0, n+1), rep(-C, n))

# solve the quadratic program
alpha = solve.QP(D, d, A, b0, meq = 1)$solution

# thresholding alpha for numerical errors 
alpha[alpha < 1e-5] = 0
alpha[alpha > 1 - 1e-5] = 1

# get index for support vectors
support_vec = which(alpha > 1e-2 & alpha < C - 1e-2)

# plot the data
plot(x,col=ifelse(y>0,"darkorange", "deepskyblue"), pch = 19, 
     xlab = "x1", ylab = "x2")

# plot support vectors
points(x[support_vec, ], pch = 1, cex = 2)
legend("topleft", c("Positive","Negative", "support vector"),
       col=c("darkorange", "deepskyblue", "black"), pch=c(19, 19, 1), 
       text.col=c("darkorange", "deepskyblue", "black"))

beta = t(x) %*% (alpha * y)
beta0 = -(mean((x[support_vec, ] %*% beta)[y[support_vec] == -1]) + mean((x[support_vec, ] %*% beta)[y[support_vec] == 1]))/2

abline(-beta0/beta[2], -beta[1]/beta[2])
abline(-(beta0 - 1)/beta[2], -beta[1]/beta[2], lty=3)
abline(-(beta0 + 1)/beta[2], -beta[1]/beta[2], lty=3)

print(c(beta0, beta))
## [1] -1.716482  1.151736  1.111614
# training error
train_err = mean((beta0 + x %*% beta) * y < 0)
print(train_err)
## [1] 0.145

The fitted coefficients are \(\beta_0 = -1.716\) and \(\boldsymbol\beta = (1.152, 1.112)\), and the training error is 0.145.

Question 3 [30 Points] Penalized Loss Linear SVM

We can also perform linear and nonlinear classification using the penalized loss framework. In this question, we will only use the linear version. Use the same dataset in Question 2. Consider the following logistic loss function:

\[L(y, f(x)) = \log(1 + e^{- y f(x)}).\] The rest of the job is to solve this optimization problem given the functional form of \(f(x)\). To do this, we will utilize a general-purpose optimization package/function. For example, in R, you can use the optim() function. Read the documentation of this function (or an equivalent one in Python) and set up the objective function properly to solve for the parameters. If you need an example of how to use the optim() function, read the corresponding part in the example file provided on our course website here (Section 10).
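If optim() is new to you, here is a minimal toy sketch (a made-up quadratic objective, not the SVM loss) showing the interface: an initial value, the objective, an optional gradient, and a method.

# toy example: minimize f(b) = (b1 - 1)^2 + (b2 + 2)^2 with BFGS
toy_fn = function(b) (b[1] - 1)^2 + (b[2] + 2)^2
toy_gr = function(b) c(2 * (b[1] - 1), 2 * (b[2] + 2))
optim(par = c(0, 0), fn = toy_fn, gr = toy_gr, method = "BFGS")$par
# should converge to approximately c(1, -2)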

If we let \(f(x)\) be a linear function, the SVM can be solved by optimizing the penalized loss: \[ \underset{\beta_0, \boldsymbol\beta}{\arg\min} \quad \sum_{i=1}^n L(y_i, \beta_0 + x_i^T \boldsymbol\beta) + \lambda \lVert \boldsymbol\beta \rVert^2\] You should use the data from Question 2.

Report the following:

Answer:

With \(L(y, f(x)) = \log(1 + e^{- y f(x)})\) and \(f(x) = x^T \alpha\) (to reduce notation, we combine \(\beta_0\) and \(\boldsymbol\beta\) into a single vector \(\alpha\) and prepend a 1 to each \(x_i\)), the gradient of \(L\) with respect to \(\alpha\) is \[ \frac{\partial L}{\partial \alpha} = \frac{\partial L}{\partial f} \frac{\partial f}{\partial \alpha} = \frac{-y \exp[-y f(x)]}{1 + \exp[-y f(x)]}\, x, \] where \[ \frac{\partial L}{\partial f} = \frac{-y \exp[-y f(x)]}{1 + \exp[-y f(x)]}, \qquad \frac{\partial f}{\partial \alpha} = x. \]

Therefore, the gradient of the penalized objective is \(\left[\sum_{i = 1}^n \frac{\partial L_i}{\partial \alpha}\right] + 2 \lambda (0, \boldsymbol\beta^T)^T\). We pick \(\lambda = 1\) (any other reasonable \(\lambda\) is also fine).

# loss function 
svmfn <- function(b, x, y, lambda){
  sum(log(1 + exp(- (x %*% b) * y))) + lambda * sum(b[-1]^2)
}

# gradient of loss function
svmgn <- function(b, x, y, lambda){
  e1 = exp(-y*(x %*% b))
  e2 = y*e1 / (1+e1)
  return( - t(x) %*% e2 + c(0, 2*lambda*b[-1]) )
}

mysvmfit <- optim(par = rep(0, 3), fn = svmfn, gr = svmgn,
            x = cbind(1, x), y = y,
            lambda = 1, method = "BFGS")

beta0 = mysvmfit$par[1]
beta = mysvmfit$par[2:3]

print(c(beta0, beta))
## [1] -2.280224  1.575649  1.509805
plot(x, pch = 19, xlab = "x1", ylab = "x2",
     col = ifelse(y>0,"darkorange", "deepskyblue"))
legend("topleft", c("Positive","Negative"), 
       col=c("darkorange", "deepskyblue"), 
       pch=c(19, 19), text.col=c("darkorange", "deepskyblue"))
  
abline(a = -beta0/beta[2], b = -beta[1]/beta[2], col = "black", lty = 1, lwd = 2)

Training error and loss

# training err and loss
train_err = 1 - mean( y * (x %*% beta + beta0) > 0)
train_loss = mysvmfit$value
print(c(train_err, train_loss))
## [1]   0.1475 129.8600

The fitted coefficients are \(\beta_0 = -2.280\) and \(\boldsymbol\beta = (1.576, 1.510)\). The training error is 0.1475 and the training loss is 129.86.