Students are encouraged to work together on homework. However, sharing, copying, or providing any part of a homework solution or code is an infraction of the University’s rules on Academic Integrity. Any violation will be punished as severely as possible. Final submissions must be uploaded to Gradescope. No email or hard copy will be accepted. For late submission policy and grading rubrics, please refer to the course website.
Name your submission file HWx_yourNetID.pdf. For example, HW01_rqzhu.pdf. Please note that this must be a .pdf file; .html format cannot be accepted since it may not be readable in Gradescope. All proofs must be typed in LaTeX format. Make all of your R code chunks visible for grading. If you use the .Rmd file as a template, be sure to remove this instruction section. Please use R version \(\geq 4.0.0\). This will ensure your random seed generation is the same as everyone else's. Please note that updating the R version may require you to reinstall all of your packages. In the markdown file, set the seed properly so that the results can be reproduced.

We know that the kernel function
\[ k(x, y) = (x^T y)^2 \]
achieves the feature mapping that contains all second order interaction terms. Use this to show that the kernel function
\[ k(x, y) = (x^T y + c)^2, \]
where \(c > 0\) is a constant, achieves the feature mapping that contains all second order terms plus all first order terms and an intercept term.
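For intuition only (this is not part of the proof you are asked to write), the first identity can be checked numerically in R. The explicit feature map \(\phi(x) = (x_1^2, x_2^2, \sqrt{2}\, x_1 x_2)\) for \(x \in \mathbb{R}^2\) used below is my own illustration; it is one of several equivalent choices.

# quick numerical check of k(x, y) = (x^T y)^2 in two dimensions
x <- c(1, 2); y <- c(3, -1)
phi <- function(v) c(v[1]^2, v[2]^2, sqrt(2) * v[1] * v[2])   # explicit second-order feature map
sum(phi(x) * phi(y))   # inner product in the feature space
(sum(x * y))^2         # kernel value; the two numbers should match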
In a paper on Outcome Weighted Learning, Zhao et al. (2012) proposed the following objective function (see Equation 2.2 in the paper) to estimate optimal treatment regimes. This is a weighted support vector machine classification problem.
\[ \min_{f \in \mathcal{H}_K} \frac{1}{n} \sum_{i=1}^n w_i \big(1 - A_i f(x_i)\big)_+ + \lambda ||f||^2_K \]
where \(w_i\) is some observation weight, \(A_i \in \{-1, 1\}\) is a binary outcome, and \(\mathcal{H}_K\) is an RKHS with kernel function \(K\).
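As a side illustration (a minimal sketch of my own, assuming a linear decision function \(f(x) = x^T \beta + \beta_0\) with the intercept left unpenalized), the objective can be evaluated in R as follows:

# weighted hinge loss plus ridge penalty for a linear decision function
weighted_svm_obj <- function(beta, beta0, X, A, w, lambda) {
  f <- as.vector(X %*% beta) + beta0        # f(x_i) for every observation
  hinge <- pmax(1 - A * f, 0)               # (1 - A_i f(x_i))_+
  mean(w * hinge) + lambda * sum(beta^2)    # objective value
}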
[10 pts] Note that this corresponds to a penalized version of the SVM. Please provide the primal form of the Lagrangian for the corresponding SVM optimization problem with a linear kernel.
[20 pts] Please provide the dual form of the corresponding problem, and derive the final optimization problem based on the dual.
[30 pts] Take the mixture.example dataset in the ElemStatLearn package, and use the following code to define the weights:
library(ElemStatLearn)
data(mixture.example)
# the grid on x1 and x2
px1 = mixture.example$px1
px2 = mixture.example$px2
# plot the data and true decision boundary
prob <- mixture.example$prob
prob.bayes <- matrix(prob,
length(px1),
length(px2))
contour(px1, px2, prob.bayes, levels=0.5, lty=2,
labels="", xlab="x1",ylab="x2",
main="SVM with linear kernal", col = "red", lwd = 2)
points(mixture.example$x, col=ifelse(mixture.example$y==1, "darkorange", "deepskyblue"), pch = 19)
# define the new weights
w = rep(1, nrow(mixture.example$x))
w[mixture.example$x[,1] < 0.8 & mixture.example$x[,2] > 1 & mixture.example$y == 0] = 10
# plot the data and visualize the ones with large weights
plot(mixture.example$x, col = ifelse(mixture.example$y == 1, "darkorange", "deepskyblue"), cex = 0.5, pch = 19)
points(mixture.example$x[w == 10, ], col = "darkgreen", pch = 19)
Explore the quadprog package in R to solve the dual problem you derived in part (b). For this question, change the inner product \(x_i^T x_j\) to a polynomial kernel with degree 4, and make sure to add an intercept term before applying the kernel function. The solve.QP() function in this package is specified as solve.QP(Dmat, dvec, Amat, bvec, meq = 0), where
- Dmat is the quadratic coefficient matrix. The matrix must be positive definite; you can add a small value (e.g., 1e-5) to the diagonal of \(D\) to avoid numerical errors.
- dvec is the linear term vector. The optimization problem is defined as \(\min \frac{1}{2} {\boldsymbol \alpha}^T D {\boldsymbol \alpha} - d^T \boldsymbol \alpha\).
- Amat and bvec define the constraints such that \(A^T \alpha \geq b\). Note that the constraints are defined in terms of \(A^T\) instead of \(A\).
- meq is the number of equality constraints, so that the first meq constraints in \(A^T \alpha \geq b\) are treated as equality constraints. The rest are treated as one-sided inequality constraints as you specified.
For more details, you should read the documentation of the solve.QP() function.
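As a minimal illustration of this interface, here is a self-contained toy problem of my own (it is not the SVM dual), showing how Dmat, dvec, Amat, bvec, and meq fit together:

library(quadprog)
# toy problem: minimize 1/2 * (b1^2 + b2^2) - (b1 + 2*b2)
# subject to  b1 + b2 = 1 (equality), b1 >= 0, b2 >= 0
D <- diag(2)                   # quadratic coefficient matrix (positive definite)
d <- c(1, 2)                   # linear term vector
A <- cbind(c(1, 1), diag(2))   # each column of A is one constraint in A^T b >= b0
b0 <- c(1, 0, 0)
sol <- solve.QP(Dmat = D, dvec = d, Amat = A, bvec = b0, meq = 1)
sol$solution                   # optimal point, here (0, 1)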
Perform the following: solve the dual problem with the solve.QP() function, and visualize the fitted decision boundary. You may use the previous code chunk to visualize the fitted decision boundary by defining your own prob.bayes matrix; the contour() function will help you visualize the decision boundary. If you need a reference for what you are heading to, here is a plot of my fitted SVM:
Use a general-purpose optimization function in R (e.g., optim()) to directly solve the primal problem. In this case, what is your decision function? Visualize the decision boundary again. Is it the same as the one you obtained from the dual form? Please provide a brief discussion of your result.
The following code is for PCA of the handwritten digits dataset in the ElemStatLearn package.
library(ElemStatLearn)
data(zip.train)
# PCA
pca <- prcomp(zip.train[, -1], center=TRUE, scale.=FALSE)
# plot the first two PCs
plot(pca$x[,1], pca$x[,2], col=zip.train[,1]+1, pch=19, cex = 0.5,
xlab="First Principal Component", ylab="Second Principal Component")
The goal of this question is to implement kernel PCA by yourself. You can use any matrix decomposition functions in R (e.g., eigen(), svd(), etc.), but you should implement most of the main steps by yourself. Consider the following:
In your previous question, what are the two most difficult digits to separate? You can use the kernel mean embedding and maximum mean discrepancy (MMD) to test whether these two digits are from the same distribution.
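For reference (this is the standard plug-in estimator, not a formula specific to this assignment), with samples \(\{x_i\}_{i=1}^n\) from one digit, \(\{y_j\}_{j=1}^m\) from the other, and a kernel \(k\), the squared MMD can be estimated by
\[ \widehat{\mathrm{MMD}}^2 = \frac{1}{n^2} \sum_{i=1}^n \sum_{i'=1}^n k(x_i, x_{i'}) + \frac{1}{m^2} \sum_{j=1}^m \sum_{j'=1}^m k(y_j, y_{j'}) - \frac{2}{nm} \sum_{i=1}^n \sum_{j=1}^m k(x_i, y_j). \]
Comparing this estimate to its distribution under random permutations of the digit labels is one common way to turn it into a test.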