Students are encouraged to work together on homework. However, sharing, copying, or providing any part of a homework solution or code is an infraction of the University’s rules on Academic Integrity. Any violation will be punished as severely as possible. Final submissions must be uploaded to compass2g. No email or hardcopy will be accepted. For late submission policy and grading rubrics, please refer to the course website.
What is expected for the submission to Gradescope

Name your file `HWx_yourNetID.pdf`. For example, `HW01_rqzhu.pdf`. Please note that this must be a `.pdf` file generated by a `.Rmd` file; `.html` format cannot be accepted. Your homework file should be a PDF report instead of a messy collection of R code. This report should include:

- Your name and NetID (replace `Ruoqing Zhu (rqzhu)` by your name and NetID if you are using this template).
- All of your R code chunks, visible for grading.
- Output from the R code chunks that supports your answers.
- Your answers in text, e.g., "Answer: I fit SVM with the following choice of tuning parameters ..."
Requirements regarding the `.Rmd` file

Your PDF file should be rendered directly from the `.Rmd` file.

Ozone data from the `mlbench` package

We will fit and compare different spline models to the `Ozone` dataset from the `mlbench` package. We have pre-processed this dataset; please use the provided train/test csv files. The dataset has three variables: `time`, `ozone`, and `wind`. For the spline question, we will only use `time` as the covariate and `ozone` as the outcome.
# read in the pre-processed training and testing data
train = read.csv("ozoneTrain.csv")
test = read.csv("ozoneTest.csv")
# scatter plots of ozone vs. time and ozone vs. (jittered) wind
par(mfrow = c(1, 2))
plot(train$time, train$ozone, pch = 19, cex = 0.5)
plot(train$wind + runif(nrow(train), -0.15, 0.15),
     train$ozone, pch = 19, cex = 0.5)
Let's consider several different spline constructions to model the `ozone` level using `time`. Please read the requirements carefully, since some of them require you to write your own code.

- To test your model, use the train/test split provided above.
- Use the mean squared error as the metric for evaluation and report the MSE for each method in a single summary table.
- For the spline bases that you write with your own code, make sure to include the intercept term.
- For questions a), b), and d), provide the following summary information at the end of your answer:
  - A table of MSE on the testing data. Please label each method clearly.
  - Three figures (you could consider making them side-by-side for a better comparison) at the end. Each figure consists of scatter plots of both the training and testing data points and the fitted curve on the range `x = seq(0, 1, by = 0.01)`.
a) [10 pts] Write your own code (you cannot use `bs()` or similar functions) to implement a continuous piecewise linear fitting. Pick 4 knots at \((0.2, 0.4, 0.6, 0.8)\).

b) [10 pts] Write your own code to implement a quadratic spline fitting. Your spline should be continuous up to the first derivative. Pick 4 knots at \((0.2, 0.4, 0.6, 0.8)\).

c) [10 pts] Produce a set of B-spline bases with the same knots and degree as b) using the `bs()` function. Note that they do not have to be exactly the same as yours. Compare the design matrices from b) and c) as follows:

d) [10 pts] Use existing functions to implement a smoothing spline. Use the built-in generalized cross-validation method to select the best tuning parameter.
Answer:
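For reference, the code below constructs truncated power bases with knots \(\kappa_1, \ldots, \kappa_4 = 0.2, 0.4, 0.6, 0.8\). Writing \((u)_+ = \max(u, 0)\), the continuous piecewise linear basis for part a) and the quadratic spline basis (continuous up to the first derivative) for part b) are

\[
h_{\text{lin}}(x) = \big(1,\ x,\ (x-\kappa_1)_+,\ \ldots,\ (x-\kappa_4)_+\big),
\qquad
h_{\text{quad}}(x) = \big(1,\ x,\ x^2,\ (x-\kappa_1)_+^2,\ \ldots,\ (x-\kappa_4)_+^2\big),
\]

and each fit is an ordinary least squares regression of `ozone` on these basis columns. The functions `linear.basis()` and `quad.basis()` below build exactly these design matrices.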
library(splines)
# rectified linear function
pos <- function(x){
  x * (x > 0)
}
# piecewise linear spline basis
linear.basis <- function(x, knots){
  X.mat <- matrix(NA, nrow = length(x), ncol = 2 + length(knots))
  X.mat[, 1] <- 1
  X.mat[, 2] <- x
  for(j in 1:length(knots)){
    X.mat[, j+2] <- pos(x - knots[j])
  }
  colnames(X.mat) <- paste("x", 1:ncol(X.mat), sep = "")
  return(X.mat)
}
# quadratic spline basis
quad.basis <- function(x, knots){
  X.mat <- matrix(NA, nrow = length(x), ncol = 3 + length(knots))
  X.mat[, 1] <- 1
  X.mat[, 2] <- x
  X.mat[, 3] <- x^2
  for(j in 1:length(knots)){
    X.mat[, j+3] <- pos(x - knots[j])^2
  }
  colnames(X.mat) <- paste("x", 1:ncol(X.mat), sep = "")
  return(X.mat)
}
# (i) continuous piece-wise linear
plot.grid = seq(0, 1, by = 0.01)
my.knots = c(0.2, 0.4, 0.6, 0.8)
# least squares fit on the self-constructed basis; drop lm's intercept since the basis already includes one
fit.linear = lm(y ~ .-1, data = data.frame("y" = train$ozone, linear.basis(train$time, my.knots)))
pred.linear = predict(fit.linear, data.frame(linear.basis(test$time, my.knots)))
plot.linear = predict(fit.linear, data.frame(linear.basis(plot.grid, my.knots)))
mse.linear = mean((pred.linear - test$ozone)^2)
# (ii) quadratic spline
fit.quad <- lm(y ~ .-1, data = data.frame("y" = train$ozone, quad.basis(train$time, my.knots)))
pred.quad <- predict(fit.quad, data.frame(quad.basis(test$time, my.knots)))
plot.quad = predict(fit.quad, data.frame(quad.basis(plot.grid, my.knots)))
# testing MSE
mse.quad = mean((pred.quad - test$ozone)^2)
# (iii) compare the quadratic spline basis with the B-spline basis from bs()
mybase = quad.basis(train$time, my.knots)
bsbase = bs(x = train$time, knots = my.knots, intercept = TRUE, degree = 2)
# projection (hat) matrices of the two design matrices
H1 = mybase %*% solve(t(mybase) %*% mybase) %*% t(mybase)
H2 = bsbase %*% solve(t(bsbase) %*% bsbase) %*% t(bsbase)
# maximum entrywise difference between the two hat matrices
max(abs(H1 - H2))
## [1] 3.068074e-12
kappa(mybase)  # condition number of the self-written quadratic spline design matrix
## [1] 859.6264
kappa(bsbase)  # condition number of the B-spline design matrix
## [1] 3.860331
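A brief note on what this comparison shows: two design matrices \(X_1\) and \(X_2\) with the same column space yield identical fitted values, because the projection (hat) matrix

\[
H = X (X^\top X)^{-1} X^\top
\]

depends only on the column space of \(X\). The maximum entrywise difference between \(H_1\) and \(H_2\) above is at the level of numerical rounding error, so the two bases span the same space, while the much smaller condition number of the B-spline design matrix shows that it is numerically better conditioned than the truncated power basis.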
# by default cv = FALSE, so smooth.spline() selects the smoothing parameter by generalized cross-validation (GCV)
fit.ss <- smooth.spline(train$time, train$ozone)
pred.ss <- predict(fit.ss, test$time)
plot.ss = predict(fit.ss, plot.grid)$y
fit.ss$df  # effective degrees of freedom selected by GCV
## [1] 67.33415
mse.ss <- mean((pred.ss$y - test$ozone)^2)
print(mse.ss)
## [1] 23.71095
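As a reminder of the standard formulation behind `smooth.spline()` (not specific to this solution): the smoothing spline estimate minimizes

\[
\sum_{i=1}^n \big(y_i - f(x_i)\big)^2 + \lambda \int f''(t)^2 \, dt,
\]

where the penalty \(\lambda\), or equivalently the effective degrees of freedom reported above, is selected here by generalized cross-validation.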
mse.table = matrix(c(mse.linear, mse.quad, mse.ss), nrow = 1)
methods = c("linear", "quadratic", "smoothing spline")
colnames(mse.table) = methods
rownames(mse.table) = "Testing MSE"
knitr::kable(mse.table, digits = 3)
|             | linear | quadratic | smoothing spline |
|-------------|--------|-----------|------------------|
| Testing MSE | 31.086 | 30.098    | 23.711           |
# scatter plots of training (blue) and testing (orange) data with each fitted curve
pred.mat = cbind(plot.linear, plot.quad, plot.ss)
par(mfrow = c(1, 3))
for (i in 1:3){
  plot(train$time, train$ozone, col = "deepskyblue", pch = 19, main = methods[i])
  points(test$time, test$ozone, col = "darkorange", pch = 19)
  lines(plot.grid, pred.mat[, i], col = "red", lwd = 3)
}
We will use the same ozone data. For Question a), we only use `time` as the covariate, while in Question b), we use both `time` and `wind`.
You are required to implement (write your own code) two kernel regression models, using the Gaussian kernel and the Epanechnikov kernel, respectively.
For both kernel functions, incorporate a bandwidth \(\lambda\). You should start with Silverman's rule of thumb for the choice of \(\lambda\), and then tune \(\lambda\) (for example, increase or decrease it by 10%, 20%, 30%, etc.). Then, perform the following:
We consider using both `time` and `wind` in the regression. We use the following multivariate kernel function, which is essentially a Gaussian kernel with a diagonal covariance matrix. You can also view this as the product of two kernel functions, one for each variable:

\[ K_{\boldsymbol \lambda}(x, z) \propto \exp\Big\{ -\frac{1}{2} \sum_{j=1}^p \big( (x_j - z_j)/\lambda_j \big)^2 \Big\}\]

Based on Silverman's formula, the bandwidth for the \(j\)th variable is given by

\[\lambda_j = \left(\frac{4}{p+2}\right)^{\frac{1}{p+4}} n^{-\frac{1}{p+4}} \, \widehat \sigma_j,\]

where \(\widehat\sigma_j\) is the estimated standard deviation of variable \(j\). Use the Nadaraya-Watson kernel estimator to fit and predict the `ozone` level using `time` and `wind`. At the end, report the testing error.
In our lecture, we only introduced the one-dimensional case for density estimation. For a regression problem, the rate is essentially the same. However, when we have multivariate kernels, the rates of the bias and variance will change. If we use the same \(\lambda\) from the one-dimensional case in a two-dimensional problem, would you expect the bias to increase, decrease, or remain unchanged? Would you expect the variance to increase, decrease, or remain unchanged? And why? Hint: for the same \(\lambda\), the bias is determined by how far your neighboring points are, and the variance is determined by how many points you capture within that neighborhood.
a) I
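As a quick reference for the code below: the Nadaraya-Watson estimator is

\[
\widehat f(x) = \frac{\sum_{i=1}^n K_\lambda(x, x_i)\, y_i}{\sum_{i=1}^n K_\lambda(x, x_i)},
\]

where \(K_\lambda\) is either the Gaussian kernel \(K_\lambda(x, x_i) \propto \exp\big\{-\frac{1}{2}\big((x - x_i)/\lambda\big)^2\big\}\) or the Epanechnikov kernel \(K_\lambda(x, x_i) = \frac{3}{4}\big(1 - ((x - x_i)/\lambda)^2\big)_+\). The starting bandwidth is Silverman's rule of thumb, \(\lambda = 1.06\, \widehat\sigma\, n^{-1/5}\).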
# univariate Gaussian kernel (the normalizing constant cancels in the NW estimator)
GaussianKernel_Uni <- function(x, y, bw){
  exp(- ((x - y)/bw)^2 / 2) / sqrt(2*pi)
}
# univariate Epanechnikov kernel
EpanechnikovKernel_Uni <- function(x, y, bw){
  u = 1 - ((x - y)/bw)^2
  return(0.75 * u * (u > 0))
}
# Nadaraya-Watson kernel regression: prediction is a kernel-weighted average of the training responses
KernelRegUni <- function(x, y, testx, K = "Gaussian", bw){
  pred = rep(NA, length(testx))
  for (i in 1:length(testx)){
    if (K == "Gaussian")
      w = GaussianKernel_Uni(testx[i], x, bw)
    if (K == "Epanechnikov")
      w = EpanechnikovKernel_Uni(testx[i], x, bw)
    pred[i] = sum(y * w) / sum(w)
  }
  return(pred)
}
# Silverman's rule-of-thumb bandwidth
h = 1.06 * sd(train$time) * nrow(train)^(-1/5)
print(h)
## [1] 0.1015303
# Fit and plot the regression lines
Pred_G = KernelRegUni(train$time, train$ozone, testx = test$time, bw = h)
Pred_E = KernelRegUni(train$time, train$ozone, testx = test$time,
K = "Epanechnikov", bw = h)
plot(train$time, train$ozone, col = "deepskyblue", pch = 19)
points(test$time, test$ozone, col = "darkorange", pch = 19)
lines(test$time, Pred_G, col = "red", lwd = 3)
lines(test$time, Pred_E, col = "green", lwd = 3)
legend("topleft", legend = c("Gaussian", "Epanechnikov"),
lty = 1, col = c("red", "green"))
mse_E = mean((Pred_E - test$ozone)^2)
mse_G = mean((Pred_G - test$ozone)^2)
mse.table = matrix(c(mse_G, mse_E), nrow = 1)
colnames(mse.table) = c("Gaussian", "Epanechnikov")
rownames(mse.table) = "Testing MSE"
knitr::kable(mse.table, digits = 3)
|             | Gaussian | Epanechnikov |
|-------------|----------|--------------|
| Testing MSE | 30.076   | 29.874       |
a) II
# candidate bandwidths: multiples of the rule-of-thumb value h
h.cv = h * c(0.1, 0.2, 0.3, 0.4, 0.5, 1, 1.5, 2, 3, 5)
mse.cv = rep(0, 10)
for (i in 1:10){
  Pred_E = KernelRegUni(train$time, train$ozone, testx = test$time,
                        K = "Epanechnikov", bw = h.cv[i])
  mse.cv[i] = mean((Pred_E - test$ozone)^2)
}
names(mse.cv) = paste(c(0.1, 0.2, 0.3, 0.4, 0.5, 1, 1.5, 2, 3, 5), "h", sep = "")
print(mse.cv)
## 0.1h 0.2h 0.3h 0.4h 0.5h 1h 1.5h 2h 3h 5h
## NaN 25.45906 26.68569 27.82027 28.40500 29.87374 29.86411 29.79106 32.35193 43.51064
# the bandwidth that minimizes the testing MSE, and its MSE
idx = which.min(mse.cv)
print(c(h.cv[idx], mse.cv[idx]))
## 0.2h
## 0.02030607 25.45905873
Pred_E = KernelRegUni(train$time, train$ozone, testx = test$time,
K = "Epanechnikov", bw = h.cv[which.min(mse.cv)])
plot(train$time, train$ozone, col = "deepskyblue", pch = 19, main = "The curve with optimal bandwidth")
points(test$time, test$ozone, col = "darkorange", pch = 19)
lines(test$time, Pred_E, col = "black", lwd = 3)
Among our candidate values, the optimal bandwidth is \(0.2 \lambda = 0.0203\), where \(\lambda\) is the rule-of-thumb value. Its testing MSE is 25.45906. If we choose a bandwidth that is too small, the neighborhood of some testing samples may contain no training samples, which results in the `NaN` entries in the MSE.

Remark: the curve is actually under-smoothed at this "optimal" bandwidth. The optimal bandwidth can vary depending on the candidate set, so we accept any reasonable tuning parameter and MSE.
b)
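The estimator below is again Nadaraya-Watson, now with the product (diagonal-covariance) Gaussian kernel from the problem statement:

\[
\widehat f(x) = \frac{\sum_{i=1}^n K_{\boldsymbol\lambda}(x, x_i)\, y_i}{\sum_{i=1}^n K_{\boldsymbol\lambda}(x, x_i)},
\qquad
K_{\boldsymbol\lambda}(x, z) \propto \prod_{j=1}^{p} \exp\Big\{ -\frac{1}{2} \big( (x_j - z_j)/\lambda_j \big)^2 \Big\},
\]

with one bandwidth \(\lambda_j\) per covariate, computed from Silverman's multivariate formula above.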
n = nrow(train)
p = 2
lambda.1 = (4 / (p+2))^(1/(p+4)) * n^(-1/(p+4)) * sd(train$time)
lambda.2 = (4 / (p+2))^(1/(p+4)) * n^(-1/(p+4)) * sd(train$wind)
# bandwidth
lambda = c(lambda.1, lambda.2)
print(lambda)
## [1] 0.1150768 0.8821737
# univariate Gaussian kernel component (normalizing constant omitted; it cancels in the NW estimator)
GaussianKernel_2 <- function(x, y, bw){
  exp(- ((x - y)/bw)^2 / 2)
}
# Nadaraya-Watson regression with a product Gaussian kernel (one bandwidth per covariate)
KernelReg_2 <- function(x, y, testx, bw){
  pred = rep(NA, nrow(testx))
  p = length(bw)
  for (i in 1:nrow(testx)){
    w_mat = matrix(0, nrow = nrow(x), ncol = p)
    for(j in 1:p){
      w_mat[, j] = GaussianKernel_2(testx[i, j], x[, j], bw[j])
    }
    # product over the covariates gives the multivariate kernel weight
    w <- apply(w_mat, 1, prod)
    pred[i] = sum(y * w) / sum(w)
  }
  return(pred)
}
Pred_G_multi = KernelReg_2(cbind(train$time, train$wind), train$ozone,
testx = cbind(test$time, test$wind), bw = lambda)
mse <- mean((Pred_G_multi- test$ozone)^2)
print(mse)
## [1] 32.38576
We choose \(\boldsymbol\lambda = (0.1150768, 0.8821737)\). The MSE on the testing set is 32.3857569.

Remark: compared with the rule-of-thumb models in Question a) (Epanechnikov: 29.8737358; Gaussian: 30.0758595), the MSE here is not better than before.
c)
First, we would expect the rate of the bias to remain unchanged, since a point that is \(\lambda\) away from the target point in both covariates enjoys the same rate of bias as in the one-dimensional case. This can be shown by a two-dimensional Taylor expansion of the density function.

As for the variance, using the same \(\lambda\) means the neighborhood captures a volume of order \(\lambda^2\) rather than \(\lambda\), so fewer points are available for averaging and the variance decays at a slower rate (i.e., the variance increases). In general, for \(p\) variables, we have the following results:
Suppose \(x_1,\ldots,x_n\in\mathbb{R}^p\) are i.i.d. data with density \(f(x)\). Denote \(f_x(x^*)=\lambda^{-p}K(\frac{x^*-x}{\lambda})\) and the kernel method uses \(\hat f(x^*)=\frac{1}{n}\lambda^{-p}\sum_{i=1}^nK(\frac{x^*-x_i}{\lambda})=\frac{1}{n}\sum_{i=1}^nf_{x_i}(x^*)\) to estimate \(f(x^*)\).
For the bias term, if we redo the Taylor expansion for a \(p\)-dimensional function, we have \[ E_{x\sim f(x)}[ f_x(x^*)]=f(x^*)+\frac{\lambda^2}{2}\int_{\mathbb{R}^p}K(y)y^T\nabla^2f(x^*)ydy+o(\lambda^2). \] For the variance, we have \[ \text{Var}(f_x(x^*))=E[f_x(x^*)^2]-E^2[f_x(x^*)]=E[f_x(x^*)^2]+O(1). \] Note that \[ \begin{aligned} E[f^2_x(x^*)]=&\lambda^{-2p}\int_{\mathbb{R}^p}K^2(\frac{x^*-x}{\lambda})f(x)dx\\ =&\lambda^{-p}\int_{\mathbb{R}^p}K^2(y)f(x^*-\lambda y)dy\\ =&\lambda^{-p}\Big[f(x^*)\int_{\mathbb{R}^p}K^2(y)dy+O(\lambda)\Big]. \end{aligned} \] Since \(\lambda\rightarrow 0\), we have \(\text{Var}( f_x(x^*))=\lambda^{-p}f(x^*)\int_{\mathbb{R}^p}K^2(y)dy+O(\lambda^{1-p})\).
Since the \(x_i\) are i.i.d., we have \[ \begin{aligned} E[\hat f(x^*)]=&E[f_x(x^*)]=f(x^*)+\frac{\lambda^2}{2}\int_{\mathbb{R}^p}K(y)y^T\nabla^2f(x^*)ydy+o(\lambda^2),\\ \text{Var}(\hat f(x^*))=&\frac{1}{n} \text{Var} (f_x(x^*))=n^{-1}\lambda^{-p}\Big[f(x^*)\int_{\mathbb{R}^p}K^2(y)dy+O(\lambda)\Big]. \end{aligned} \] By matching the squared bias with the variance, we obtain the optimal rate \(\lambda=Cn^{-1/(p+4)}\).