Please remove this section when submitting your homework.

Students are encouraged to work together on homework and/or utilize advanced AI tools. However, sharing, copying, or providing any part of a homework solution or code to others is an infraction of the University’s rules on Academic Integrity. Any violation will be punished as severely as possible. Final submissions must be uploaded to Gradescope. No email or hard copy will be accepted. For the late submission policy and grading rubrics, please refer to the course website.

Name your file HWx_yourNetID.pdf. For example, HW01_rqzhu.pdf. Please note that this must be a .pdf file; the .html format cannot be accepted. Make all of your R code chunks visible for grading. If you use the .Rmd file as a template, be sure to remove this instruction section. Make sure your version of R is \(\geq 4.0.0\). This will ensure your random seed generation is the same as everyone else's. Please note that updating the R version may require you to reinstall all of your packages.

In this question, you are required to write your own function for
kernel regression using the Nadaraya-Watson estimator. You are not
allowed to use any existing functions in R
to perform the
kernel regression. Let’s first generate our data.
# generate the training data
set.seed(432)
n = 3000
p = 2
x = matrix(rnorm(n*p), n, p)
y = x[, 1]^2 + rnorm(n)
# define testing data
x0 = c(1.5, rep(0, p-1))
x0
Your function should be in the form of GaussianKReg(x0, x, y, h), where x and y are the training data and h is the bandwidth. The kernel function used internally should be the
multivariate Gaussian kernel defined as follows:\[ K(x_0, x_i) = \exp\left( - \frac{ \lVert x_0 - x_i \rVert_2^2 }{2 h^2} \right) \]
where \(\lVert \cdot \rVert_2^2\) is
the notation for the squared Euclidean distance and \(p\) is the dimension of the data. You should then use this kernel function in the Nadaraya-Watson kernel estimator, defined in our lecture. Please make sure that your code works automatically for different values of the dimension \(p\) (see part d). To test your
function, use it on your training data with h = n^(-1/6), which is an optimal choice of bandwidth for \(p = 2\) under some conditions. Report the predicted value for x0 and the error, defined as
\[ \frac{1}{\text{nsim}} \sum_{i=1}^{\text{nsim}} \left( \hat{f}(x_0) - f(x_0) \right)^2. \]
Note that this error metric is similar to HW4, just in the non-parametric sense. The prediction should be reasonably close to the true value due to the large sample size.
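For reference, combining the Gaussian kernel above with the weighted-average form of the Nadaraya-Watson estimator from lecture gives

\[ \hat{f}(x_0) = \frac{\sum_{i=1}^{n} K(x_0, x_i) \, y_i}{\sum_{i=1}^{n} K(x_0, x_i)}, \]

i.e., a convex weighted average of the training responses, with weights that decay with the squared distance from \(x_0\).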
Answer:
Here is the code for the Gaussian kernel regression.
# Gaussian kernel regression function
GaussianKReg = function(x0, x, y, h){
  # x0: testing point
  # x: training data (matrix)
  # y: response vector
  # h: bandwidth
  d = rowSums(sweep(x, 2, x0, "-")^2)  # squared distances ||x0 - x_i||^2
  K = exp(-d/(2*h^2))                  # Gaussian kernel weights
  return(sum(K*y)/sum(K))              # Nadaraya-Watson weighted average
}
# define bandwidth
h = n^(-1/(p+4))
# test the function
pred = GaussianKReg(x0, x, y, h)
pred
## [1] 1.946987
# calculate the error
(pred - 1.5^2)^2
## [1] 0.09181712
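As a quick sanity check (an illustrative aside, not required by the question), note that the estimator is a convex weighted average of the responses: with a single training point it must return that point's response exactly, and in general the prediction must lie between min(y) and max(y). The function is repeated here so the chunk is self-contained:

```r
# self-contained copy of the estimator for the checks below
GaussianKReg = function(x0, x, y, h){
  d = rowSums(sweep(x, 2, x0, "-")^2)  # squared distances ||x0 - x_i||^2
  K = exp(-d/(2*h^2))                  # Gaussian kernel weights
  sum(K*y)/sum(K)                      # weighted average of responses
}
# one training point: the prediction equals its response
stopifnot(abs(GaussianKReg(c(0, 0), matrix(1, 1, 2), 5, 1) - 5) < 1e-12)
# the prediction always lies within the range of y
set.seed(1)
x = matrix(rnorm(20), 10, 2); y = rnorm(10)
pred = GaussianKReg(c(0, 0), x, y, 0.5)
stopifnot(pred >= min(y), pred <= max(y))
```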
Perform a simulation study (with nsim \(= 200\)), and then approximate the Bias\(^2\) and Variance based on the understandings we learned in class. Report the Bias\(^2\), Variance and prediction error for predicting x0. A rough idea can be summarized in the following (please relate this to our bias and variance simulation of KNN during last week's lecture):

- Create a vector of length nsim to record the prediction of each simulation.
- In each simulation, generate a new training set with sample size n \(= 100\), not \(3000\).
- Predict x0 using the kernel regression function you wrote in part a), and record the prediction.

Remember that the Bias\(^2\) and Variance are defined as follows:
\[ \text{Bias}^2 = \left[ \mathbb{E} \left[ \hat{f}(x_0) \right] - f(x_0) \right]^2 \]
\[ \text{Var} = \mathbb{E} \left[ \left( \hat{f}(x_0) - \mathbb{E} \left[ \hat{f}(x_0) \right] \right)^2 \right] \]
where \(\hat{f}(x_0)\) is your prediction for the testing point, and the expectation can be approximated by averaging over your simulation runs.
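As an illustrative aside (the numbers below are made up, not from the homework simulation), these quantities can be approximated directly from the vector of recorded predictions. With the population-style variance (dividing by nsim rather than nsim - 1) the decomposition error = Bias\(^2\) + Variance holds exactly; R's var() divides by nsim - 1, which is why the simulation numbers reported later add up only approximately:

```r
# illustrative decomposition with a made-up prediction vector
truth = 1.5^2                          # f(x0) in our setting
allpred = c(2.0, 2.3, 2.1, 2.5)        # pretend predictions from 4 runs
bias2 = (mean(allpred) - truth)^2
v = mean((allpred - mean(allpred))^2)  # variance with divisor n, not n - 1
err = mean((allpred - truth)^2)
stopifnot(abs(err - (bias2 + v)) < 1e-12)  # exact decomposition
```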
nsim = 200
# record the prediction from each simulation run
allpred = rep(NA, nsim)
for (i in 1:nsim)
{
  n = 100
  p = 2
  x = matrix(rnorm(n*p), n, p)
  y = x[, 1]^2 + rnorm(n)
  h = n^(-1/(p+4))
  allpred[i] = GaussianKReg(x0, x, y, h)
}
# calculate the bias^2, variance and prediction error
x0_bias2 = mean(allpred - 1.5^2)^2
x0_var = var(allpred)
x0_error = mean((allpred - 1.5^2)^2)
Hence, the bias\(^2\), variance and prediction error are 0.2997517, 0.1409493 and 0.4399962, respectively.
Now, repeat the previous simulation with a sequence of bandwidth values, h = n^(-1/6)*seq(0.1, 2, length.out = 50). You should then construct a matrix of size nsim by 50 to record the prediction of each \(h\) in each simulation run. After that, plot your bias\(^2\), variance and prediction error against the \(h\) values. What is your conclusion for the bias-variance trade-off?

# define the bandwidth
h = n^(-1/6)*seq(0.1, 2, length.out = 50)
# define the matrix to record the prediction
allpred = matrix(NA, nsim, length(h))
for (i in 1:nsim)
{
  n = 100
  p = 2
  x = matrix(rnorm(n*p), n, p)
  y = x[, 1]^2 + rnorm(n)
  for (j in seq_along(h))
  {
    allpred[i, j] = GaussianKReg(x0, x, y, h[j])
  }
}
# calculate the bias^2 and variance
x0_bias2 = apply(allpred, 2, function(x) mean(x - 1.5^2)^2)
x0_var = apply(allpred, 2, var)
x0_error = apply(allpred, 2, function(x) mean((x - 1.5^2)^2))
# plot the bias^2, variance and prediction error
plot(h, x0_error, type = "l", xlab = "h", ylab = "Prediction Error",
main = "Prediction Error vs. h", col = "red",
ylim = c(0, 1.5))
lines(h, x0_bias2, col = "blue")
lines(h, x0_var, col = "green")
legend("topright", legend = c("Bias^2", "Variance", "Prediction Error"), col = c("blue", "green", "red"), lty = 1)
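If a single number is wanted from the plot, the error-minimizing bandwidth on the grid can be read off with which.min(). A sketch with a stand-in U-shaped error curve (in practice the x0_error vector computed above would be used):

```r
# illustrative: locate the minimizing bandwidth on a grid
h = seq(0.05, 1, length.out = 50)   # stand-in bandwidth grid
x0_error = (h - 0.4)^2 + 0.1        # stand-in U-shaped error curve
h_best = h[which.min(x0_error)]
stopifnot(abs(h_best - 0.4) < 0.02)  # minimizer is near the curve's minimum
```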
Now, consider increasing the dimension \(p\) from 2 to 30, while fixing the bandwidth at h = n^(-1/6). Please note that the true value for predicting x0 will remain the same since it only depends on the first variable. Do the following:

- Construct a matrix of size nsim by \(29\) to record the prediction of each simulation under each choice of dimension.
- In each simulation run, for each \(j = 2, \ldots, 30\), predict x0 (just the first \(j\) dimensions) using the training data with just the first \(j\) dimensions, and record the prediction for each \(j\).

After your simulation, calculate the bias\(^2\), variance and prediction error in the same way as in your previous question, and plot the results against the number of dimensions you use. What do you observe? If your result is not stable enough to draw conclusions, you can consider increasing the number of simulations or slightly increasing the sample size.
I decided to increase the sample size to 300 and the number of simulations to 1000, and I will also use this result for Q2 b).
# define the bandwidth
h = n^(-1/6)
nsim = 1000
# define the matrix to record the prediction
allpred = matrix(NA, nsim, 29)
for (i in 1:nsim)
{
  n = 300
  p = 30
  x = matrix(rnorm(n*p), n, p)
  y = x[, 1]^2 + rnorm(n)
  for (j in 2:p)
  {
    x0 = c(1.5, rep(0, j-1))
    allpred[i, j-1] = GaussianKReg(x0, x[, 1:j], y, h)
  }
}
# calculate the bias^2 and variance
x0_bias2 = apply(allpred, 2, function(x) mean(x - 1.5^2)^2)
x0_var = apply(allpred, 2, var)
x0_error = apply(allpred, 2, function(x) mean((x - 1.5^2)^2))
# plot the bias^2, variance and prediction error
plot(2:30, x0_error, type = "l", xlab = "Dimension", ylab = "Prediction Error",
main = "Prediction Error vs. Dimension", col = "red",
ylim = c(0, 5))
lines(2:30, x0_bias2, col = "blue")
lines(2:30, x0_var, col = "green")
legend("topright", legend = c("Bias^2", "Variance", "Prediction Error"), col = c("blue", "green", "red"), lty = 1)
We know that a tree model can be viewed as an adaptive kernel. We could try to utilize it for the previous question. For this problem, use the rpart package to fit a tree model to the training data. You can use the default settings for the rpart function. Generate your data with the following mechanism.
# generate the training data
set.seed(432)
n = 300
p = 30
x = matrix(rnorm(n*p), n, p)
y = x[, 1]^2 + rnorm(n)
# define testing data
x0 = c(1.5, rep(0, p-1))
Fit a tree model using rpart, and predict the target point. For this question, fix the tuning parameter cp at 0.01 and prune your tree. You may also read the documentation to understand how to set this parameter, or ask your AI tools. Report the predicted value for x0 and the prediction error.

# fit the tree model
library(rpart)
rpart.fit <- rpart(y ~ ., data = data.frame(x, y))
prunedtree = prune(rpart.fit, cp=0.01)
# predict x0
tree.pred = predict(prunedtree, newdata = data.frame(t(x0)), type = "vector")
tree.pred
## 1
## 3.030402
# calculate the error
(tree.pred - 1.5^2)^2
## 1
## 0.6090274
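To see why the tree handles the 30-dimensional input well, it is instructive to check which variables the pruned tree actually splits on. The sketch below regenerates the same data (the column names X1, ..., X30 come from data.frame()) and lists the split variables, which should be dominated by X1, the only signal variable:

```r
library(rpart)
set.seed(432)
n = 300; p = 30
x = matrix(rnorm(n*p), n, p)
y = x[, 1]^2 + rnorm(n)
fit = prune(rpart(y ~ ., data = data.frame(x, y)), cp = 0.01)
# $frame$var holds the split variable per node ("<leaf>" marks terminal nodes)
used = setdiff(as.character(unique(fit$frame$var)), "<leaf>")
used
stopifnot("X1" %in% used)  # the signal variable is used for splitting
```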
Repeat the simulation with nsim \(= 200\) to see how your model performs on average. You do not need to consider varying \(p\) or the tuning parameter cp. Does your result show that it performs better than the NW estimator? Can you comment on why this is happening?

nsim = 200
prediction = rep(NA, nsim)
for (i in 1:nsim)
{
  n = 300
  p = 30
  x = matrix(rnorm(n*p), n, p)
  y = x[, 1]^2 + rnorm(n)
  rpart.fit <- rpart(y ~ ., data = data.frame(x, y))
  prunedtree = prune(rpart.fit, cp = 0.01)
  tree.pred = predict(prunedtree, newdata = data.frame(t(x0)), type = "vector")
  prediction[i] = tree.pred
}
# calculate the bias^2 and variance
x0_bias2 = mean(prediction - 1.5^2)^2
x0_var = var(prediction)
x0_error = mean((prediction - 1.5^2)^2)
Hence, the bias\(^2\), variance and prediction error are \(3 \times 10^{-5}\), 0.52006 and 0.51749, respectively. All of these are much smaller than those of the NW kernel estimator in high dimensions. This is because the tree adaptively selects its splitting variables and effectively ignores the 29 noise dimensions, while the NW kernel weights all dimensions equally in the distance calculation and hence suffers from the curse of dimensionality.
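As a hedged illustration of this point (a single simulated dataset with an arbitrary seed, so the exact numbers will vary from run to run), the two estimators' squared errors at x0 can be compared side by side; in high dimensions the tree's error is typically far smaller:

```r
library(rpart)
# NW estimator, copied here so the chunk is self-contained
GaussianKReg = function(x0, x, y, h){
  d = rowSums(sweep(x, 2, x0, "-")^2)
  K = exp(-d/(2*h^2))
  sum(K*y)/sum(K)
}
set.seed(1)   # illustrative seed, not from the assignment
n = 300; p = 30
x = matrix(rnorm(n*p), n, p)
y = x[, 1]^2 + rnorm(n)
x0 = c(1.5, rep(0, p-1))
nw = GaussianKReg(x0, x, y, n^(-1/6))
tree = predict(prune(rpart(y ~ ., data = data.frame(x, y)), cp = 0.01),
               newdata = data.frame(t(x0)))
c(NW_error = (nw - 1.5^2)^2, tree_error = unname((tree - 1.5^2)^2))
```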