Please remove this section when submitting your homework.
Students are encouraged to work together on homework and/or utilize advanced AI tools. However, sharing, copying, or providing any part of a homework solution or code to others is an infraction of the University’s rules on Academic Integrity. Any violation will be punished as severely as possible. Final submissions must be uploaded to Gradescope. No email or hard copy will be accepted. For late submission policy and grading rubrics, please refer to the course website.
- Your submission should be named in the format of `HWx_yourNetID.pdf`. For example, `HW01_rqzhu.pdf`. Please note that this must be a `.pdf` file; the `.html` format cannot be accepted. Make all of your R code chunks visible for grading.
- If you use the `.Rmd` file as a template, be sure to remove this instruction section.
- Make sure your version of R is \(\geq 4.0.0\). This will ensure that your random seed generation is the same as everyone else's. Please note that updating the R version may require you to reinstall all of your packages.

In this question, you are required to write your own function for kernel regression using the Nadaraya-Watson estimator. You are not allowed to use any existing functions in R to perform the kernel regression. Let's first generate our data.
```r
# generate the training data
set.seed(432)
n = 3000
p = 2
x = matrix(rnorm(n*p), n, p)
y = x[, 1]^2 + rnorm(n)

# define testing data
x0 = c(1.5, rep(0, p-1))
x0
```
Write a function to predict the value at `x0`. Your function should be in the form of `GaussianKReg(x0, x, y, h)`, where `x` and `y` are the training data and `h` is the bandwidth. The kernel function used internally should be the multivariate Gaussian kernel, defined as follows:
multivariate Gaussian kernel defined as follows:\[ K(x_0, x_i) = \exp\left( - \frac{ \lVert x_0 - x_i \rVert_2^2 }{2 h^2} \right) \]
where \(\lVert \cdot \rVert_2^2\) denotes the squared Euclidean distance and \(p\) is the dimension of the data. You should then use this kernel function in the Nadaraya-Watson kernel estimator defined in our lecture. Please make sure that your code works automatically for different values of the dimension \(p\) (see part d). To test your function, use it on your training data with `h = n^(-1/6)`, which is an optimal choice of bandwidth for \(p = 2\) under some conditions. Report the predicted value for `x0` and the error, defined as
\[ \frac{1}{\text{nsim}} \sum_{i=1}^{\text{nsim}} \left( \hat{f}(x_0) - f(x_0) \right)^2. \]
Note that this error metric is similar to the one in HW4, just in the non-parametric sense. The prediction should be reasonably close to the true value \(f(x_0) = 1.5^2 = 2.25\) due to the large sample size.
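For reference, here is a minimal sketch of how such a function could be structured. It assumes the standard Nadaraya-Watson form \(\hat{f}(x_0) = \sum_{i} K(x_0, x_i) y_i / \sum_{i} K(x_0, x_i)\); check it against the definition from lecture before relying on it.

```r
# a minimal sketch, not a definitive solution
GaussianKReg <- function(x0, x, y, h) {
  # squared Euclidean distance from x0 to every training point;
  # rowSums/sweep keep this working for any dimension p
  dist2 = rowSums(sweep(x, 2, x0)^2)
  # multivariate Gaussian kernel weights
  w = exp(-dist2 / (2 * h^2))
  # Nadaraya-Watson estimator: weighted average of the responses
  sum(w * y) / sum(w)
}

# example usage on the training data generated above
GaussianKReg(x0, x, y, h = n^(-1/6))
```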
Perform a simulation study (with `nsim` \(= 200\)), and then approximate the Bias\(^2\) and Variance based on the understandings we learned in class. Report the Bias\(^2\), Variance, and prediction error for predicting `x0`. A rough idea can be summarized in the following (please relate this to our bias and variance simulation of KNN during last week's lecture); a code sketch is given after the definitions below.

- Create a vector of length `nsim` to record the prediction of each simulation.
- In each simulation, generate a new set of training data with \(n = 100\), not \(3000\).
- In each simulation, predict `x0` using the kernel regression function you wrote in part a).

Remember that the Bias\(^2\) and Variance are defined as follows:
\[ \text{Bias}^2 = \left[ \mathbb{E} \left[ \hat{f}(x_0) \right] - f(x_0) \right]^2 \]
\[ \text{Var} = \mathbb{E} \left[ \left( \hat{f}(x_0) - \mathbb{E} \left[ \hat{f}(x_0) \right] \right)^2 \right] \]
where \(\hat{f}(x_0)\) is your prediction for the testing point, and the expectation can be approximated by averaging over your simulation runs.
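As a rough illustration only, the simulation could be organized as in the sketch below. It assumes the `GaussianKReg()` function from part a) and keeps the bandwidth rule \(h = n^{-1/6}\) with the new sample size; both choices should be adapted to your own solution.

```r
# a sketch of the part b) simulation, assuming GaussianKReg() from part a)
nsim = 200
n = 100                          # note: n = 100 here, not 3000
p = 2
x0 = c(1.5, rep(0, p-1))
f0 = 1.5^2                       # true value f(x0)
pred = rep(NA, nsim)

for (i in 1:nsim) {
  x = matrix(rnorm(n*p), n, p)   # a fresh training set in each run
  y = x[, 1]^2 + rnorm(n)
  pred[i] = GaussianKReg(x0, x, y, h = n^(-1/6))
}

bias2 = (mean(pred) - f0)^2
variance = mean((pred - mean(pred))^2)
error = mean((pred - f0)^2)      # equals bias2 + variance by decomposition
```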
[20 pts] It may be difficult to understand the bias-variance trade-off with just one choice of \(h\). Hence, let's consider a range of \(h\) values, with `h = n^(-1/6)*seq(0.1, 2, length.out = 50)`. You should then construct a matrix of size `nsim` by 50 to record the prediction of each \(h\) in each simulation run. After that, plot your bias\(^2\), variance, and prediction error against the \(h\) values. What is your conclusion for the bias-variance trade-off?
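One possible layout, reusing `nsim`, `n`, `p`, `x0`, `f0`, and `GaussianKReg()` from the sketch in part b):

```r
# a sketch for part c): rows are simulation runs, columns are bandwidths
hgrid = n^(-1/6) * seq(0.1, 2, length.out = 50)
pred = matrix(NA, nsim, length(hgrid))

for (i in 1:nsim) {
  x = matrix(rnorm(n*p), n, p)
  y = x[, 1]^2 + rnorm(n)
  for (k in seq_along(hgrid))
    pred[i, k] = GaussianKReg(x0, x, y, hgrid[k])
}

# column-wise summaries: one value per bandwidth
pbar = colMeans(pred)
bias2 = (pbar - f0)^2
variance = colMeans(sweep(pred, 2, pbar)^2)
error = colMeans((pred - f0)^2)

# plot all three curves against h
matplot(hgrid, cbind(bias2, variance, error), type = "l", lty = 1,
        xlab = "h", ylab = "value")
legend("topright", c("Bias^2", "Variance", "Error"), col = 1:3, lty = 1)
```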
[15 pts] Now we want to see how the performance of the kernel regression changes as the dimension increases. We will use the same mechanics to generate the training data, but with different values of \(p\), ranging from \(2\) to \(30\), and **with \(h\) fixed as `n^(-1/6)`**. Please note that the true value for predicting `x0` will remain the same since it only depends on the first variable. Do the following (a code sketch is given at the end of this part):

- Create a matrix of size `nsim` by \(29\) to record the prediction of each simulation under each choice of dimension.
- In each simulation, and for each \(j\) from \(2\) to \(30\), predict `x0` (just the first \(j\) dimensions) using the training data with just the first \(j\) dimensions, and record the prediction for each \(j\).

After your simulation, calculate the bias\(^2\), variance, and prediction error in the same way as in the previous question, and plot the results against the number of dimensions you use. What do you observe? If your result is not stable enough to draw conclusions, you can consider increasing the number of simulations or slightly increasing the sample size.
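A sketch of one way to set this up, again assuming `nsim`, `n`, and `GaussianKReg()` carry over from part b); the exact bookkeeping is up to you.

```r
# a sketch for part d): columns correspond to p = 2, ..., 30
pvec = 2:30
pred = matrix(NA, nsim, length(pvec))
h = n^(-1/6)                     # h is held fixed across dimensions

for (i in 1:nsim) {
  x = matrix(rnorm(n * max(pvec)), n, max(pvec))   # full 30 columns at once
  y = x[, 1]^2 + rnorm(n)
  for (j in seq_along(pvec)) {
    pj = pvec[j]
    x0j = c(1.5, rep(0, pj - 1))                   # first pj coordinates of x0
    pred[i, j] = GaussianKReg(x0j, x[, 1:pj, drop = FALSE], y, h)
  }
}

# bias^2, variance, and error are then computed column-wise as in part c)
```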
We know that a tree model can be viewed as an adaptive kernel. We could try to utilize it for the previous question. For this problem, use the `rpart` package to fit a tree model to the training data. You can use the default settings for the `rpart` function. Generate your data with the following mechanics.
```r
# generate the training data
set.seed(432)
n = 300
p = 30
x = matrix(rnorm(n*p), n, p)
y = x[, 1]^2 + rnorm(n)

# define testing data
x0 = c(1.5, rep(0, p-1))
```
[15 pts] Fit a tree model using `rpart`, and predict the target point. For this question, fix the tuning parameter `cp` as 0.01 and prune your tree. You may also read the `rpart` documentation to understand how to set this parameter, or ask your AI tools. Report the predicted value for `x0` and the prediction error.
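A minimal sketch of one way to do this. Note that `data.frame()` names the columns of an unnamed matrix `X1, ..., Xp`, so the test point must be converted the same way for `predict()` to match:

```r
library(rpart)

# fit with the default settings, then prune at the fixed cp
fit = rpart(y ~ ., data = data.frame(x, y))
fit_pruned = prune(fit, cp = 0.01)

# predict the target point; column names must match the training frame
x0dat = data.frame(matrix(x0, nrow = 1))
pred0 = predict(fit_pruned, newdata = x0dat)

pred0                 # predicted value for x0
(pred0 - 1.5^2)^2     # squared error against the true value f(x0) = 2.25
```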
[20 pts] Perform a simulation study with `nsim` \(= 200\) to see how your model performs on average. You do not need to consider varying \(p\) or the tuning parameter `cp`. Does your result show that it performs better than the NW estimator? Can you comment on why this is happening?
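A sketch of this simulation, mirroring the loop from Question 1 part b) with the tree model swapped in; it assumes `rpart` is loaded and `n = 300`, `p = 30`, `x0` are as defined above.

```r
# a sketch of the tree-model simulation
nsim = 200
pred = rep(NA, nsim)

for (i in 1:nsim) {
  x = matrix(rnorm(n*p), n, p)
  y = x[, 1]^2 + rnorm(n)
  fit = prune(rpart(y ~ ., data = data.frame(x, y)), cp = 0.01)
  pred[i] = predict(fit, newdata = data.frame(matrix(x0, nrow = 1)))
}

bias2 = (mean(pred) - 1.5^2)^2
variance = mean((pred - mean(pred))^2)
error = mean((pred - 1.5^2)^2)
```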