Instruction

Please remove this section when submitting your homework.

Students are encouraged to work together on homework and/or utilize advanced AI tools. However, sharing, copying, or providing any part of a homework solution or code to others is an infraction of the University’s rules on Academic Integrity. Any violation will be punished as severely as possible. Final submissions must be uploaded to Gradescope. No email or hard copy will be accepted. For late submission policy and grading rubrics, please refer to the course website.

Question 1: Kernel Averaging and Dimensionality Issues

In this question, you are required to write your own function for kernel regression using the Nadaraya-Watson estimator. You are not allowed to use any existing functions in R to perform the kernel regression. Let’s first generate our data.

  # generate the training data
  set.seed(432)
  n = 3000
  p = 2
  x = matrix(rnorm(n*p), n, p)
  y = x[, 1]^2 + rnorm(n)

  # define testing data
  x0 = c(1.5, rep(0, p-1))

  a. [15 pts] The first question is to write a kernel regression function that can predict the testing point x0. Your function should be of the form GaussianKReg(x0, x, y, h), where x and y are the training data and h is the bandwidth. The kernel used internally should be the multivariate Gaussian kernel defined as follows:

\[ K(x_0, x_i) = \exp\left( - \frac{ \lVert x_0 - x_i \rVert_2^2 }{2 h^2} \right) \]

where \(\lVert \cdot \rVert_2^2\) denotes the squared Euclidean distance and \(p\) is the dimension of the data. You should then use this kernel in the Nadaraya-Watson kernel estimator, defined in our lecture. Please make sure that your code automatically works for different values of the dimension \(p\) (see part d). To test your function, use it on your training data with h = n^(-1/6), which is an optimal choice of bandwidth for \(p = 2\) under some conditions. Report the predicted value for x0 and the error, defined as

\[ \frac{1}{\text{nsim}} \sum_{i=1}^{\text{nsim}} \left( \hat{f}(x_0) - f(x_0) \right)^2. \]

Note that this error metric is similar to the one in HW4, just in the non-parametric setting. Your prediction should be reasonably close to the true value (here \(f(x_0) = 1.5^2 = 2.25\)) due to the large sample size.
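
As a starting point, here is a minimal sketch of one possible implementation, assuming the data generated above; the internal variable names (d2, w) are only illustrative. It computes the Gaussian kernel weight of every training point and returns the Nadaraya-Watson weighted average \(\hat{f}(x_0) = \sum_{i} K(x_0, x_i) y_i / \sum_{i} K(x_0, x_i)\).

  # a minimal sketch of GaussianKReg; internal variable names are illustrative
  GaussianKReg <- function(x0, x, y, h) {
    # squared Euclidean distance from each training point to x0
    d2 = rowSums(sweep(x, 2, x0)^2)
    # Gaussian kernel weights
    w = exp(-d2 / (2 * h^2))
    # Nadaraya-Watson estimator: kernel-weighted average of y
    sum(w * y) / sum(w)
  }

  # test on the training data with the suggested bandwidth
  h = n^(-1/6)
  GaussianKReg(x0, x, y, h)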

  b. [15 pts] Now, let’s perform a simulation study to calculate the Bias\(^2\) and Variance of this kernel regression estimator. We will use the same model as in part a). The idea is to repeat part a) many times (nsim \(= 200\)) and then approximate the Bias\(^2\) and Variance based on what we learned in class. Report the Bias\(^2\), Variance, and prediction error for predicting x0. A rough outline is given in the sketch after the definitions below (please relate this to the bias and variance simulation of KNN from last week’s lecture).

Remember that the Bias\(^2\) and Variance are defined as follows:

\[ \text{Bias}^2 = \left[ \mathbb{E} \left[ \hat{f}(x_0) \right] - f(x_0) \right]^2 \]

\[ \text{Var} = \mathbb{E} \left[ \left( \hat{f}(x_0) - \mathbb{E} \left[ \hat{f}(x_0) \right] \right)^2 \right] \]

where \(\hat{f}(x_0)\) is your prediction for the testing point, and the expectation can be approximated by averaging over your simulation runs.
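
As referenced in part b), one possible way to organize the simulation is sketched below; it assumes the GaussianKReg function from part a) and the data-generating mechanism above, and the object names (pred, f0) are only illustrative.

  # sketch of the simulation study for part b)
  nsim = 200
  h = n^(-1/6)
  f0 = 1.5^2                                 # true value f(x0) = x0[1]^2
  pred = rep(NA, nsim)

  for (i in 1:nsim) {
    # regenerate the training data in each run
    x = matrix(rnorm(n*p), n, p)
    y = x[, 1]^2 + rnorm(n)
    pred[i] = GaussianKReg(x0, x, y, h)
  }

  bias2 = (mean(pred) - f0)^2                # squared bias
  variance = mean((pred - mean(pred))^2)     # variance
  err = mean((pred - f0)^2)                  # prediction error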

  c. [20 pts] It may be difficult to understand the bias-variance trade-off with just one choice of \(h\). Hence, let’s consider a range of \(h\) values, with h = n^(-1/6)*seq(0.1, 2, length.out = 50). You should then construct a matrix of size nsim by 50 to record the prediction for each \(h\) in each simulation run, as sketched below. After that, plot your bias\(^2\), variance, and prediction error against the \(h\) values. What is your conclusion about the bias-variance trade-off?
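
A possible layout for this part is sketched below, assuming the GaussianKReg function and the simulation setup from the previous parts; the names hgrid and allpred are only illustrative.

  # sketch: record the prediction for each h in each simulation run
  hgrid = n^(-1/6) * seq(0.1, 2, length.out = 50)
  allpred = matrix(NA, nsim, length(hgrid))

  for (i in 1:nsim) {
    x = matrix(rnorm(n*p), n, p)
    y = x[, 1]^2 + rnorm(n)
    for (j in 1:length(hgrid))
      allpred[i, j] = GaussianKReg(x0, x, y, hgrid[j])
  }

  # bias^2, variance and prediction error as functions of h
  bias2 = (colMeans(allpred) - f0)^2
  variance = apply(allpred, 2, var)   # var() uses n-1; nearly identical for nsim = 200
  err = colMeans((allpred - f0)^2)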

  d. [15 pts] Now we want to see how the performance of the kernel regression changes as the dimension increases. We will use the same mechanism to generate the training data, but with different values of \(p\) ranging from \(2\) to \(30\), and with \(h\) fixed at n^(-1/6). Please note that the true value for predicting x0 remains the same since it only depends on the first variable. Do the following:

After your simulation, calculate the bias\(^2\), variance, and prediction error in the same way as in the previous question, and plot the results against the number of dimensions used. What do you observe? If your result is not stable enough to draw conclusions, consider increasing the number of simulations or slightly increasing the sample size.
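
The loop for this part can be organized as sketched below, assuming the GaussianKReg function from part a); the names pgrid and dimpred are only illustrative. Note that x0 must be re-defined for each \(p\), while \(h\) stays fixed at n^(-1/6).

  # sketch: repeat the simulation for each dimension p, with h fixed
  pgrid = 2:30
  h = n^(-1/6)
  dimpred = matrix(NA, nsim, length(pgrid))

  for (j in 1:length(pgrid)) {
    p = pgrid[j]
    x0 = c(1.5, rep(0, p - 1))
    for (i in 1:nsim) {
      x = matrix(rnorm(n*p), n, p)
      y = x[, 1]^2 + rnorm(n)
      dimpred[i, j] = GaussianKReg(x0, x, y, h)
    }
  }

  bias2 = (colMeans(dimpred) - f0)^2
  variance = apply(dimpred, 2, var)
  err = colMeans((dimpred - f0)^2)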

Question 2: Tree Regression and Adaptive Kernel

We know that a tree model can be viewed as an adaptive kernel, so we could try to utilize it for the previous problem. For this question, use the rpart package to fit a tree model to the training data. You can use the default settings of the rpart function. Generate your data with the following mechanism.

  # generate the training data
  set.seed(432)
  n = 300
  p = 30
  x = matrix(rnorm(n*p), n, p)
  y = x[, 1]^2 + rnorm(n)

  # define testing data
  x0 = c(1.5, rep(0, p-1))

  a. [15 pts] Fit a tree model using rpart, and predict the target point. For this question, fix the tuning parameter cp at 0.01 and prune your tree. You may also read the rpart documentation to understand how to set this parameter, or ask your AI tools. Report the predicted value for x0 and the prediction error.
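
A minimal sketch of the fit-prune-predict steps is given below, assuming the data generated above; the object names (dat, x0df) are only illustrative, and the rpart and prune calls follow the standard rpart interface.

  # sketch: fit a tree, prune at cp = 0.01, and predict x0
  library(rpart)

  dat = data.frame(x, y)
  fit = rpart(y ~ ., data = dat)
  fit = prune(fit, cp = 0.01)

  # predict() expects a data frame whose columns match the training data
  x0df = data.frame(matrix(x0, 1, p))
  colnames(x0df) = colnames(dat)[1:p]
  predict(fit, x0df)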

  b. [20 pts] Perform a simulation study with nsim \(= 200\) to see how your model performs on average. You do not need to consider varying \(p\) or the tuning parameter cp. Does your result show that it performs better than the NW estimator? Can you comment on why this is happening?
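
One way to set up this simulation is sketched below, reusing the data-generating mechanism from above and the x0df test point introduced in the sketch for part a); the name treepred is only illustrative.

  # sketch: simulation study for the pruned tree model
  nsim = 200
  f0 = 1.5^2
  treepred = rep(NA, nsim)

  for (i in 1:nsim) {
    x = matrix(rnorm(n*p), n, p)
    y = x[, 1]^2 + rnorm(n)
    dat = data.frame(x, y)
    fit = prune(rpart(y ~ ., data = dat), cp = 0.01)
    treepred[i] = predict(fit, x0df)
  }

  bias2 = (mean(treepred) - f0)^2
  variance = mean((treepred - mean(treepred))^2)
  err = mean((treepred - f0)^2)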