Instruction

Please remove this section when submitting your homework.

Students are encouraged to work together on homework and/or utilize advanced AI tools. However, there are two basic rules:

Final submissions must be uploaded to Gradescope. No email or hard copy will be accepted. Please refer to the course website for late submission policy and grading rubrics.

Question 1 [60 pts]: Bias-variance trade-off in kernel smoothing

In this question, you are required to write your own function for kernel regression using the Nadaraya-Watson estimator. You are not allowed to use any existing R function that performs kernel regression. Let’s first generate our data.

  # generate the training data
  set.seed(432)
  n = 3000
  p = 2
  x = matrix(rnorm(n*p), n, p)
  y = x[, 1]^2 + rnorm(n)

  # define testing data
  x0 = c(1.5, rep(0, p-1))
  a. [10 pts] The first question is to write a kernel regression model that can predict a testing point x0 (of dimension \(p\)). Your function should be of the form MyNW(x0, x, y, h), where x (\(n \times p\)) and y (\(n \times 1\)) are the training data and h is the bandwidth. Within the function, you should use a multivariate Gaussian kernel: \[ K(x_0, x_i) = \exp\left( - \, \frac{ \lVert x_0 - x_i \rVert_2^2 }{2 h^2} \right) \] where \(\lVert \cdot \rVert_2^2\) denotes the squared Euclidean distance and \(h\) is the bandwidth. The normalizing constant is omitted here since it cancels out in the Nadaraya-Watson estimator (it appears in both the numerator and the denominator). You should then use this kernel in the Nadaraya-Watson kernel estimator defined in our lecture. Please make sure that your code automatically works for different values of the dimension \(p\) (see part d). To test your function, use it on your training data with h = n^(-1/6), which is an optimal choice of bandwidth for \(p = 2\) under some smoothness conditions. Apply your function to predict x0, and report:

Note that the idea of this error metric is similar to HW4, but in the nonparametric setting. The prediction should be reasonably close to the true value because of the large sample size.
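A minimal sketch of what such a function could look like, using the Gaussian kernel above and the Nadaraya-Watson weighting, is given below; the internal details are of course up to you.

  # Nadaraya-Watson estimator with a multivariate Gaussian kernel (sketch)
  MyNW <- function(x0, x, y, h) {
    # squared Euclidean distance from every training point to x0
    d2 <- rowSums(sweep(x, 2, x0)^2)
    # Gaussian kernel weights; the normalizing constant is dropped since it cancels
    w <- exp(-d2 / (2 * h^2))
    # weighted average of the responses
    sum(w * y) / sum(w)
  }

  # test it on the training data generated above
  MyNW(x0, x, y, h = n^(-1/6))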

  b. [15 pts] Now, let’s perform a simulation study to calculate the Bias\(^2\) and Variance of this kernel regression estimator. We will use the same model as in part a), but change the sample size to \(n = 100\). The idea is to repeat part a) many times (nsim \(= 200\)) and then approximate the Bias\(^2\) and Variance based on what we have learned previously. Report your estimated Bias\(^2\), Variance (see our previous homework), and the average squared error \[ \frac{1}{\text{nsim}} \sum_{i=1}^{\text{nsim}} \left( \hat{f}_i(x_0) - f(x_0) \right)^2 \]
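A possible structure for this simulation is sketched below, reusing MyNW() from part a; under the data-generating model above, the true value at x0 is \(f(x_0) = 1.5^2 = 2.25\).

  # simulation sketch for part b (p and x0 are as defined earlier, with p = 2)
  nsim  <- 200
  n     <- 100
  h     <- n^(-1/6)
  ftrue <- 1.5^2                 # true f(x0) under the model above
  pred  <- rep(NA, nsim)

  for (i in 1:nsim) {
    x <- matrix(rnorm(n * p), n, p)
    y <- x[, 1]^2 + rnorm(n)
    pred[i] <- MyNW(x0, x, y, h)
  }

  bias2 <- (mean(pred) - ftrue)^2
  vars  <- var(pred)             # or mean((pred - mean(pred))^2)
  mse   <- mean((pred - ftrue)^2)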

  c. [20 pts] We cannot understand the bias-variance trade-off with just one choice of \(h\). Hence, let’s consider a range of 50 different \(h\) values, with h = n^(-1/6)*seq(0.1, 2, length.out = 50). You should then construct a matrix of size nsim by 50 to record the prediction of each \(h\) in each simulation run. After that, plot your bias\(^2\), variance and prediction error against the \(h\) values in a single figure. Summarize the pattern that you see in this simulation. Does it match our understanding of the bias-variance trade-off?
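Continuing from the sketch in part b, one way to organize this is a double loop that fills the nsim-by-50 prediction matrix and then summarizes it column by column:

  # bandwidth grid and prediction matrix (nsim by 50)
  hgrid    <- n^(-1/6) * seq(0.1, 2, length.out = 50)
  pred_mat <- matrix(NA, nsim, length(hgrid))

  for (i in 1:nsim) {
    x <- matrix(rnorm(n * p), n, p)
    y <- x[, 1]^2 + rnorm(n)
    for (j in seq_along(hgrid))
      pred_mat[i, j] <- MyNW(x0, x, y, hgrid[j])
  }

  bias2 <- (colMeans(pred_mat) - ftrue)^2
  vars  <- apply(pred_mat, 2, var)
  errs  <- colMeans((pred_mat - ftrue)^2)

  # plot all three curves against h in a single figure
  matplot(hgrid, cbind(bias2, vars, errs), type = "l", lty = 1,
          xlab = "bandwidth h", ylab = "value")
  legend("topleft", c("bias^2", "variance", "prediction error"),
         col = 1:3, lty = 1)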

  d. [15 pts] Now we want to see how the performance of the kernel regression changes as the dimension increases. We will use the same mechanism to generate the training data, but with different values of \(p\), ranging from \(2\) to \(31\), and with \(h\) fixed at n^(-1/6). Please note that the true value for predicting x0 will remain the same since it only depends on the first variable. Do the following:

After your simulation, calculate the bias\(^2\), variance and prediction error in the same way as in the previous question, and plot the results against the number of dimensions you used. What do you observe? If your result is not stable enough to draw conclusions, consider increasing the number of simulations or slightly increasing the sample size.
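A sketch of the outer loop over dimensions is given below; it reuses the same summaries as in part c, and the sample size and number of simulations shown here are only placeholders that you may increase for stability.

  # loop over dimensions p = 2, ..., 31 with h fixed at n^(-1/6)
  pgrid <- 2:31
  nsim  <- 200                   # increase if the curves look unstable
  n     <- 100                   # or slightly larger
  h     <- n^(-1/6)
  res   <- matrix(NA, nsim, length(pgrid))

  for (j in seq_along(pgrid)) {
    p  <- pgrid[j]
    x0 <- c(1.5, rep(0, p - 1))
    for (i in 1:nsim) {
      x <- matrix(rnorm(n * p), n, p)
      y <- x[, 1]^2 + rnorm(n)
      res[i, j] <- MyNW(x0, x, y, h)
    }
  }

  bias2 <- (colMeans(res) - ftrue)^2
  vars  <- apply(res, 2, var)
  errs  <- colMeans((res - ftrue)^2)
  matplot(pgrid, cbind(bias2, vars, errs), type = "l", lty = 1,
          xlab = "dimension p", ylab = "value")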

Question 2 [40 pts]: Intrinsic Low Dimension

For this question, let’s consider the effect of a low-dimensional manifold on KNN regression. We will consider a setting with some latent structure. You should generate \(n = 400\) observations with \(p = 100\) variables. Use the first 200 observations as the training data and the rest as the testing data. Perform the following steps:

Hence, the expected outcome depends only on the first two latent variables. The goal of this experiment is to observe how the KNN model is affected by the dimension of the latent space. Please keep in mind that you do not observe \(Z\) in the real world; you can only observe \(X\) and \(\mathbf{y}\). Perform the following:

  1. [15 pts] Fit a KNN regression using the generated data with \(m = 3\) and predict the corresponding testing data. Vary \(k\) over the grid seq(2, 82, 4). What are the testing errors? Repeat this experiment with \(m = 30\) and report the testing errors.
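As a sketch, assuming xtrain, xtest (each 200 x 100), ytrain, and ytest have already been produced by the generation steps above, one convenient option (not the only one) is FNN::knn.reg:

  # testing error over the k grid (sketch); xtrain, xtest, ytrain, ytest are
  # assumed to come from the data-generation steps described above
  library(FNN)

  kgrid   <- seq(2, 82, 4)
  testerr <- rep(NA, length(kgrid))
  for (j in seq_along(kgrid)) {
    fit <- knn.reg(train = xtrain, test = xtest, y = ytrain, k = kgrid[j])
    testerr[j] <- mean((ytest - fit$pred)^2)
  }
  testerr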

  2. [15 pts] Now let’s perform a simulation study that repeats 50 times. You should re-generate \(A\) each time. Average your prediction errors (for each \(k\)) over the simulation runs, just like the simulations we have done in previous HWs. At the end, you should show a plot that summarizes and compares the prediction errors of the two settings: for example, using \(k\) as the horizontal axis and the prediction error as the vertical axis, with two lines, each representing one setting of \(m\).
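A sketch of the repeated simulation and the comparison plot is given below; gen_data(m) is a hypothetical helper, standing in for the (re-)generation steps above, that returns a list with xtrain, xtest, ytrain, ytest for a given \(m\).

  # repeat the experiment 50 times for m = 3 and m = 30, re-generating A each time
  nsim  <- 50
  kgrid <- seq(2, 82, 4)
  err3  <- matrix(NA, nsim, length(kgrid))   # m = 3
  err30 <- matrix(NA, nsim, length(kgrid))   # m = 30

  for (i in 1:nsim) {
    for (m in c(3, 30)) {
      dat <- gen_data(m)   # hypothetical helper wrapping the generation steps
      for (j in seq_along(kgrid)) {
        pred <- knn.reg(dat$xtrain, dat$xtest, dat$ytrain, k = kgrid[j])$pred
        err  <- mean((dat$ytest - pred)^2)
        if (m == 3) err3[i, j] <- err else err30[i, j] <- err
      }
    }
  }

  # average over simulation runs and compare the two settings
  matplot(kgrid, cbind(colMeans(err3), colMeans(err30)), type = "l", lty = 1,
          xlab = "k", ylab = "average testing error")
  legend("topright", c("m = 3", "m = 30"), col = 1:2, lty = 1)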

  3. [10 pts] In both settings, we are still using 100 variables to fit the KNN model, but the performances are very different. Can you comment on the results? Which setting is easier for KNN to obtain a better fit? Why is that?