Instruction

Please remove this section when submitting your homework.

Students are encouraged to work together on homework and/or utilize advanced AI tools. However, there are two basic rules:

Final submissions must be uploaded to Gradescope. No email or hard copy will be accepted. Please refer to the course website for late submission policy and grading rubrics.

Question 1 [60 pts]: Bias-variance trade-off in kernel smoothing

In this question, you are required to write your own function for kernel regression using the Nadaraya-Watson estimator. You are not allowed to use any existing R function that performs kernel regression. Let’s first generate our data.

  # generate the training data
  set.seed(432)
  n = 3000
  p = 2
  x = matrix(rnorm(n*p), n, p)
  y = x[, 1]^2 + rnorm(n)

  # define testing data
  x0 = c(1.5, rep(0, p-1))
  a. [10 pts] The first question is to write a kernel regression model that can predict a testing point x0 (of dimension \(p\)). Your function should be of the form MyNW(x0, x, y, h), where x (\(n \times p\)) and y (\(n \times 1\)) are the training data and h is the bandwidth. Within the function, you should use a multivariate Gaussian kernel: \[ K(x_0, x_i) = \exp\left( - \, \frac{ \lVert x_0 - x_i \rVert_2^2 }{2 h^2} \right) \] where \(\lVert \cdot \rVert_2^2\) denotes the squared Euclidean distance and \(h\) is the bandwidth. The normalizing constant is omitted here since it cancels out in the Nadaraya-Watson estimator (it appears in both the numerator and the denominator). You should then use this kernel in the Nadaraya-Watson kernel estimator defined in our lecture. Please make sure that your code automatically works for different values of the dimension \(p\) (see part d). To test your function, use it on your training data with h = n^(-1/6), which is an optimal choice of bandwidth for \(p = 2\) under some smoothness conditions. Apply your function to predict x0, and report:

Note that the idea of this error metric is similar to HW4, but in the nonparametric setting. The prediction should be reasonably close to the true value because of the large sample size.
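A minimal sketch of what such a function could look like, using the Gaussian kernel above and the Nadaraya-Watson weighting, is given below; the internal details are of course up to you.

  # Nadaraya-Watson estimator with a multivariate Gaussian kernel (sketch)
  MyNW <- function(x0, x, y, h) {
    # squared Euclidean distance from every training point to x0
    d2 <- rowSums(sweep(x, 2, x0)^2)
    # Gaussian kernel weights; the normalizing constant is dropped since it cancels
    w <- exp(-d2 / (2 * h^2))
    # weighted average of the responses
    sum(w * y) / sum(w)
  }

  # test it on the training data generated above
  MyNW(x0, x, y, h = n^(-1/6))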

  b. [15 pts] Now, let’s perform a simulation study to calculate the Bias\(^2\) and Variance of this kernel regression estimator. We will use the same model as in part a), but change the sample size to \(n = 100\). The idea is to repeat part a) many times (nsim \(= 200\)) and then approximate the Bias\(^2\) and Variance based on what we have learned previously. Report your estimated Bias\(^2\), Variance (see our previous homework), and the average squared error \[ \frac{1}{\text{nsim}} \sum_{i=1}^{\text{nsim}} \left( \hat{f}_i(x_0) - f(x_0) \right)^2 \]
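A possible structure for this simulation is sketched below, reusing MyNW() from part a; under the data-generating model above, the true value at x0 is \(f(x_0) = 1.5^2 = 2.25\).

  # simulation sketch for part b (p and x0 are as defined earlier, with p = 2)
  nsim  <- 200
  n     <- 100
  h     <- n^(-1/6)
  ftrue <- 1.5^2                 # true f(x0) under the model above
  pred  <- rep(NA, nsim)

  for (i in 1:nsim) {
    x <- matrix(rnorm(n * p), n, p)
    y <- x[, 1]^2 + rnorm(n)
    pred[i] <- MyNW(x0, x, y, h)
  }

  bias2 <- (mean(pred) - ftrue)^2
  vars  <- var(pred)             # or mean((pred - mean(pred))^2)
  mse   <- mean((pred - ftrue)^2)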

  c. [20 pts] We cannot understand the bias-variance trade-off with just one choice of \(h\). Hence, let’s consider a range of 50 different \(h\) values, with h = n^(-1/6)*seq(0.1, 2, length.out = 50). You should then construct a matrix of size nsim by 50 to record the prediction of each \(h\) in each simulation run. After that, plot your bias\(^2\), variance and prediction error against the \(h\) values in a single figure. Summarize the pattern that you see in this simulation. Does it match our understanding of the bias-variance trade-off?
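Continuing from the sketch in part b, one way to organize this is a double loop that fills the nsim-by-50 prediction matrix and then summarizes it column by column:

  # bandwidth grid and prediction matrix (nsim by 50)
  hgrid    <- n^(-1/6) * seq(0.1, 2, length.out = 50)
  pred_mat <- matrix(NA, nsim, length(hgrid))

  for (i in 1:nsim) {
    x <- matrix(rnorm(n * p), n, p)
    y <- x[, 1]^2 + rnorm(n)
    for (j in seq_along(hgrid))
      pred_mat[i, j] <- MyNW(x0, x, y, hgrid[j])
  }

  bias2 <- (colMeans(pred_mat) - ftrue)^2
  vars  <- apply(pred_mat, 2, var)
  errs  <- colMeans((pred_mat - ftrue)^2)

  # plot all three curves against h in a single figure
  matplot(hgrid, cbind(bias2, vars, errs), type = "l", lty = 1,
          xlab = "bandwidth h", ylab = "value")
  legend("topleft", c("bias^2", "variance", "prediction error"),
         col = 1:3, lty = 1)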

  d. [15 pts] Now we want to see how the performance of the kernel regression changes as the dimension increases. We will use the same mechanism to generate the training data, but with different values of \(p\), ranging from \(2\) to \(31\), and with \(h\) fixed at n^(-1/6). Please note that the true value for predicting x0 will remain the same since it only depends on the first variable. Do the following:

After your simulation, calculate the bias\(^2\), variance and prediction error in the same way as in the previous question, and plot the results against the number of dimensions you used. What do you observe? If your result is not stable enough to draw conclusions, consider increasing the number of simulations or slightly increasing the sample size.
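A sketch of the outer loop over dimensions is given below; it reuses the same summaries as in part c, and the sample size and number of simulations shown here are only placeholders that you may increase for stability.

  # loop over dimensions p = 2, ..., 31 with h fixed at n^(-1/6)
  pgrid <- 2:31
  nsim  <- 200                   # increase if the curves look unstable
  n     <- 100                   # or slightly larger
  h     <- n^(-1/6)
  res   <- matrix(NA, nsim, length(pgrid))

  for (j in seq_along(pgrid)) {
    p  <- pgrid[j]
    x0 <- c(1.5, rep(0, p - 1))
    for (i in 1:nsim) {
      x <- matrix(rnorm(n * p), n, p)
      y <- x[, 1]^2 + rnorm(n)
      res[i, j] <- MyNW(x0, x, y, h)
    }
  }

  bias2 <- (colMeans(res) - ftrue)^2
  vars  <- apply(res, 2, var)
  errs  <- colMeans((res - ftrue)^2)
  matplot(pgrid, cbind(bias2, vars, errs), type = "l", lty = 1,
          xlab = "dimension p", ylab = "value")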

Question 2 [40 pts]: Intrinsic Low Dimension

For this question, let’s consider the effect of a low-dimensional manifold on KNN regression. We will consider a setting with some latent structure. You should generate \(n = 400\) observations with \(p = 100\) variables. Use the first 200 observations as the training data and the rest as the testing data. Perform the following steps:

Hence, the expected outcome depends only on the first two latent variables. The goal of this experiment is to observe how the KNN model is affected by the dimension of the latent space. Please keep in mind that you do not observe \(Z\) in the real world; you can only observe \(X\) and \(\mathbf{y}\). Perform the following:

  1. [15 pts] Fit a KNN regression using the generated data with \(m = 3\) and predict the corresponding testing data. Vary \(k\) over the grid seq(2, 82, 4). What are the testing errors? Repeat this experiment with \(m = 30\) and report the testing errors.
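As a sketch, assuming xtrain, xtest (each 200 x 100), ytrain, and ytest have already been produced by the generation steps above, one convenient option (not the only one) is FNN::knn.reg:

  # testing error over the k grid (sketch); xtrain, xtest, ytrain, ytest are
  # assumed to come from the data-generation steps described above
  library(FNN)

  kgrid   <- seq(2, 82, 4)
  testerr <- rep(NA, length(kgrid))
  for (j in seq_along(kgrid)) {
    fit <- knn.reg(train = xtrain, test = xtest, y = ytrain, k = kgrid[j])
    testerr[j] <- mean((ytest - fit$pred)^2)
  }
  testerr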

  2. [15 pts] Now let’s perform a simulation study that repeats 50 times. You should re-generate \(A\) each time. Average your prediction errors (for each \(k\)) over the simulation runs, just like the simulations we have done in previous HWs. At the end, you should show a plot that summarizes and compares the prediction errors of the two settings: for example, using \(k\) as the horizontal axis and the prediction error as the vertical axis, with two lines, each representing one setting of \(m\).
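A sketch of the repeated simulation and the comparison plot is given below; gen_data(m) is a hypothetical helper, standing in for the (re-)generation steps above, that returns a list with xtrain, xtest, ytrain, ytest for a given \(m\).

  # repeat the experiment 50 times for m = 3 and m = 30, re-generating A each time
  nsim  <- 50
  kgrid <- seq(2, 82, 4)
  err3  <- matrix(NA, nsim, length(kgrid))   # m = 3
  err30 <- matrix(NA, nsim, length(kgrid))   # m = 30

  for (i in 1:nsim) {
    for (m in c(3, 30)) {
      dat <- gen_data(m)   # hypothetical helper wrapping the generation steps
      for (j in seq_along(kgrid)) {
        pred <- knn.reg(dat$xtrain, dat$xtest, dat$ytrain, k = kgrid[j])$pred
        err  <- mean((dat$ytest - pred)^2)
        if (m == 3) err3[i, j] <- err else err30[i, j] <- err
      }
    }
  }

  # average over simulation runs and compare the two settings
  matplot(kgrid, cbind(colMeans(err3), colMeans(err30)), type = "l", lty = 1,
          xlab = "k", ylab = "average testing error")
  legend("topright", c("m = 3", "m = 30"), col = 1:2, lty = 1)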

  3. [10 pts] In both settings, we are still using 100 variables to fit the KNN model, but the performances are very different. Can you comment on the results? Which setting is easier for KNN to obtain a better fit? Why is that?