Instruction

Please remove this section when submitting your homework.

Students are encouraged to work together on homework and/or utilize advanced AI tools. However, sharing, copying, or providing any part of a homework solution or code to others is an infraction of the University’s rules on Academic Integrity. Any violation will be punished as severely as possible. Final submissions must be uploaded to Gradescope. No email or hard copy will be accepted. For late submission policy and grading rubrics, please refer to the course website.

Question 1: Multivariate Kernel Regression Simulation (45 pts)

Similar to the previous homework, we will use simulated datasets to evaluate a kernel regression model. You should write your own code to complete this question. We use two-dimensional data generator:

\[ Y = \exp(\beta^T x) + \epsilon \]

where \(\beta = c(1, 1)\), \(X\) is generated uniformly from \([0, 1]^2\), and \(\epsilon\) follows i.i.d. standard Gaussian. Use the following code to generate a set of training and testing data:

  set.seed(2)
  trainn <- 200
  testn <- 1
  p = 2
  beta <- c(1.5, 1.5)
  
  # generate data
  
  Xtrain <- matrix(runif(trainn * p), ncol = p)
  Ytrain <- exp(Xtrain %*% beta) + rnorm(trainn)
  Xtest <- matrix(runif(testn * p), ncol = p)

  # the first testing observation
  Xtest
##           [,1]      [,2]
## [1,] 0.4152441 0.5314388
  # the true expectation of the first testing observation
  exp(Xtest %*% beta)
##          [,1]
## [1,] 4.137221
  1. [10 pts] For this question, you need to write your own code for implementing a two-dimensional Nadaraya-Watson kernel regression estimator, and predict just the first testing observation. For this task, we will use independent Gaussian kernel function introduced during the lecture. Use the same bandwidth \(h\) for both dimensions. As a starting point, use \(h = 0.07\). What is your predicted value?

  2. [20 pts] Based on our previous understanding the bias-variance trade-off of KNN, do the same simulation analysis for the kernel regression model. Again, you only need to consider the predictor of this one testing point. Your simulation needs to be able to calculate the following quantities:

Use at least 5000 simulation runs. Based on your simulation, answer the following questions:

  1. [15 pts] In practice, we will have to use cross-validation to select the optimal bandwidth. However, if you have the power of simulating as many datasets as you can, and you also know the true model, how would you find the optimal bandwidth for the bias-variance trade-off for this particular model and sample size? Provide enough evidence to claim that your selected bandwidth is (almost) optimal.

Question 2: Local Polynomial Regression (55 pts)

We introduced the local polynomial regression in the lecture, with the objective function for predicting a target point \(x_0\) defined as

\[ (\mathbf{y} - \mathbf{X}\boldsymbol{\beta}_{x_0})^\text{T} \mathbf{W} (\mathbf{y} - \mathbf{X}\boldsymbol{\beta}_{x_0}), \]

where \(W\) is a diagonal weight matrix, with the \(i\)th diagonal element defined as \(K_h(x_0, x_i)\), the kernel distance between \(x_i\) and \(x_0\). In this question, we will write our own code to implement this model. We will use the same simulated data provided at the beginning of Question 1.

  set.seed(2)
  trainn <- 200
  testn <- 1
  p = 2
  beta <- c(1.5, 1.5)
  
  # generate data
  
  Xtrain <- matrix(runif(trainn * p), ncol = p)
  Ytrain <- exp(Xtrain %*% beta) + rnorm(trainn)
  Xtest <- matrix(runif(testn * p), ncol = p)
  1. [10 pts] Using the same kernel function as Question 1, calculate the kernel weights of \(x_0\) against all observed training data points. Report the 25th, 50th and 75th percentiles of the weights so we can check your answer.

  2. [15 pts] Based on the objective function, derive the normal equation for estimating the local polynomial regression in matrix form. And then define the estimated \(\boldsymbol{\beta}_{x_0}\). Write your answer in latex.

  3. [10 pts] Based on the observed data provided in Question 1, calculate the estimated \(\boldsymbol{\beta}_{x_0}\) for the testing point Xtest using the formula you derived. Report the estimated \(\boldsymbol{\beta}_{x_0}\). Calculate the prediction on the testing point and compare it with the true expectation.

  4. [20 pts] Now, let’s use this model to predict the following 100 testing points. After you fit the model, provide a scatter plot of the true expectation versus the predicted values on these testing points. Does this seem to be a good fit? As a comparison, fit a global linear regression model to the training data and predict the testing points. Does your local linear model outperforms the global linear mode? Note: this is not a simulation study. You should use the same training data provided previously.

  set.seed(432)
  testn <- 100
  Xtest <- matrix(runif(testn * p), ncol = p)