Question 1: Kernel Averaging and Dimensionality Issues

In this question, you are required to write your own function for kernel regression using the Nadaraya-Watson estimator. You are not allowed to use any existing functions in R to perform the kernel regression. Let’s first generate our data.

  # generate the training data
  set.seed(432)
  n = 3000
  p = 2
  x = matrix(rnorm(n*p), n, p)
  y = x[, 1]^2 + rnorm(n)

  # define testing data
  x0 = c(1.5, rep(0, p-1))
  a. [15 pts] The first question is to write a kernel regression model that can predict the testing point x0. Your function should be in the form of GaussianKReg(x0, x, y, h), where x and y are the training data and h is the bandwidth. The kernel function used internally should be the multivariate Gaussian kernel defined as follows:

\[ K(x_0, x_i) = \exp\left( - \frac{ \lVert x_0 - x_i \rVert_2^2 }{2 h^2} \right) \]

where \(\lVert \cdot \rVert_2^2\) denotes the squared Euclidean distance and \(p\) is the dimension of the data. You should then use this kernel function in the Nadaraya-Watson kernel estimator defined in our lecture. Please make sure that your code works automatically for different values of the dimension \(p\) (see part d). To test your function, use it on your training data with h = n^(-1/6), which is an optimal choice of bandwidth for \(p = 2\) under some conditions. Report the predicted value for x0 and the error, defined as

\[ \frac{1}{\text{nsim}} \sum_{i=1}^{\text{nsim}} \left( \hat{f}(x_0) - f(x_0) \right)^2. \]

Note that this error metric is similar to the one in HW4, just in a non-parametric setting. The prediction should be reasonably close to the true value due to the large sample size.

Answer:

Here is the code for the Gaussian kernel regression.
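For reference, the function below implements the Nadaraya-Watson estimator from lecture, which is simply a kernel-weighted average of the responses:

\[ \hat{f}(x_0) = \frac{\sum_{i=1}^{n} K(x_0, x_i)\, y_i}{\sum_{i=1}^{n} K(x_0, x_i)}. \]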

  # Gaussian kernel regression (Nadaraya-Watson estimator)
  GaussianKReg = function(x0, x, y, h){
    # x0: testing point
    # x: training data
    # y: response    
    # h: bandwidth

    # squared Euclidean distance from x0 to each training point
    d = rowSums(sweep(x, 2, x0, "-")^2)
    # Gaussian kernel weights
    K = exp(-d/(2*h^2))

    # kernel-weighted average of the responses
    return(sum(K*y)/sum(K))
  }

  # define bandwidth: n^(-1/(p+4)) equals n^(-1/6) when p = 2
  h = n^(-1/(p+4))

  # test the function
  pred = GaussianKReg(x0, x, y, h)
  pred
## [1] 1.946987
  # calculate the error
  (pred - 1.5^2)^2
## [1] 0.09181712
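One optional refinement of GaussianKReg (a sketch, not required by the question): for very small bandwidths the exponent \(-d/(2h^2)\) can become so negative that every kernel weight underflows to zero, producing NaN. Shifting the distances by their minimum before exponentiating multiplies every weight by the same positive constant, so the weighted average is unchanged while at least one weight is guaranteed to equal 1. The name GaussianKReg_stable below is just for illustration.

  # optional numerically-stable variant of GaussianKReg
  GaussianKReg_stable = function(x0, x, y, h){
    d = rowSums(sweep(x, 2, x0, "-")^2)
    # subtracting min(d) rescales all weights by exp(min(d)/(2*h^2)),
    # which cancels in the ratio below
    K = exp(-(d - min(d))/(2*h^2))
    sum(K*y)/sum(K)
  }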
  b. [15 pts] Now, let’s perform a simulation study to calculate the Bias\(^2\) and Variance of this kernel regression estimator. We will use the same model as in part a). The idea is to repeat part a) many times (nsim \(= 200\)), and then approximate the Bias\(^2\) and Variance based on what we learned in class. Report the Bias\(^2\), Variance and prediction error for predicting x0. A rough idea is summarized in the following (relate this to the bias and variance simulation of KNN from last week’s lecture).

Remember that the Bias\(^2\) and Variance are defined as follows:

\[ \text{Bias}^2 = \left[ \mathbb{E} \left[ \hat{f}(x_0) \right] - f(x_0) \right]^2 \]

\[ \text{Var} = \mathbb{E} \left[ \left( \hat{f}(x_0) - \mathbb{E} \left[ \hat{f}(x_0) \right] \right)^2 \right] \]

where \(\hat{f}(x_0)\) is your prediction for the testing point, and the expectation can be approximated by averaging over your simulation runs.
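Since \(f(x_0)\) is a fixed (non-random) quantity, these two pieces add up exactly to the prediction error defined earlier:

\[ \mathbb{E}\left[ \left( \hat{f}(x_0) - f(x_0) \right)^2 \right] = \text{Bias}^2 + \text{Var}, \]

which gives a useful sanity check: the reported Bias\(^2\) and Variance should approximately sum to the reported prediction error (up to the \(\text{nsim} - 1\) denominator used by var()).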

  nsim = 200

  # vector to record the prediction from each simulation run
  allpred = rep(NA, nsim)

  for (i in 1:nsim)
  {
    n = 100
    p = 2
    x = matrix(rnorm(n*p), n, p)
    y = x[, 1]^2 + rnorm(n)

    h = n^(-1/(p+4))

    allpred[i] = GaussianKReg(x0, x, y, h)
  }

  # calculate the bias^2 and variance
  x0_bias2 = mean(allpred - 1.5^2)^2
  x0_var = var(allpred)
  x0_error = mean((allpred - 1.5^2)^2)

Hence, the bias\(^2\), variance and prediction error are 0.2997517, 0.1409493 and 0.4399962, respectively.

  c. [20 pts] It may be difficult to understand the bias-variance trade-off with just one choice of \(h\). Hence, let’s consider a range of \(h\) values, with h = n^(-1/6)*seq(0.1, 2, length.out = 50). You should then construct a matrix of size nsim by 50 to record the prediction for each \(h\) in each simulation run. After that, plot your bias\(^2\), variance and prediction error against the \(h\) values. What is your conclusion for the bias-variance trade-off?
  # define the grid of bandwidths (n = 100, as in the simulation loop below)
  h = n^(-1/6)*seq(0.1, 2, length.out = 50)

  # define the matrix to record the prediction
  allpred = matrix(NA, nsim, length(h))

  for (i in 1:nsim)
  {
    n = 100
    p = 2
    x = matrix(rnorm(n*p), n, p)
    y = x[, 1]^2 + rnorm(n)

    for (j in seq_along(h))
    {
      allpred[i, j] = GaussianKReg(x0, x, y, h[j])
    }
  }

  # calculate the bias^2 and variance
  x0_bias2 = apply(allpred, 2, function(x) mean(x - 1.5^2)^2)
  x0_var = apply(allpred, 2, var)
  x0_error = apply(allpred, 2, function(x) mean((x - 1.5^2)^2))
  # plot the bias^2, variance and prediction error
  plot(h, x0_error, type = "l", xlab = "h", ylab = "Prediction Error", 
       main = "Prediction Error vs. h", col = "red", 
       ylim = c(0, 1.5))
  lines(h, x0_bias2, col = "blue")
  lines(h, x0_var, col = "green")
  legend("topright", legend = c("Bias^2", "Variance", "Prediction Error"), col = c("blue", "green", "red"), lty = 1)

  d. [15 pts] Now we want to see how the performance of the kernel regression changes as the dimension increases. We will use the same mechanism to generate the training data, but with different values of \(p\), ranging from \(2\) to \(30\), and with \(h\) fixed at n^(-1/6). Please note that the true value for predicting x0 will remain the same since it only depends on the first variable.

After your simulation, calculate the bias\(^2\), variance and prediction error in the same way as in the previous question, and plot the results against the number of dimensions you use. What do you observe? If your result is not stable enough to draw conclusions, you can consider increasing the number of simulations or slightly increasing the sample size.

I decided to increase the sample size to 300, and I will also use this result for Q2 b).

  # sample size and dimension for this simulation (they do not change across runs)
  n = 300
  p = 30

  # define the bandwidth, fixed at n^(-1/6) as required
  # (n must be set to 300 before this line so that h matches the sample size below)
  h = n^(-1/6)

  nsim = 1000

  # define the matrix to record the prediction for dimensions 2 through 30
  allpred = matrix(NA, nsim, 29)

  for (i in 1:nsim)
  {
    x = matrix(rnorm(n*p), n, p)
    y = x[, 1]^2 + rnorm(n)

    # use only the first j columns of x so the prediction is based on j dimensions
    for (j in 2:p)
    {
      x0 = c(1.5, rep(0, j-1))
      allpred[i, j-1] = GaussianKReg(x0, x[, 1:j], y, h)
    }
  }

  # calculate the bias^2 and variance
  x0_bias2 = apply(allpred, 2, function(x) mean(x - 1.5^2)^2)
  x0_var = apply(allpred, 2, var)
  x0_error = apply(allpred, 2, function(x) mean((x - 1.5^2)^2))
  # plot the bias^2, variance and prediction error
  plot(2:30, x0_error, type = "l", xlab = "Dimension", ylab = "Prediction Error", 
       main = "Prediction Error vs. Dimension", col = "red", 
       ylim = c(0, 5))
  lines(2:30, x0_bias2, col = "blue")
  lines(2:30, x0_var, col = "green")
  legend("topright", legend = c("Bias^2", "Variance", "Prediction Error"), col = c("blue", "green", "red"), lty = 1)

Question 2: Tree Regression and Adaptive Kernel

We know that a tree model can be viewed as an adaptive kernel. We could try to utilize it for the previous question. For this problem, use the rpart package to fit a tree model to the training data. You can use the default settings for the rpart function. Generate your data with the following mechanism.

  # generate the training data
  set.seed(432)
  n = 300
  p = 30
  x = matrix(rnorm(n*p), n, p)
  y = x[, 1]^2 + rnorm(n)

  # define testing data
  x0 = c(1.5, rep(0, p-1))
  a. [15 pts] Fit a tree model using rpart, and predict the target point. For this question, fix the tuning parameter cp at 0.01 and prune your tree. You may also read the rpart documentation to understand how to set this parameter, or ask your AI tools. Report the predicted value for x0 and the prediction error.
  # fit the tree model
  library(rpart)
  rpart.fit <- rpart(y ~ ., data = data.frame(x, y))
  prunedtree = prune(rpart.fit, cp=0.01)

  # predict x0
  tree.pred = predict(prunedtree, newdata = data.frame(t(x0)), type = "vector")
  tree.pred
##        1 
## 3.030402
  # calculate the error 
  (tree.pred - 1.5^2)^2
##         1 
## 0.6090274
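To connect this to the adaptive-kernel view mentioned at the start of this question, here is a small sketch (assuming distinct terminal nodes have distinct fitted means, which holds almost surely with a continuous response): the tree prediction at x0 is just the average of y over the training points that fall in the same terminal node as x0, i.e., a kernel average with data-driven 0/1 weights.

  # identify the training points sharing x0's terminal node by matching
  # fitted values (each leaf has its own fitted mean)
  leaf_mean = predict(prunedtree, newdata = data.frame(t(x0)))
  in_leaf = abs(predict(prunedtree) - leaf_mean) < 1e-12
  mean(y[in_leaf])  # equals the tree prediction for x0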
  b. [20 pts] Perform a simulation study with nsim \(= 200\) to see how your model performs on average. You do not need to consider varying \(p\) or the tuning parameter cp. Does your result show that it performs better than the NW estimator? Can you comment on why this is happening?
  nsim = 200
  prediction = rep(NA, nsim)

  for (i in 1:nsim)
  {
    n = 300
    p = 30
    x = matrix(rnorm(n*p), n, p)
    y = x[, 1]^2 + rnorm(n)

    rpart.fit <- rpart(y ~ ., data = data.frame(x, y))
    prunedtree = prune(rpart.fit, cp=0.01)

    tree.pred = predict(prunedtree, newdata = data.frame(t(x0)), type = "vector")
    prediction[i] = tree.pred
  }

  # calculate the bias^2 and variance
  x0_bias2 = mean(prediction - 1.5^2)^2
  x0_var = var(prediction)
  x0_error = mean((prediction - 1.5^2)^2)

Hence, the bias\(^2\), variance and prediction error are approximately \(3 \times 10^{-5}\), 0.52006 and 0.51749, respectively. All of these are much smaller than the corresponding values for the NW kernel estimator at \(p = 30\) in Q1 d). This happens because the tree adaptively selects its splitting variables and therefore effectively performs variable selection: the 29 noise dimensions have little influence on which training points end up being averaged together for the prediction at x0. The NW estimator, in contrast, uses an isotropic Gaussian kernel that weights all 30 coordinates equally, so the noise dimensions dominate the distances and the estimator suffers from the curse of dimensionality.
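As an optional check supporting this explanation, one can inspect which variables the last pruned tree from the loop above actually splits on; since only \(X_1\) carries signal, the splits should concentrate on X1:

  # split variables used by the (last) pruned tree; "<leaf>" marks terminal nodes
  splits = as.character(prunedtree$frame$var)
  unique(splits[splits != "<leaf>"])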