Instruction
About HW6
Question 1 [40 Points] Write Your Own Spline Basis (Univariate Spline Fit)
Question 2 [60 Points] Kernel Regression

Instruction

Students are encouraged to work together on homework. However, sharing, copying, or providing any part of a homework solution or code is an infraction of the University’s rules on Academic Integrity. Any violation will be punished as severely as possible. Final submissions must be uploaded to compass2g. No email or hardcopy will be accepted. For late submission policy and grading rubrics, please refer to the course website.

What is expected for the submission to Gradescope
- You are required to submit one rendered PDF file HWx_yourNetID.pdf. For example, HW01_rqzhu.pdf. Please note that this must be a .pdf file generated by a .Rmd file. .html format cannot be accepted.
- Please follow the instructions on Gradescope to select corresponding PDF pages for each question.
Please note that your homework file is a PDF report instead of a messy collection of R codes. This report should include:
- Your Name and NetID. (Replace Ruoqing Zhu (rqzhu) by your name and NetID if you are using this template).
- Make all of your R code chunks visible for grading.
- Relevant outputs from your R code chunks that support your answers.
- Provide clear answers or conclusions for each question. For example, you could start with Answer: I fit SVM with the following choice of tuning parameters ...
- Many assignments require your own implementation of algorithms. Basic comments are strongly encouraged to explain the logic to our graders. However, line-by-line code comments are unnecessary.
Requirements regarding the .Rmd file.
- You do NOT need to submit Rmd files. However, your PDF file should be rendered directly from it.
- Make sure that you set random seeds for simulation or randomized algorithms so that the results are reproducible. If a specific seed number is not provided in the homework, you can consider using your NetID.
- For some questions, there will be restrictions on what packages/functions you can use. Please read the requirements carefully. As long as the question does not specify such restrictions, you can use anything.

About HW6

Understand how the spline basis is constructed. We will use the Ozone data from the mlbench package.
Kernel regression involves two decisions: choosing the kernel and tuning the bandwidth. Usually, tuning the bandwidth is more influential than choosing the kernel function. Tuning the bandwidth is similar to tuning \(k\) in a KNN model. However, this is more difficult in multi-dimensional models. We practice one and two-dimensional kernels that involves these elements.

Question 1 [40 Points] Write Your Own Spline Basis (Univariate Spline Fit)

We will fit and compare different spline models to the Ozone dataset form the mlbench package. We have pre-processed this dataset. Please use our provided train/test csv files. This dataset has three variables: time, ozone and wind. For the spline question, we will only use time as the covariate and ozone as the outcome.

  train = read.csv("..//data//ozoneTrain.csv")
  test  = read.csv("..//data//ozoneTest.csv")
  par(mfrow=c(1,2))

  plot(train$time, train$ozone, pch = 19, cex = 0.5)
  plot(train$wind + runif(nrow(train), -0.15, 0.15), 
       train$ozone, pch = 19, cex = 0.5)

Let’s consider several different spline constructions to model the ozone level using time. Please read the requirements carefully since some of them require you to write your own code. - To test your model, use the train/test split provided above. - Use the mean squared error as the metric for evaluation and report the MSE for each method in a single summary table. - For the spline basis that you write with your own code, make sure to include the intercept term. - For question a) and b) and d), provide the following summary information at the end of your answer: - A table of MSE on testing data. Please label each method clearly. - Three figures (you could consider making them side-by-side for a better comparison) at the end. Each figure consists of scatter plots of both training and testing data points and the fitted curve on the range x = seq(0, 1, by = 0.01).

[10 pts] Write your own code (you cannot use bs() or similar functions) to implement a continuous piece-wise linear fitting. Pick 4 knots at \((0.2, 0.4, 0.6, 0.8)\).
[10 pts] Write your own code to implement a quadratic spline fitting. Your spline should be continuous up to the first derivative. Pick 4 knots as \((0.2, 0.4, 0.6, 0.8)\).
[10 pts] Produce a set of B-spline basis with the same knots and degrees as (b) using the bs() function. Note that they do not have to be exactly the same as yours. Compare the design matrix from b) and c) as follows:
- Check the the difference between their projection matrices (the hat-matrices of the corresponding linear regression) on the training data to verify that the column spaces are the same. The difference is measured by \(\text{max}_{i, j} |H_{i, j}^{(1)} - H_{i, j}^{(2)}|\), where \(H^{(1)}\) and \(H^{(2)}\) are corresponding to the two hat-matrices.
- Compare the conditional number \(\frac{\sigma_\text{max}}{\sigma_\text{min}}\) of each deign matrix, where \(\sigma_\text{max}\) and \(\sigma_\text{min}\) denote the largest and smallest singular values, respectively.
- [bonus 2 pts] Why B-spline has a smaller condition number even though two design matrices have the same column space. Some basic information of the conditional number (for linear regression) can be found here.
[10 pts] Use existing functions to implement a smoothing spline. Use the built-in generalized cross-validation method to select the best tuning parameter.

Question 2 [60 Points] Kernel Regression

We will use the same ozone data. For Question a), we only use time as the covariate, while in Question b, we use both time and wind.

a) [30 pts] One-dimensional kernel regression.

You are required to implement (write your own code) two kernel regression models, using the following two kernels function, respectively:

Gaussian kernel, defined as \(K(u) = \frac{1}{\sqrt{2 \pi}} e^{- \frac{u^2}{2}}\)
Epanechnikov kernel, defined as \(K(u) = \frac{3}{4}(1-u^2)\) for \(|u| \leq 1\).

For both kernel functions, incorporate a bandwidth \(\lambda\). You should start with the Silverman’s rule-of-thumb for the choice of \(\lambda\), and then tune \(\lambda\) (for example, increase or decrease by 10%, 20%, 30% etc.). Then, perform the following:

1. [20 pts] Using just the Silverman’s rule-of-thumb, fit and plot the regression line with both kernel functions, in a single figure. Add the training/testing points, just like Question 1. Report the testing MSE of both methods in a table.
1. [10 pts] For the Epanechnikov kernel, tune the \(\lambda\) value by minimizing the testing error. Use a grid of 10 different \(\lambda\) values at your choice. What is the best \(\lambda\) value that minimizes the testing error? Plot your optimal regression line and report the best \(\lambda\) and the testing error.

b [25 Points] Two-dimensional Kernel

We consider using both time and wind in the regression. We use the following multivariate kernel function, which is essentially a Gaussian kernel with diagonal covariance matrix. You can also view this as the product of two kernel functions, corresponding to the two variables:

\[ K_{\boldsymbol \lambda}(x, z) \propto \exp\Big\{ -\frac{1}{2} \sum_{j=1}^p \big( (x_j - z_j)/\lambda_j \big)^2 \Big\}\] Based on the Silverman’s formula, the bandwidth for the \(j\)th variable is given by \[\lambda_k = \left(\frac{4}{p+2}\right)^{\frac{1}{p+4}} n^{-\frac{1}{p+4}} \, \, \widehat \sigma_j,\] where \(\widehat\sigma_j\) is the estimated standard deviation for variable \(j\). Use the Nadaraya-Watson kernel estimator to fit and predict the ozone level using time and wind. At the end, report the testing error.

c [5 Points] Variance of Multi-dimensional Kernel

In our lecture, we only introduced the one-dimensional case for density estimations. For a regression problem, the rate is essentially the same. However, when we have multivariate kernels, the rate of bias and variance will change. If we use the same \(\lambda\) from the one-dimensional case in a two-dimensional problem, would you expect the bias to increase/decrease/unchanged? Would you expect the variance to increase/decrease/unchanged? And why? Hint: for the same \(\lambda\), bias is quantified by how far your neighboring points are, and variance is quantified by how many points you are capturing within that neighborhood.

STAT 542: Homework 6

Spring 2022, by Ruoqing Zhu (rqzhu)

Due: Monday, March 10, 11:59 PM CT