Instructions

Students are encouraged to work together on homework. However, sharing, copying, or providing any part of a homework solution or code is an infraction of the University’s rules on Academic Integrity. Any violation will be punished as severely as possible. Final submissions must be uploaded to compass2g. No email or hardcopy will be accepted. For late submission policy and grading rubrics, please refer to the course website.

Question 1 [35 Points] Regression and Optimization with Huber Loss

When fitting linear regressions, outliers can significantly affect the fitted results. However, manually checking and removing outliers can be tricky and time-consuming. Some regression methods address this problem by using a more robust loss function. For example, one such regression minimizes the objective function

\[ \frac{1}{n} \sum_{i=1}^n \ell_\delta(y_i - x_i^\text{T} \boldsymbol \beta), \] where the loss function \(\ell_{\delta}\) is the Huber loss, defined as \[ \ell_\delta( a ) = \begin{cases} \frac{1}{2} a^2 & \quad \text{if } |a| \leq \delta \\ \delta(|a| - \frac{1}{2} \delta) & \quad \text{otherwise.} \end{cases} \] Here is a visualization that compares the Huber loss with the \(\ell_2\) loss. We can see that the Huber loss assigns a much smaller loss when \(y_i - x_i^\text{T} \boldsymbol \beta\) is extreme (i.e., for outliers).

  # define the Huber loss function
  Huber <- function(a, delta = 1) ifelse(abs(a) <= delta, 0.5*a^2, delta*( abs(a) - 0.5*delta))
  
  # plot against L2
  x = seq(-4, 4, 0.01)
  plot(x, Huber(x), type = "l",
       xlab = "a", ylab = "Huber Loss",
       col = "darkorange", ylim = c(0, 8))
  lines(x, 0.5*x^2, col = "deepskyblue")
  legend(x = 0, y = 8, legend = c("Huber Loss", "OLS loss"), 
         col = c("darkorange", "deepskyblue"), lty = 1)

Use the following code to generate the data:

  # generate data from a simple linear model 
  set.seed(542)
  n = 150
  x = runif(n)
  X = cbind(1, x)
  y = X %*% c(0.5, 1) + rnorm(n)
  
  # create an outlier
  y[which.min(X[, 2])] = -30

  a. [5 pts] Fit an OLS model with the regular \(\ell_2\) loss. Report your coefficients (do not report other information). Although this is only one set of samples, do you expect this estimator to be biased, given how we set up the observed data? Do you expect the parameter \(\beta_1\) to be biased upwards or downwards? Explain your reasoning. Hint: is the outlier pulling the regression line slope up or down?

  lm.fit = lm(y ~ X - 1)
  lm.fit
## 
## Call:
## lm(formula = y ~ X - 1)
## 
## Coefficients:
##       X       Xx  
## -0.1254   1.9037

Answer:

We should expect the outlier to pull the slope up, since the outlier is well below the regression line and sits at the far left end of the \(X\) range. To compensate for this large loss, the regression line has to tilt downwards on the left, which increases the slope.
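
As a quick optional check (a minimal sketch reusing the X, y, and lm.fit objects created above; the exact numbers depend on the seed), we can refit OLS after dropping the outlier and compare the two slopes:

  # compare the OLS slope with and without the single outlier
  out = which.min(X[, 2])
  coef(lm.fit)[2]                        # slope using all points (with the outlier)
  coef(lm(y[-out] ~ X[-out, ] - 1))[2]   # slope after dropping the outlier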

  b. [10 pts] Define your own Huber loss function huberLoss(b, trainX, trainY) given a set of observed data, with tuning parameter \(\delta = 1\). Here, b is a \(p\)-dimensional parameter vector, trainX is an \(n \times p\) design matrix, and trainY is the outcome vector. This function should return a scalar as the empirical loss. You can use our Huber function in your own code. After defining this loss function, use the optim() function to solve for the parameter estimates. Finally, report your coefficients.
    • Use b = (0, 0) as the initial value.
    • Use BFGS as the optimization method.

Answer:

  huberLoss <- function(b, trainx, trainy) mean(Huber(trainy - trainx %*% b, delta = 1))
  huberResult <- optim(par = c(0, 0), fn = huberLoss, 
                       method = "BFGS", trainx = X, trainy = y)
  huberResult$par
## [1] 0.7545940 0.6223551
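
As an optional sanity check (a small sketch using the huberLoss, huberResult, and lm.fit objects defined above), the optim() solution should achieve a lower Huber-loss value than the OLS coefficients on the same data:

  # Huber-loss objective evaluated at the two sets of coefficients
  huberLoss(huberResult$par, X, y)   # optim() solution
  huberLoss(coef(lm.fit), X, y)      # OLS coefficients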

  c. [20 pts] We still do not know which method performs better in this case. Let’s use a simulation study to compare the two methods. Complete the following:

    • Set up a simulation with 1000 repetitions. In each repetition, randomly generate a set of observed data, and force the outlier with our code y[which.min(X[, 2])] = -30.
    • Fit the regression model with the \(\ell_2\) loss and with the Huber loss, and record the slope estimates from both.
    • Make a side-by-side boxplot to show how the two methods differ in their estimates. Which method seems to have more bias? Report the amount of bias based on your simulation. What can you conclude from the results? Does this match your expectation in part a)? Can you explain this (for both OLS and Huber) using the form of the loss function, in terms of what their effects are?

Answer:

We set up the simulation:

  set.seed(1)
  N = 1000
  olsSlope = rep(0, N)
  huberSlope = rep(0, N)
  
  for (i in 1:N){
    # generate data with outliers
    n = 150
    X = cbind(1, runif(n))
    y = X %*% c(0.5, 1) + rnorm(n)
    y[which.min(X[, 2])] = -30
    
    # fit both models and record the estimated slope
    lmModel = lm(y ~ X - 1)
    olsSlope[i] = lmModel$coefficients[2]
    
    huberResult <- optim(par = c(0, 0), fn = huberLoss, 
                         method = "BFGS", trainx = X, trainy = y)
    huberSlope[i] = huberResult$par[2]
  }
  plotData = data.frame("OLS" = olsSlope, 
                        "Huber" = huberSlope)
  
  boxplot(plotData)

  olsBias = mean(olsSlope) - 1
  huberBias = mean(huberSlope) - 1
  sprintf("Bias of the intercept term estimator. OLS: %.3f, Huber: %.3f", 
          olsBias, huberBias)
## [1] "Bias of the intercept term estimator. OLS: 1.241, Huber: 0.081"

The bias of each estimator is shown above. The linear model with the squared-error loss is biased upwards, which matches our expectation in part a). In contrast, the model with the Huber loss has very little bias. The Huber loss becomes linear instead of quadratic for samples with large deviations, so the contribution from this single outlier does not have a significant impact on the estimation.
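
To make this concrete (a minimal sketch; these derivatives follow directly from the two loss definitions above), we can compare how strongly a single residual of size \(-30\) pulls on the fit under each loss: the squared-error gradient grows linearly with the residual, while the Huber gradient is capped at \(\pm\delta\).

  # derivative of each loss with respect to the residual a:
  #   squared error (0.5 * a^2): a
  #   Huber (delta = 1):         a if |a| <= delta, else delta * sign(a)
  a = -30
  c(squared = abs(a), huber = min(abs(a), 1))   # gradient magnitudes: 30 vs. 1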

Question 2 [65 Points] Scaling and Coordinate Descent for Linear Regression

Scaling issue: In practice, we usually standardize each covariate/feature to mean 0 and standard deviation 1. Standardization is essential when we apply an \(\ell_2\) or \(\ell_1\) penalty on the loss function, because if the covariates are on different scales, they are penalized differently. Without prior information, we should prevent that from happening. Besides, scaling the data also helps make the optimization more stable, since the step size in many descent algorithms can be affected by the scale.

In practice, after obtaining the coefficients fitted on the scaled data, we want to recover the coefficients for the original, unscaled data. For this question, we use the following intuition:

\[\begin{align} \frac{Y - \bar{Y}}{\text{sd}_y} =&~ \sum_{j=1}^p \frac{X_j - \bar{X}_j}{\text{sd}_j} \gamma_j \\ Y =&~ \underbrace{\bar{Y} - \sum_{j=1}^p \bar{X}_j \frac{\text{sd}_y \cdot \gamma_j}{\text{sd}_j}}_{\beta_0} + \sum_{j=1}^p X_j \underbrace{\frac{\text{sd}_y \cdot \gamma_j}{\text{sd}_j}}_{\beta_j}, \end{align}\]

Based on this relationship, when fitting a linear regression we can fit the model on the scaled data (without an intercept) to obtain \(\gamma\), and then transform \(\gamma\) back to the coefficients \(\beta\) of the original data using the formulas above.

Use the following code to generate your data:

  library(MASS)
  set.seed(10)
  n = 20
  p = 3
  
  # covariance matrix
  V = matrix(0.3, p, p)
  diag(V) = 1
  
  # generate data
  X_org = as.matrix(mvrnorm(n, mu = rep(0, p), Sigma = V))
  true_b = c(1, 2, 0)
  y_org = X_org %*% true_b + rnorm(n)

  a. [10 pts] Fit an OLS estimator with the original data y_org and X_org using lm(). Also, fit another OLS on the scaled data with lm(). Report the coefficients/parameters. Then, transform the coefficients from the second approach back to the original scale, and match them with the first approach. Summarize your results in a single table: the rows should contain three methods, OLS, OLS Scaled, and OLS recovered, and there should be four columns that represent the coefficients for each method. You can consider using the kable function, but it is not required.

Answer:

We follow the formulas given above for the scaled method.

  X = scale(X_org)*sqrt(n/(n-1))
  y = scale(y_org)*sqrt(n/(n-1))
  
  # fit with original data
  lmOrg = lm(y_org ~ X_org)
  coefOrg = lmOrg$coefficients
  
  # fit with scaled data (no intercept term)
  lmScale = lm(y ~ X - 1)
  coefScale = lmScale$coefficients
  
  # the sqrt(n/(n-1)) factor cancels in the ratio sdY / sdX, so sample sd's can be used
  sdX = apply(X_org, 2, sd)
  sdY = sd(y_org)
  meanX = apply(X_org, 2, mean)
  meanY = mean(y_org)
  
  betaRecover = coefScale * sdY / sdX
  beta0Recover = meanY - sum(meanX * coefScale * sdY /sdX)
  
  # print a table
  CoefTab = rbind(coefOrg, c(0, coefScale), c(beta0Recover, betaRecover))
  rownames(CoefTab) = c("OLS_org", "OLS_scale",  "OLS_recover")
  colnames(CoefTab) = paste("coef", 0:p, sep = "_")
  knitr::kable(CoefTab, digits = 3)
              coef_0   coef_1   coef_2   coef_3
OLS_org       -0.517    0.659    2.183    0.250
OLS_scale      0.000    0.231    0.811    0.068
OLS_recover   -0.517    0.659    2.183    0.250
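
As an optional numerical check (a small sketch using the objects created above), the recovered coefficients should match the original-scale OLS fit up to floating-point error, so the call below should return TRUE:

  # recovered coefficients vs. the fit on the original data
  all.equal(as.numeric(coefOrg), as.numeric(c(beta0Recover, betaRecover)))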

  b. Instead of using the lm() function, write your own coordinate descent code to solve the scaled problem. This function will be modified and used next week when we code the Lasso method. Complete the following steps:

    • [10 pts] i) Given the loss function \(L(\beta) = \| y - X\beta\|^2\), or \(\sum_{i=1}^n (y_i - \sum_{j=0}^p x_{ij} \beta_j)^2\), derive the updating/calculation formula for the coefficient \(\beta_j\) when holding all other covariates fixed. You must use LaTeX to typeset your derivation, with proper explanation of the notation. Write down the formula (in terms of \(y\), \(x\) and the \(\beta\)’s) for the residual \(r\) before and after this update. Based on our lecture, how can we make the update of \(r\) computationally efficient?

    • [30 pts] ii) Implement this coordinate descent method with your own code to solve OLS with the scaled data. Print and report your scaled coefficients (no need to recover the original version) and compare with the result from the previous question.

      • Do not use functions from any additional library.
      • Start with a vector \(\boldsymbol \beta = 0\).
      • Run your coordinate descent algorithm for a maximum of maxitr = 100 iterations (each iteration loops through all variables). However, stop your algorithm if the \(\beta\) value at the current iteration is sufficiently similar to the previous one, i.e., \(\|\beta^{(k)} - \beta^{(k-1)}\|_1 \leq\) tol. Set tol = 1e-7, where \(\| \cdot \|_1\) is the \(\ell_1\) distance.
    • [5 pts] iii) Make a plot to analyze the convergence of the coordinate descent. On the x-axis, use the iteration number. On the y-axis, use \(\log(\text{Loss}_k - \text{trueLoss})\). Here, \(\text{trueLoss}\) is the empirical loss based on the true optimizer, for which we can simply use the solution from the lm() function (the scaled version). \(\text{Loss}_k\) is the loss function at the beginning of the \(k\)-th iteration (keep in mind that within each iteration, we loop over all \(\beta_j\)). If this plot displays a straight line, then we say that the algorithm has a linear convergence rate. Of course, this is on the log scale.

Answer:

Denote the scalar \(r_{i} = y_i - \sum_{k \neq j} x_{ik} \beta_k\), or in vector form \(r = y - X_{(-j)} \beta_{(-j)}\), as the partial residual before/while updating \(\beta_j\), where \(X_{(-j)}\) and \(\beta_{(-j)}\) exclude the \(j\)-th column and coordinate, respectively.

\[ \frac{\partial L}{ \partial \beta_{j}} = -2 \sum_{i=1}^n x_{ij} \Big(y_i - \sum_{k=0}^p x_{ik} \beta_k\Big) = -2 \sum_{i=1}^n x_{ij} (r_{i} - x_{ij} \beta_j) = 0. \] Solving this equation, we have \[ \beta_{j} = \frac{\sum_{i=1}^n x_{ij} r_i }{\sum_{i=1}^n x_{ij}^2} = \frac{x_j^T r}{x_j^T x_j}. \] At the \(k\)-th iteration, we have the following update: \[ \beta_{j}^{(k+1)} = \frac{x_j^T r}{x_j^T x_j}, \qquad r \leftarrow r - x_j \beta_{j}^{(k+1)} + x_{j+1} \beta_{j+1}^{(k)}, \] where the residual update prepares \(r\) for the next coordinate, and \(j+1\) is taken to be the first coordinate when \(j = p\).

An alternative update if we want to make this fast: \[ r \leftarrow r + x_j \beta_{j}^{(k)}, \qquad \beta_{j}^{(k+1)} = \frac{x_j^T r}{x_j^T x_j}, \qquad r \leftarrow r - x_j \beta_{j}^{(k+1)}. \]

LS_CD <- function(X, y, tol = 1e-7, maxitr = 100){
  n = nrow(X)
  p = ncol(X)
  loss = rep(0, maxitr)
  
  beta_old = rep(0, p)
  beta_new = beta_old
  XcolNorm = colSums(X^2)
  
  # the grand loop 
  for (k in 1:maxitr){
    beta_old = beta_new
    r = y - X %*% beta_old
    loss[k] = mean(r^2)
    
    # update each beta_j 
    for (j in 1:p){
      # update r and beta
      r = r + X[, j] * beta_new[j]
      beta_new[j] <- X[, j] %*% r / XcolNorm[j]
      r = r - X[, j] * beta_new[j]
    }
    # check the change of parameters
    if (sum(abs(beta_new - beta_old)) < tol) break;
  }
  return(list(coefs = beta_new, loss = loss[1:k]))
}
# fit the model
myfit = LS_CD(X, y)
coefs = matrix(myfit$coefs, nrow = 1)
rownames(coefs) = c("OLS_scale_CD")
colnames(coefs) = paste("coef", 1:p, sep = "_")
knitr::kable(coefs, digits = 3)
               coef_1   coef_2   coef_3
OLS_scale_CD    0.231    0.811    0.068

The coefficients are 0.231, 0.811, 0.068, which matches our result in part a). Note that we only perform an \(O(n)\) update on the residual, which reduces the \(O(np)\) cost of recomputing \(r = y - X_{(-j)} \beta_{(-j)}\) from scratch each time we update a \(\beta_j\).

plot(1:length(myfit$loss), log(myfit$loss - mean(lmScale$residuals^2)), 
     pch = 19, col = "red",
     xlab = "iter", ylab = "log( MSE diff)")