Students are encouraged to work together on homework. However, sharing, copying, or providing any part of a homework solution or code is an infraction of the University’s rules on Academic Integrity. Any violation will be punished as severely as possible. Final submissions must be uploaded to compass2g. No email or hardcopy will be accepted. For late submission policy and grading rubrics, please refer to the course website.
What is expected for the submission to Gradescope

- Name your submission HWx_yourNetID.pdf, for example HW01_rqzhu.pdf. Please note that this must be a .pdf file generated from a .Rmd file; .html format cannot be accepted.
- Please note that your homework file is a PDF report instead of a messy collection of R code. This report should include:
  - A report title with your name and NetID (replace Ruoqing Zhu (rqzhu) by your name and NetID if you are using this template).
  - R code chunks visible for grading.
  - R code chunks that support your answers, together with written answers, e.g., "Answer: I fit SVM with the following choice of tuning parameters ..."
- Requirements regarding the .Rmd file: your PDF file should be rendered directly from it.

When fitting linear regressions, outliers could significantly affect the fitting results. However, manually checking and removing outliers can be tricky and time-consuming. Some regression methods address this problem by using a more robust loss function. For example, one such regression is to minimize the objective function
\[ \frac{1}{n} \sum_{i=1}^n \ell_\delta(y_i - x_i^\text{T} \boldsymbol \beta), \] where the loss function \(\ell_{\delta}\) is the Huber Loss, defined as \[ \ell_\delta( a ) = \begin{cases} \frac{1}{2} a^2 & \quad \text{if } |a| \leq \delta \\ \delta(|a| - \frac{1}{2} \delta) & \quad \text{o.w.} \end{cases} \] Here is a visualization that compares the Huber loss with the \(\ell_2\) loss. We can see that the Huber loss assigns a much smaller value when the residual \(y_i - x_i^\text{T} \boldsymbol \beta\) is extreme (i.e., for outliers).
# define the Huber loss function
Huber <- function(a, delta = 1) ifelse(abs(a) <= delta, 0.5*a^2, delta*( abs(a) - 0.5*delta))
# plot against L2
x = seq(-4, 4, 0.01)
plot(x, Huber(x), type = "l",
     xlab = "a", ylab = "Huber Loss",
     col = "darkorange", ylim = c(0, 8))
lines(x, 0.5*x^2, col = "deepskyblue")
legend(x = 0, y = 8, legend = c("Huber Loss", "OLS loss"),
       col = c("darkorange", "deepskyblue"), lty = 1)
Use the following code to generate the data, create an outlier, and fit an OLS model. Comment on how the outlier affects the estimated slope.
# generate data from a simple linear model
set.seed(542)
n = 150
x = runif(n)
X = cbind(1, x)
y = X %*% c(0.5, 1) + rnorm(n)
# create an outlier
y[which.min(X[, 2])] = -30
lm.fit = lm(y ~ X - 1)
lm.fit
##
## Call:
## lm(formula = y ~ X - 1)
##
## Coefficients:
##       X      Xx  
## -0.1254  1.9037  
Answer:
We should expect the outlier to pull the slope up, since the outlier lies well below the regression line and sits at the far left end of the \(X\) range. To compensate for this large loss, the regression line has to tilt down on the left, which increases the slope.
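As an optional illustration (not part of the original solution), one could refit the OLS model after removing the outlier and compare the two slopes; the objects outlier_id and lm.clean below are new helpers introduced only for this check.
# optional illustration: compare the OLS slope with and without the outlier
outlier_id = which.min(X[, 2])
lm.clean = lm(y[-outlier_id] ~ X[-outlier_id, ] - 1)
rbind("with outlier"    = unname(coef(lm.fit)),
      "without outlier" = unname(coef(lm.clean)))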
Define a function huberLoss(b, trainX, trainY) that calculates the empirical Huber loss for a given set of observed data with tuning parameter \(\delta = 1\). Here, b is a \(p\)-dimensional parameter vector, trainX is an \(n \times p\) design matrix, and trainY is the outcome vector. This function should return a scalar as the empirical loss. You can use our Huber function in your own code. After defining this loss function, use the optim() function to solve for the parameter estimates. Finally, report your coefficients.

- Use b = (0, 0) as the initial value.
- Use BFGS as the optimization method.

Answer:
huberLoss <- function(b, trainx, trainy) mean(Huber(trainy - trainx %*% b, delta = 1))
huberResult <- optim(par = c(0, 0), fn = huberLoss,
                     method = "BFGS", trainx = X, trainy = y)
huberResult$par
## [1] 0.7545940 0.6223551
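As an optional sanity check (not required by the question), the Huber objective at the optim() solution should be no larger than its value at the OLS coefficients:
# optional sanity check: the Huber objective should be lower (or equal)
# at the optim() solution than at the OLS coefficients
c(huber_at_optim = huberLoss(huberResult$par, X, y),
  huber_at_ols   = huberLoss(coef(lm.fit), X, y))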
[20 pts] We still do not know which method performs better in this case. Let's use a simulation study to compare the two methods. Complete the following simulation: repeatedly generate the data (keeping the outlier created by y[which.min(X[, 2])] = -30), fit both models, and compare the estimated slopes.

Answer:
We set up the simulation:
set.seed(1)
N = 1000
olsSlope = rep(0, N)
huberSlope = rep(0, N)

for (i in 1:N){
  # generate data with outliers
  n = 150
  X = cbind(1, runif(n))
  y = X %*% c(0.5, 1) + rnorm(n)
  y[which.min(X[, 2])] = -30
  
  # fit both models and record the estimated slope
  lmModel = lm(y ~ X - 1)
  olsSlope[i] = lmModel$coefficients[2]
  huberResult <- optim(par = c(0, 0), fn = huberLoss,
                       method = "BFGS", trainx = X, trainy = y)
  huberSlope[i] = huberResult$par[2]
}

plotData = data.frame("OLS" = olsSlope,
                      "Huber" = huberSlope)
boxplot(plotData)
olsBias = mean(olsSlope) - 1
huberBias = mean(huberSlope) - 1
sprintf("Bias of the slope estimator. OLS: %.3f, Huber: %.3f",
        olsBias, huberBias)
## [1] "Bias of the slope estimator. OLS: 1.241, Huber: 0.081"
The bias is shown above. The linear model with squared-error loss is biased upwards, which matches our earlier expectation. However, the model with the Huber loss has very little bias. The Huber loss becomes linear instead of quadratic for observations with large deviations, so the contribution from this single outlier does not have a significant impact on the estimation.
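As an optional extension (not required), we could also compare the Monte Carlo variability of the two slope estimators using the simulation results already stored above:
# optional extension: Monte Carlo standard deviations of the two slope estimators
sprintf("SD of the slope estimator. OLS: %.3f, Huber: %.3f",
        sd(olsSlope), sd(huberSlope))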
Scaling issue

In practice, we usually standardize each covariate/feature to mean 0 and standard deviation 1. Standardization is essential when we apply an \(\ell_2\) or \(\ell_1\) penalty to the loss function, because if the covariates are on different scales, they are penalized differently. Without prior information, we should prevent that from happening. Besides, scaling the data also helps to make the optimization more stable, since the step size in many descent algorithms can be affected by the scale.
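A minimal sketch of this standardization in R; the matrix X0 below is a hypothetical design matrix created only for illustration, not one of the objects used later in this homework.
# minimal sketch of column-wise standardization (X0 is a hypothetical numeric matrix)
X0 = matrix(rnorm(50 * 3), 50, 3) %*% diag(c(1, 10, 100))  # columns on very different scales
X0_std = scale(X0, center = TRUE, scale = TRUE)
round(colMeans(X0_std), 10)   # approximately 0
apply(X0_std, 2, sd)          # exactly 1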
In practice, after obtaining the coefficients fitted with scaled data, we want to recover the original coefficients of the unscaled data. For this question, we use the following intuition:
\[\begin{align} \frac{Y - \bar{Y}}{\text{sd}_y} =&~ \sum_{j=1}^p \frac{X_j - \bar{X}_j}{\text{sd}_j} \gamma_j \\ Y =&~ \underbrace{\bar{Y} - \sum_{j=1}^p \bar{X}_j \frac{\text{sd}_y \cdot \gamma_j}{\text{sd}_j}}_{\beta_0} + \sum_{j=1}^p X_j \underbrace{\frac{\text{sd}_y \cdot \gamma_j}{\text{sd}_j}}_{\beta_j}, \end{align}\]
Based on this relationship, we can fit a linear regression with the scaled data and then recover the coefficients on the original scale.
Use the following code to generate your data:
library(MASS)
set.seed(10)
n = 20
p = 3
# covariance matrix
V = matrix(0.3, p, p)
diag(V) = 1
# generate data
X_org = as.matrix(mvrnorm(n, mu = rep(0, p), Sigma = V))
true_b = c(1, 2, 0)
y_org = X_org %*% true_b + rnorm(n)
Fit an OLS model with the original data y_org and X_org using lm(). Also, fit another OLS with the scaled data using lm(). Report the coefficients/parameters. Then, transform the coefficients from the second approach back to the original scale, and match them with the first approach. Summarize your results in a single table: the rows should contain three methods (OLS, OLS Scaled, and OLS Recovered), and there should be four columns representing the coefficients for each method. You can consider using the kable function, but it is not required.

Answer:
We follow the given formula to fit the scaled model and recover the coefficients on the original scale.
X = scale(X_org)*sqrt(n/(n-1))
y = scale(y_org)*sqrt(n/(n-1))
# fit with original data
lmOrg = lm(y_org ~ X_org)
coefOrg = lmOrg$coefficients
# fit with scaled data (no intercept term)
lmScale = lm(y ~ X - 1)
coefScale = lmScale$coefficients
# sqrt(n/(n-1)) is cancelled for sdX / sdY
sdX = apply(X_org, 2, sd)
sdY = sd(y_org)
meanX = apply(X_org, 2, mean)
meanY = mean(y_org)
betaRecover = coefScale * sdY / sdX
beta0Recover = meanY - sum(meanX * coefScale * sdY /sdX)
# print a table
CoefTab = rbind(coefOrg, c(0, coefScale), c(beta0Recover, betaRecover))
rownames(CoefTab) = c("OLS_org", "OLS_scale", "OLS_recover")
colnames(CoefTab) = paste("coef", 0:p, sep = "_")
knitr::kable(CoefTab, digits = 3)
|             | coef_0 | coef_1 | coef_2 | coef_3 |
|-------------|--------|--------|--------|--------|
| OLS_org     | -0.517 |  0.659 |  2.183 |  0.250 |
| OLS_scale   |  0.000 |  0.231 |  0.811 |  0.068 |
| OLS_recover | -0.517 |  0.659 |  2.183 |  0.250 |
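As an optional programmatic check, the recovered coefficients should agree with the direct OLS fit up to numerical error:
# optional check: recovered coefficients should match the direct OLS fit
all.equal(unname(coefOrg), unname(c(beta0Recover, betaRecover)))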
Instead of using the lm() function, write your own coordinate descent code to solve the scaled problem. This function will be modified and used next week when we code the Lasso method. Complete the following steps:
[10 pts] i) Given the loss function \(L(\beta) = \| y - X\beta\|^2\), or \(\sum_{i=1}^n (y_i - \sum_{j=0}^p x_{ij} \beta_j)^2\), derive the updating formula for the coefficient \(\beta_j\) when holding all other covariates fixed. You must use LaTeX to typeset your derivation, with a proper explanation of the notation. Write down the formula (in terms of \(y\), \(x\), and the \(\beta\)'s) for the residual \(r\) before and after this update. Based on our lecture, how can we make the update of \(r\) computationally efficient?
[30 pts] ii) Implement this coordinate descent method with your own code to solve OLS with the scaled data. Print and report your scaled coefficients (no need to recover the original version) and compare with the result from the previous question.
Stop the algorithm when the change of the parameter vector between two consecutive iterations satisfies \(\| \boldsymbol\beta^{(k+1)} - \boldsymbol\beta^{(k)} \|_1 <\) tol = 1e-7, where \(\| \cdot \|_1\) is the L1 distance.

[5 pts] iii) Make a plot to analyze the convergence of the coordinate descent. On the x-axis, use the number of iterations. On the y-axis, use \(\log(\text{Loss}_k - \text{trueLoss})\). Here, \(\text{trueLoss}\) is the empirical loss at the true optimizer, for which we can simply use the solution from the lm() function (the scaled version). \(\text{Loss}_k\) is the loss function at the beginning of the \(k\)-th iteration (keep in mind that within each iteration, we loop over all \(\beta_j\)). If this plot displays a straight line, then we say that this algorithm has a linear convergence rate. Of course, this is at the log scale.
Answer:
Denote the scalar \(r_i = y_i - \sum_{k \neq j} x_{ik} \beta_k\), or in vector form \(r = y - X_{(-j)} \boldsymbol\beta_{(-j)}\), as the partial residual before updating \(\beta_j\).
Setting the partial derivative of \(L\) with respect to \(\beta_j\) to zero (dropping the constant factor \(-2\)), \[ \frac{\partial L}{ \partial \beta_{j}} \propto \sum_{i=1}^n x_{ij} \Big(y_i - \sum_{k=0}^p x_{ik} \beta_k\Big) = \sum_{i=1}^n x_{ij} (r_{i} - x_{ij} \beta_j) = 0. \] Solving this equation, we have \[ \beta_{j} = \frac{\sum_{i=1}^n x_{ij} r_i }{\sum_{i=1}^n x_{ij}^2} = \frac{x_j^\text{T} r}{x_j^\text{T} x_j}. \] At the \(k\)-th iteration, we have the following update \[ \beta_{j}^{(k+1)} = \frac{x_j^\text{T} r}{x_j^\text{T} x_j}, \qquad r \leftarrow r - x_j \beta_{j}^{(k+1)} + x_{j+1} \beta_{j+1}^{(k)}, \] where \(x_j\) is the \(j\)-th column of \(X\), and the index wraps around so that \(j+1\) is taken as \(1\) when \(j = p\).
An alternative way to organize the residual update, which keeps it \(O(n)\) and matches the code below, is to add back the old contribution of \(x_j\), recompute \(\beta_j\), and then remove the new contribution: \[ r \leftarrow r + x_j \beta_{j}^{(k)}, \qquad \beta_{j}^{(k+1)} = \frac{x_j^\text{T} r}{x_j^\text{T} x_j}, \qquad r \leftarrow r - x_j \beta_{j}^{(k+1)}. \]
LS_CD <- function(X, y, tol = 1e-7, maxitr = 100){
  n = nrow(X)
  p = ncol(X)
  loss = rep(0, maxitr)
  beta_old = rep(0, p)
  beta_new = beta_old
  XcolNorm = colSums(X^2)
  
  # the grand loop
  for (k in 1:maxitr){
    beta_old = beta_new
    r = y - X %*% beta_old
    loss[k] = mean(r^2)
    
    # update each beta_j
    for (j in 1:p){
      # add back the old contribution of x_j, update beta_j, remove the new contribution
      r = r + X[, j] * beta_new[j]
      beta_new[j] <- X[, j] %*% r / XcolNorm[j]
      r = r - X[, j] * beta_new[j]
    }
    
    # check the change of parameters
    if (sum(abs(beta_new - beta_old)) < tol) break
  }
  
  return(list(coefs = beta_new, loss = loss[1:k]))
}
# fit the model
myfit = LS_CD(X, y)
coefs = matrix(myfit$coefs, nrow = 1)
rownames(coefs) = c("OLS_scale_CD")
colnames(coefs) = paste("coef", 1:p, sep = "_")
knitr::kable(coefs, digits = 3)
|              | coef_1 | coef_2 | coef_3 |
|--------------|--------|--------|--------|
| OLS_scale_CD |  0.231 |  0.811 |  0.068 |
The coefficients are 0.231, 0.811, and 0.068, which matches the lm() results on the scaled data from the previous question. Note that we only perform an \(O(n)\) update on the residual, instead of paying the \(O(np)\) cost of recomputing \(r = y - X_{(-j)} \beta_{(-j)}\) from scratch each time we update a \(\beta_j\).
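An optional numerical comparison with the lm() fit on the scaled data:
# optional check: coordinate descent vs. lm() on the scaled data
max(abs(myfit$coefs - unname(coefScale)))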
plot(1:length(myfit$loss), log(myfit$loss - mean(lmScale$residuals^2)),
     pch = 19, col = "red",
     xlab = "iter", ylab = "log(MSE diff)")