Students are encouraged to work together on homework. However, sharing, copying, or providing any part of a homework solution or code is an infraction of the University’s rules on Academic Integrity. Any violation will be punished as severely as possible. Final submissions must be uploaded to Gradescope. No email or hardcopy will be accepted. For late submission policy and grading rubrics, please refer to the course website.
What is expected for the submission to Gradescope
Your submission should be a single file named HWx_yourNetID.pdf. For example, HW01_rqzhu.pdf. Please note that this must be a .pdf file generated by a .Rmd file; .html format cannot be accepted.
Please note that your homework file is a PDF report instead of a messy collection of R codes. This report should include:
- Your own name and NetID (replace Ruoqing Zhu (rqzhu) by your name and NetID if you are using this template).
- R code chunks visible for grading.
- R code chunks that support your answers.
- Clearly stated answers to the questions, for example: Answer: I fit SVM with the following choice of tuning parameters ...
Requirements regarding the .Rmd file: you do not need to submit your .Rmd file. However, your PDF file should be rendered directly from it.
You will receive a 100% score for HW1 as long as you submit a finished version before the deadline. The goal of HW1 is to check your prerequisite knowledge. All of it should already be covered in courses such as Stat 410, Stat 425, and other basic mathematics courses. These concepts and skills will be used extensively in our course.
In addition, this HW also checks your basic programming knowledge, such as writing a function, setting a random seed, and typing LaTeX. Please note that in all future homework you must type all formulas in LaTeX; failing to do so will lead to a penalty.
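For example, the inline formula \(\bar{X}_n = \frac{1}{n} \sum_{i=1}^n X_i\) is typed in the .Rmd source as the following LaTeX (a minimal illustration):
$\bar{X}_n = \frac{1}{n} \sum_{i=1}^n X_i$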
Calculate the derivative of \(f(x)\)
Answer:
\(f'(x) = e^x\).
\(f'(x) = \frac{1}{1+x}\).
\(f'(x) = \frac{d \log(1 + e^x)}{d(1+e^x)} \frac{d(1+e^x)}{dx} = \frac{1}{1+e^x} e^x = \frac{e^x}{1+e^x}\).
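As a quick sanity check, R's symbolic differentiation function D() reproduces the last result (a minimal sketch, not part of the required answer):
D(expression(log(1 + exp(x))), "x")
## exp(x)/(1 + exp(x))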
Answer:
\[ f(x) = f(1) + f'(1) (x-1) + \frac{f''(1)}{2} (x-1)^2 + o((x-1)^2), \] where \(o((x-1)^2)\) denotes a term of lower order than \((x-1)^2\), i.e., \(o((x-1)^2) / (x-1)^2 \to 0\) as \(x \to 1\).
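To illustrate, we can compare a function with its second-order expansion near \(x = 1\); the choice \(f(x) = \log(x)\) below is a hypothetical example (so \(f(1) = 0\), \(f'(1) = 1\), \(f''(1) = -1\)):
f = function(x) log(x)
taylor2 = function(x) (x - 1) - (x - 1)^2 / 2   # second-order expansion at x = 1
x = 1.1
c(f(x), taylor2(x))   # the two values nearly agree close to x = 1
## [1] 0.09531018 0.09500000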
Answer:
\(\alpha > 1\).
Answer:
We provide one form of eigen-decomposition. This decomposition is not unique. \[A = U \Lambda U^T = \sum_{i = 1}^{n} \lambda_i u_i u_i^T.\]
Here \(U = (u_1, ..., u_n)\) and \(\Lambda\) is a diagonal matrix with \(\Lambda_{ii} = \lambda_i\). Since \(u_i^T u_j = 1\) if \(i = j\) and \(u_i^T u_j = 0\) if \(i \neq j\), we have \(U^T U = I\).
\[ A^{-1} = U \Lambda^{-1} U^T = U \Lambda^{-1/2} \Lambda^{-1/2} U^T = (U \Lambda^{-1/2} U^T) (U \Lambda^{-1/2} U^T). \] Hence, \(A^{-1/2} = U \Lambda^{-1/2} U^T\), where \(\Lambda^{-1/2}_{ii} = 1 / \sqrt{\lambda_i}\). Note that this requires all eigenvalues \(\lambda_i\) to be positive.
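In R, this construction can be checked numerically with eigen(); the matrix A below is a hypothetical example, not part of the question:
A = matrix(c(2, 1, 1, 2), 2, 2)   # a symmetric positive definite example
eig = eigen(A)
A_inv_sqrt = eig$vectors %*% diag(1 / sqrt(eig$values)) %*% t(eig$vectors)
round(A_inv_sqrt %*% A_inv_sqrt %*% A, 10)   # A^{-1/2} A^{-1/2} A should be the identity
##      [,1] [,2]
## [1,]    1    0
## [2,]    0    1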
Answer:
For any nonzero vector \(x \in \mathbb{R}^n\), \(x^T A x > 0\). Or equivalently, for a symmetric \(A\), all eigenvalues of \(A\) are positive.
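Numerically, one can check this condition in R by inspecting the eigenvalues; the matrix A below is again a hypothetical example:
A = matrix(c(2, 1, 1, 2), 2, 2)   # hypothetical symmetric matrix
all(eigen(A)$values > 0)          # TRUE exactly when a symmetric A is positive definite
## [1] TRUE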
True/False. If you claim a statement is false, explain why. For two real matrices \(A_{m \times n}\) and \(B_{n \times m}\):
Answer:
False. \(\text{Rank}(A) \leq \min\{m, n\}\), because \(A\) might not have full column/row rank.
True. (This is the definition of matrix trace.)
True.
False. \(v_1^T v_2 =0\) always holds.
True. This is allowed unless we perform some orthogonalization procedure.
False. \(AA\) is not defined if \(m \neq n\).
\(X_1\), \(X_2\), \(\ldots\), \(X_n\) are i.i.d. \({\cal N}(\mu, \sigma^2)\) random variables, where \(\mu \in \mathbb{R}\) and \(\sigma > 0\) is finite. Let \(\bar{X}_n = \frac{1}{n} \sum_{i=1}^n X_i\).
Answer:
\(\hat \theta\) is an unbiased estimator of \(\theta\) if \(E(\hat \theta) = \theta\). Yes, \(E(\bar X_n) = E(X_1) = \mu\).
\(E[(\bar{X}_n)^2] = Var(\bar{X}_n) + E^2(\bar{X}_n) = \sigma^2/n + \mu^2\).
\(\frac{1}{n-1} \sum_{i=1}^n (X_i - \bar{X}_n)^2\). Note that the denominator is \(n-1\) instead of \(n\), which makes this estimator unbiased for \(\sigma^2\).
\(\hat \theta\) is a consistent estimator of \(\theta\) if \(\hat \theta \xrightarrow{P} \theta\) as \(n \to \infty\), where \(\xrightarrow{P}\) means convergence in probability. A sufficient condition is \(E(|\hat \theta - \theta|^2) \to 0\) as \(n \to \infty\). \(\bar{X}_n\) is a consistent estimator of \(\mu\), because \[ E(|\bar{X}_n - \mu|^2) = Var(\bar{X}_n) + E^2(\bar{X}_n - \mu) = \sigma^2 / n + 0 \xrightarrow{n \to \infty} 0. \] We have already shown in a) that \(E(\bar{X}_n - \mu) = 0\).
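As a quick numerical illustration of a) and b), a sketch using the hypothetical values \(\mu = 1\), \(\sigma = 2\), and \(n = 10\):
set.seed(1)
mu = 1; sigma = 2; n = 10
xbar = replicate(1e5, mean(rnorm(n, mu, sigma)))   # 100,000 simulated sample means
c(mean(xbar), mu)                                  # should be close to mu
c(mean(xbar^2), sigma^2 / n + mu^2)                # should be close to sigma^2/n + mu^2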
Suppose \(X_{p \times 1}\) is a vector of covariates, \(\beta_{p \times 1}\) is a vector of unknown parameters, \(\epsilon\) is the unobserved random noise and we assume the linear model relationship \(y = X^T \beta + \epsilon\). Suppose we have \(n\) i.i.d. samples from this linear model, and the observed data can be written using the matrix form: \(\mathbf{y}_{n \times 1} = \mathbf{X}_{n\times p} \beta_{p \times 1} + \boldsymbol \epsilon_{n \times 1}\).
Answer:
The loss function is the sum of squared residuals: \(L(\beta) = \| \mathbf{y} - \mathbf{X} \mathbf{\beta}\|^2 = \mathbf{\beta}^T \mathbf{X}^T \mathbf{X} \mathbf{\beta} - 2\mathbf{y}^T \mathbf{X} \mathbf{\beta} + \mathbf{y}^T\mathbf{y}\). In some literature, there may be a \(\frac{1}{2}\) or \(\frac{1}{n}\) factor in \(L(\beta)\).
\(\hat \beta = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}\). This can be derived by minimizing \(L(\beta)\) with the vector differentiation rules \(d(\alpha^T \beta)/d\beta = \alpha\) and \(d(\beta^T A \beta)/d\beta = (A + A^T) \beta\), where \(\beta\) is a vector and \(A\) is a square matrix. Setting \(\partial L / \partial \beta = 2 \mathbf{X}^T \mathbf{X} \beta - 2 \mathbf{X}^T \mathbf{y} = 0\) yields the solution.
Implement this estimation in R and compare your result with lm(). In addition, what should you do if you are asked to add an intercept term \(\beta_0\) into your estimation (even if the true \(\beta_0 = 0\) in our data generator)?
Answer:
First, we generate data for a linear regression.
set.seed(1)
n = 100; p = 5
X = matrix(rnorm(n * p), n, p)           # design matrix with i.i.d. N(0, 1) entries
y = X %*% c(1, 0, 0, 1, -1) + rnorm(n)   # true beta = (1, 0, 0, 1, -1)
# solve the normal equations (X^T X) beta = X^T y, i.e., compute (X^T X)^-1 X^T y
beta_hat = solve(t(X) %*% X, t(X) %*% y)
print(beta_hat)
## [,1]
## [1,] 0.88369785
## [2,] -0.01725878
## [3,] -0.02658416
## [4,] 0.90233834
## [5,] -1.05639331
We calculate \(\hat \beta\) using the formula \(\hat \beta = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}\). In R, %*% is matrix multiplication and t() is the matrix transpose.
Actually, to reduce numerical error, the lm() function uses a smarter approach, namely a QR decomposition of \(\mathbf{X}\), which avoids computing \((\mathbf{X}^T \mathbf{X})^{-1}\) explicitly.
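For instance, base R's qr.solve() takes this QR route directly; a small sketch reusing the X, y, and beta_hat from above:
beta_qr = qr.solve(X, y)   # least squares via the QR decomposition of X
cbind(beta_hat, beta_qr)   # should agree with the normal-equation solution up to numerical precision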
If the intercept term is required, we can use cbind(rep(1, n), X) to create a new design matrix \(\mathbf{X}\) whose first column is all ones.
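A minimal sketch of both routes, reusing the X, y, and n generated above (note that lm() adds an intercept automatically through its formula interface):
X1 = cbind(rep(1, n), X)                     # design matrix with a leading column of ones
beta_hat1 = solve(t(X1) %*% X1, t(X1) %*% y) # least squares with the intercept included
fit = lm(y ~ X)                              # lm() fits the intercept by default
cbind(beta_hat1, coef(fit))                  # the two sets of estimates should match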
Perform a simulation study to check the consistency of a sample mean estimator \(\bar{X}_n\). Please save your random seed so that the results can be replicated by others.
Answer:
set.seed(1)
x = runif(20)   # a sample of n = 20 from Uniform(0, 1), whose true mean is 0.5
x_bar = mean(x)
print(x_bar)
## [1] 0.5551671
Now we repeat this experiment 1000 times, storing each sample of size \(n = 20\) as one row of the matrix X.
set.seed(1)
n = 20
N = 1000
X = matrix(runif(n * N), nrow = N, ncol = n)
X_bar = rowMeans(X)
hist(X_bar, main = "n=20", xlim = c(0.25, 0.75))
mean(abs(X_bar - 0.5) > 0.1)
## [1] 0.143
That is, 14.3% of the 1000 sample means deviate from the true mean 0.5 by more than 0.1.
set.seed(1)
n = 50
N = 1000
X = matrix(runif(n * N), nrow = N, ncol = n)
X_bar = rowMeans(X)
hist(X_bar, main = "n=50", xlim = c(0.25, 0.75))
mean(abs(X_bar - 0.5) > 0.1)
## [1] 0.016
set.seed(1)
n = 200
N = 1000
X = matrix(runif(n * N), nrow = N, ncol = n)
X_bar = rowMeans(X)
hist(X_bar, main = "n=200", xlim = c(0.25, 0.75))
mean(abs(X_bar - 0.5) > 0.1)
## [1] 0
As \(n\) increases to 50 and 200, this proportion decreases to 0.016 and 0, respectively. We can observe that \(\bar{X}_n\) becomes more and more concentrated around its expectation 0.5 as \(n\) grows, which is guaranteed by the Law of Large Numbers.