Instruction

Students are encouraged to work together on homework. However, sharing, copying, or providing any part of a homework solution or code is an infraction of the University’s rules on Academic Integrity. Any violation will be punished as severely as possible. Final submissions must be uploaded to Gradescope; no email or hardcopy submissions will be accepted. For the late submission policy and grading rubrics, please refer to the course website.

About Homework 1

You will receive a 100% score for HW1 as long as you submit a finished version before the deadline. The goal of HW1 is to check your prerequisite knowledge. All of this material should already be covered in courses such as Stat 410, Stat 425, and other basic mathematics courses. These concepts and skills will be used extensively in our course.

In addition, this HW also checks your basic programming knowledge, such as writing a function, setting a random seed, and using LaTeX. Please note that you must type all formulas in LaTeX in all future homework. Failing to do so will lead to a penalty.

Question 1: Basic calculus

  1. Calculate the derivative of \(f(x)\)

    1. \(f(x) = e^x\)
    2. \(f(x) = \log(1 + x)\)
    3. \(f(x) = \log(1 + e^x)\)

Answer:

  1. \(f'(x) = e^x\).

  2. \(f'(x) = \frac{1}{1+x}\).

  3. \(f'(x) = \frac{d \log(1 + e^x)}{d(1+e^x)} \frac{d(1+e^x)}{dx} = \frac{1}{1+e^x} e^x = \frac{e^x}{1+e^x}\).
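As a quick numerical sanity check of part 3, we can compare the closed-form derivative with a central finite difference in R (a minimal sketch; the names f and f_prime are ours):

f = function(x) log(1 + exp(x))
f_prime = function(x) exp(x) / (1 + exp(x))
x0 = 0.7; h = 1e-6
(f(x0 + h) - f(x0 - h)) / (2 * h)  # central difference approximation of f'(x0)
f_prime(x0)                        # closed form; the two should agree to several digits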

  2. Taylor expansion. Let \(f: \mathbb{R} \rightarrow \mathbb{R}\) be a twice differentiable function. Please write down the first three terms of its Taylor expansion at the point \(x = 1\).

Answer:

\[ f(x) = f(1) + f'(1) (x-1) + \frac{f''(1)}{2} (x-1)^2 + o((x-1)^2), \] where \(o((x-1)^2)\) denotes a remainder term of lower order than \((x-1)^2\) as \(x \to 1\).
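As a numerical illustration (a sketch, reusing \(f(x) = \log(1 + e^x)\) from the previous question, for which \(f''(x) = e^x/(1+e^x)^2\)):

f  = function(x) log(1 + exp(x))
f1 = function(x) exp(x) / (1 + exp(x))    # f'
f2 = function(x) exp(x) / (1 + exp(x))^2  # f''
taylor2 = function(x) f(1) + f1(1) * (x - 1) + f2(1) / 2 * (x - 1)^2
x = 1.1
f(x) - taylor2(x)  # remainder; should be much smaller than (x - 1)^2 = 0.01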

  3. For the infinite sum \(\sum_{n=1}^\infty \frac{1}{n^\alpha}\), where \(\alpha\) is a positive real number, give the exact range of \(\alpha\) such that the series converges.

Answer:

The series converges if and only if \(\alpha > 1\) (the \(p\)-series test).
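A quick numerical illustration in R (not a proof): partial sums stabilize when \(\alpha > 1\) but keep growing when \(\alpha \le 1\).

partial_sum = function(alpha, N) sum(1 / (1:N)^alpha)
partial_sum(2, 1e3); partial_sum(2, 1e6)  # both close to pi^2/6 ~ 1.645: converges
partial_sum(1, 1e3); partial_sum(1, 1e6)  # roughly 7.49 vs 14.39, grows like log(N): diverges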

Question 2: Linear algebra

  1. What is the eigen-decomposition of a real symmetric matrix \(A_{n \times n}\)? Write down one form of that decomposition and explain each term in your formula. Then, assuming all eigenvalues are positive, use these terms to derive \(A^{-1/2}\).

Answer:

We provide one form of the eigen-decomposition; this decomposition is not unique. \[A = U \Lambda U^T = \sum_{i = 1}^{n} \lambda_i u_i u_i^T.\]

Here \(U = (u_1, \ldots, u_n)\) collects the eigenvectors and \(\Lambda\) is a diagonal matrix with \(\Lambda_{ii} = \lambda_i\). The eigenvectors are orthonormal: \(u_i^T u_j = 1\) if \(i = j\) and \(u_i^T u_j = 0\) if \(i \neq j\), so \(U^T U = I\).

\[ A^{-1} = U \Lambda^{-1} U^T = U \Lambda^{-1/2} \Lambda^{-1/2} U^T = (U \Lambda^{-1/2} U^T) (U \Lambda^{-1/2} U^T), \] where the last step uses \(U^T U = I\). Hence \(A^{-1/2} = U \Lambda^{-1/2} U^T\), where \(\Lambda^{-1/2}\) is diagonal with \((\Lambda^{-1/2})_{ii} = 1/\sqrt{\lambda_i}\).
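A small numerical check in R (a sketch; crossprod(M) builds \(M^T M\), which is symmetric, and adding the identity makes it positive definite):

set.seed(1)
M = matrix(rnorm(16), 4, 4)
A = crossprod(M) + diag(4)      # symmetric positive definite matrix
e = eigen(A, symmetric = TRUE)  # e$vectors is U, e$values are the lambda_i
A_inv_sqrt = e$vectors %*% diag(1 / sqrt(e$values)) %*% t(e$vectors)
max(abs(A_inv_sqrt %*% A_inv_sqrt %*% A - diag(4)))  # ~ 0 up to rounding error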

  2. What is a symmetric positive definite matrix \(A_{n \times n}\)? Give one of the equivalent definitions and explain your notation.

Answer:

\(A\) is symmetric and, for any nonzero vector \(x \in \mathbb{R}^n\), \(x^T A x > 0\). Or equivalently, \(A\) is symmetric and all eigenvalues of \(A\) are positive.
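Both characterizations are easy to verify numerically (a sketch, reusing the matrix A built in the previous check):

min(eigen(A, symmetric = TRUE)$values) > 0  # TRUE: all eigenvalues positive
x = rnorm(4)                                # a random (almost surely nonzero) vector
t(x) %*% A %*% x > 0                        # TRUE: quadratic form is positive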

  3. True/False. If you claim a statement is false, explain why. For two real matrices \(A_{m \times n}\) and \(B_{n \times m}\):

    1. Rank\((A)\) = \(\min\{m, n\}\)
    2. If \(m = n\), then trace\((A)\) = \(\sum_{i=1}^n A_{ii}\)
    3. If \(A\) is a symmetric matrix, then all eigenvalues of \(A\) are real
    4. If \(A\) is a symmetric matrix, \(\lambda_1\) and \(\lambda_2\) are two distinct eigenvalues, and \(v_1\), \(v_2\) are the corresponding eigenvectors, then it is possible that \(v_1^T v_2 > 0\).
    5. If \(A\) is a symmetric matrix and \(v_1\), \(v_2\) are two distinct eigenvectors of the same eigenvalue \(\lambda\), then it is possible that \(v_1^T v_2 > 0\).
    6. trace(ABAB) = trace(AABB).

Answer:

  1. False. Rank\((A) \leq \min\{m, n\}\), with equality only when \(A\) has full row or column rank; for example, the zero matrix has rank 0.

  2. True. (This is the definition of matrix trace.)

  3. True.

  4. False. Eigenvectors of a symmetric matrix corresponding to distinct eigenvalues are orthogonal, so \(v_1^T v_2 = 0\) always holds.

  5. True. Eigenvectors belonging to the same eigenvalue all lie in the same eigenspace and need not be orthogonal, unless we explicitly orthogonalize them (e.g., by Gram-Schmidt).

  6. False. \(AA\) is not even defined when \(m \neq n\); moreover, even when \(m = n\), the identity fails in general (see the numerical check below).
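A numerical spot check of items 4 and 6 (a random counterexample, not a proof; a random symmetric matrix has distinct eigenvalues with probability one):

set.seed(1)
A = matrix(rnorm(9), 3, 3); A = (A + t(A)) / 2  # random symmetric matrix
e = eigen(A, symmetric = TRUE)
t(e$vectors[, 1]) %*% e$vectors[, 2]            # ~ 0: orthogonal, as claimed in item 4
B = matrix(rnorm(9), 3, 3)
sum(diag(A %*% B %*% A %*% B))                  # trace(ABAB)
sum(diag(A %*% A %*% B %*% B))                  # trace(AABB): generally different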

Question 3: Statistics

  1. \(X_1\), \(X_2\), \(\ldots\), \(X_n\) are i.i.d. \({\cal N}(\mu, \sigma^2)\) random variables, where \(\mu \in \mathbb{R}\) and \(\sigma > 0\) is finite. Let \(\bar{X}_n = \frac{1}{n} \sum_{i=1}^n X_i\).

    1. What is an unbiased estimator? Is \(\bar{X}_n\) an unbiased estimator of \(\mu\)?
    2. What is \(E[(\bar{X}_n)^2]\) in terms of \(n, \mu, \sigma\)?
    3. Give an unbiased estimator of \(\sigma^2\).
    4. What is a consistent estimator? Is \(\bar{X}_n\) a consistent estimator of \(\mu\)?

Answer:

  1. \(\hat \theta\) is an unbiased estimator of \(\theta\) if \(E(\hat \theta) = \theta\). Yes, \(E(\bar X_n) = E(X_1) = \mu\).

  2. \(E[(\bar{X}_n)^2] = Var(\bar{X}_n) + [E(\bar{X}_n)]^2 = \sigma^2/n + \mu^2\).

  3. \(\frac{1}{n-1} \sum_{i=1}^n (X_i - \bar{X}_n)^2\). Note that the denominator is \(n-1\) instead of \(n\); this correction is exactly what makes the estimator unbiased.

  4. \(\hat \theta\) is a consistent estimator of \(\theta\) if \(\hat \theta \xrightarrow{P} \theta\) as \(n \to \infty\), where \(\xrightarrow{P}\) denotes convergence in probability. A sufficient condition is \(E(|\hat \theta - \theta|^2) \to 0\) as \(n \to \infty\). \(\bar{X}_n\) is a consistent estimator of \(\mu\), because \[ E(|\bar{X}_n - \mu|^2) = Var(\bar{X}_n) + [E(\bar{X}_n - \mu)]^2 = \sigma^2 / n + 0 \xrightarrow{n \to \infty} 0. \] We have already shown that \(E(\bar{X}_n - \mu) = 0\) in part 1.
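As a quick simulation check of the unbiasedness claim in part 3 (a sketch; R's var() uses the \(n-1\) denominator):

set.seed(1)
n = 50; mu = 2; sigma = 1.5
s2 = replicate(5000, var(rnorm(n, mu, sigma)))  # 5000 sample variances
mean(s2)  # should be close to sigma^2 = 2.25, consistent with unbiasedness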

  2. Suppose \(X_{p \times 1}\) is a vector of covariates, \(\beta_{p \times 1}\) is a vector of unknown parameters, \(\epsilon\) is the unobserved random noise, and we assume the linear model relationship \(y = X^T \beta + \epsilon\). Suppose we have \(n\) i.i.d. samples from this linear model, and the observed data can be written using the matrix form: \(\mathbf{y}_{n \times 1} = \mathbf{X}_{n\times p} \beta_{p \times 1} + \boldsymbol \epsilon_{n \times 1}\).

    1. If we want to estimate the unknown \(\beta\) using a least square method, what is the objective/loss function \(L(\beta)\) to obtain \(\widehat \beta\)?
    2. What is the solution of \(\widehat \beta\)? Represent the solution using the observed data \(\mathbf{y}\) and \(\mathbf{X}_{n\times p}\). Note that you may assume that \(\mathbf{X}^T \mathbf{X}\) is invertible.

Answer:

  1. The loss function is the sum of squared residuals: \(L(\beta) = \| \mathbf{y} - \mathbf{X} \beta\|^2 = \beta^T \mathbf{X}^T \mathbf{X} \beta - 2\mathbf{y}^T \mathbf{X} \beta + \mathbf{y}^T\mathbf{y}\). In some literature, there may be a \(\frac{1}{2}\) or \(\frac{1}{n}\) factor in \(L(\beta)\).

  2. \(\widehat \beta = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}\). This can be derived by minimizing \(L(\beta)\) with the vector differentiation rules \(d(\alpha^T \beta)/d\beta = \alpha\) and \(d(\beta^T A \beta)/d\beta = (A + A^T) \beta\), where \(\alpha\) and \(\beta\) are vectors and \(A\) is a square matrix.
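Concretely, applying these rules to \(L(\beta)\) and setting the gradient to zero gives the normal equations: \[ \frac{d L(\beta)}{d \beta} = 2 \mathbf{X}^T \mathbf{X} \beta - 2 \mathbf{X}^T \mathbf{y} = 0 \quad \Longrightarrow \quad \widehat \beta = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}. \]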

Question 4: Programming

  1. Use the following code to generate a set of \(n\) observations \(\mathbf{y}_{n \times 1}\) and \(\mathbf{X}_{n\times p}\). Follow the previously established formula to solve for the least square estimator \(\widehat \beta\). Note that you must write your own code, instead of using existing functions such as lm(). In addition, what should you do if you are asked to add an intercept term \(\beta_0\) into your estimation (even though the true \(\beta_0 = 0\) in our data generator)?

Answer:

First, we generate data for a linear regression.

set.seed(1)
n = 100; p = 5
X = matrix(rnorm(n * p), n, p)
y = X %*% c(1, 0, 0, 1, -1) + rnorm(n)
# compute (X^T X)^{-1} X^T y by solving the system (X^T X) beta = X^T y
beta_hat = solve(t(X) %*% X, t(X) %*% y)
print(beta_hat)
##             [,1]
## [1,]  0.88369785
## [2,] -0.01725878
## [3,] -0.02658416
## [4,]  0.90233834
## [5,] -1.05639331

We calculate \(\hat \beta\) using the formula \(\hat \beta = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}\). In R, %*% is matrix multiplication and t() is the transpose.

In fact, to reduce numerical error, the lm() function uses a smarter approach: a QR decomposition of \(\mathbf{X}\), which avoids forming \((\mathbf{X}^T \mathbf{X})^{-1}\) explicitly.
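For reference, a QR-based computation can be sketched with base R as follows (the result should match beta_hat up to numerical error):

qr_X = qr(X)  # X = QR, with Q having orthonormal columns and R upper triangular
beta_qr = backsolve(qr.R(qr_X), t(qr.Q(qr_X)) %*% y)  # solve R beta = Q^T y
max(abs(beta_qr - beta_hat))  # ~ 0; qr.solve(X, y) is a one-line equivalent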

If an intercept term is required, we can use cbind(rep(1, n), X) to create a new design matrix \(\mathbf{X}\) whose first column is all ones, as in the sketch below.
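A minimal sketch (the true intercept is 0 in our generator, so its estimate should be near 0):

X1 = cbind(rep(1, n), X)  # prepend a column of ones for the intercept
beta_hat1 = solve(t(X1) %*% X1, t(X1) %*% y)
beta_hat1[1]  # estimated intercept beta_0; should be close to 0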

  2. Perform a simulation study to check the consistency of the sample mean estimator \(\bar{X}_n\). Please set your random seed so that the results can be replicated by others.

    1. Generate a set of \(n = 20\) i.i.d. observations from the Uniform(0, 1) distribution and calculate the sample mean \(\bar{X}_n\).
    2. Repeat step 1 a total of 1000 times to collect 1000 such sample means and plot them using a histogram.
    3. How many of these sample means (out of 1000) are at least 0.1 away from the true mean parameter, which is 0.5 for Uniform(0, 1)?
    4. Repeat steps 1 to 3 with \(n = 100\) and \(n = 500\). What conclusion can you make?

Answer:

  1. The following code generates 20 i.i.d. uniform observations and calculates their sample mean.
set.seed(1)
x = runif(20)
x_bar = mean(x)
print(x_bar)
## [1] 0.5551671
  2. We generate \(1000 \times 20\) random variables; each row of the matrix X is one simulated sample of size \(n = 20\), and rowMeans() gives the 1000 sample means.
set.seed(1)
n = 20
N = 1000
X = matrix(runif(n * N), nrow = N, ncol = n)
X_bar = rowMeans(X)
hist(X_bar, main = "n=20", xlim = c(0.25, 0.75))

mean(abs(X_bar - 0.5) > 0.1)  # proportion of sample means at least 0.1 away from 0.5
## [1] 0.143

The proportion is 0.143; that is, 143 of the 1000 sample means are at least 0.1 away from the true mean 0.5. We now repeat the simulation with larger sample sizes.

set.seed(1)
n = 50
N = 1000
X = matrix(runif(n * N), nrow = N, ncol = n)
X_bar = rowMeans(X)
hist(X_bar, main = "n=50", xlim = c(0.25, 0.75))

mean(abs(X_bar - 0.5) > 0.1)
## [1] 0.016
set.seed(1)
n = 200
N = 1000
X = matrix(runif(n * N), nrow = N, ncol = n)
X_bar = rowMeans(X)
hist(X_bar, main = "n=200", xlim = c(0.25, 0.75))

mean(abs(X_bar - 0.5) > 0.1)
## [1] 0

As \(n\) increases to 50 and 200, the proportion of such deviating sample means decreases to 0.016 and 0, respectively. We can observe that \(\bar{X}_n\) becomes more and more concentrated around its expectation 0.5 as \(n\) grows, which is guaranteed by the Law of Large Numbers.