Please remove this section when submitting your homework.
Students are encouraged to work together on homework and/or utilize advanced AI tools. However, sharing, copying, or providing any part of a homework solution or code to others is an infraction of the University’s rules on Academic Integrity. Any violation will be punished as severely as possible. Final submissions must be uploaded to Gradescope. No email or hard copy will be accepted. For late submission policy and grading rubrics, please refer to the course website.
Submit your file as HWx_yourNetID.pdf. For example, HW01_rqzhu.pdf. Please note that this must be a .pdf file; .html format cannot be accepted. Make all of your R code chunks visible for grading.
If you use the .Rmd file as a template, be sure to remove this instruction section.
Make sure your version of R is \(\geq 4.0.0\). This will ensure your random seed generation is the same as everyone else's. Please note that updating the R version may require you to reinstall all of your packages.
Load the MNIST data and take just digits 1, 6, and 7 in the first 1000 observations as the training data. Do the same for the second 1000 observations as the testing data.
# Load the data
load("mnist_first2000.RData")
dim(mnist)
## [1] 2000 785
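One way to carry out the subsetting described above is sketched here. It assumes that the first column of mnist holds the digit label and the remaining 784 columns hold pixel values; that layout is a guess based on the 2000 x 785 dimensions, so check it against the actual file. The stand-in data and the names keep, train_x, train_y, test_x, and test_y are illustrative only.

```r
# Stand-in with the same shape as the real data, used only if mnist is not
# already loaded. Assumption: column 1 is the digit label and the other 784
# columns are pixel intensities.
if (!exists("mnist")) {
  set.seed(1)
  mnist <- cbind(Digit = sample(0:9, 2000, replace = TRUE),
                 matrix(sample(0:255, 2000 * 784, replace = TRUE), nrow = 2000))
}

keep <- c(1, 6, 7)

# Row indices of the selected digits within each block of 1000 observations
train_idx <- which(mnist[1:1000, 1] %in% keep)
test_idx  <- 1000 + which(mnist[1001:2000, 1] %in% keep)

train_x <- as.matrix(mnist[train_idx, -1])
train_y <- mnist[train_idx, 1]
test_x  <- as.matrix(mnist[test_idx, -1])
test_y  <- mnist[test_idx, 1]

table(train_y)
table(test_y)
```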
[5 pts] Unlike in HW9, we will perform PCA only on the training data and then apply the learned rotation matrix to the testing data. Carry this out using the prcomp() function and the associated predict() function, and extract the first 20 PCs for both the training and testing data. Use centering but not scaling when you perform the PCA. Comment on why this is OK for this dataset.
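The mechanics of fitting PCA on one set and projecting another can be sketched as follows. The matrices here are random stand-ins rather than the MNIST pixels, and the names pca, train_pc, and test_pc are illustrative. The last lines check that predict() matches centering the new data with the training means and multiplying by the rotation matrix.

```r
set.seed(1)
# Stand-in matrices (assumption); replace with the real training and testing
# pixel matrices, observations in rows.
train_x <- matrix(runif(300 * 50, 0, 255), nrow = 300)
test_x  <- matrix(runif(200 * 50, 0, 255), nrow = 200)

# Fit PCA on the training data only: center, but do not scale
pca <- prcomp(train_x, center = TRUE, scale. = FALSE)

# Scores of the training data, and the testing data projected with the
# centering vector and rotation learned from the training data
train_pc <- pca$x[, 1:20]
test_pc  <- predict(pca, newdata = test_x)[, 1:20]

# Manual version of the projection: subtract training means, then rotate
manual <- scale(test_x, center = pca$center, scale = FALSE) %*% pca$rotation[, 1:20]
all.equal(unname(test_pc), unname(manual), check.attributes = FALSE)
```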
[25 pts] Now let's write our own version of the \(K\)-means algorithm using the training data. Keep in mind that the idea of \(K\)-means is
After finishing the algorithm, compare your clustering result (labels) to the true digit labels (which you didn’t observe). Are they similar? Can you explain what you see?
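For orientation, a minimal sketch of the usual alternating scheme (assign each observation to its nearest center, then recompute each center as the mean of its members) is shown below. It is one common formulation and not necessarily the exact steps this assignment asks for. The function name my_kmeans, its arguments, the random-observation initialization, and the stopping rule are all assumptions, and the toy data are synthetic.

```r
# Minimal K-means sketch. The name my_kmeans and its arguments are
# hypothetical, not taken from the assignment.
my_kmeans <- function(X, K, max_iter = 100) {
  X <- as.matrix(X)
  n <- nrow(X)
  # Initialize centers as K distinct randomly chosen observations
  centers <- X[sample(n, K), , drop = FALSE]
  labels <- rep(0L, n)
  for (iter in seq_len(max_iter)) {
    # Squared Euclidean distance from every observation to every center (n x K)
    d2 <- sapply(seq_len(K), function(k) rowSums(sweep(X, 2, centers[k, ])^2))
    # Assignment step: index of the nearest center for each observation
    new_labels <- max.col(-d2, ties.method = "first")
    if (all(new_labels == labels)) break  # assignments stopped changing
    labels <- new_labels
    # Update step: each center becomes the mean of its current members
    for (k in seq_len(K)) {
      if (any(labels == k)) {
        centers[k, ] <- colMeans(X[labels == k, , drop = FALSE])
      }
    }
  }
  list(cluster = labels, centers = centers)
}

# Toy run on two synthetic groups
set.seed(1)
X <- rbind(matrix(rnorm(100, mean = 0, sd = 0.1), ncol = 2),
           matrix(rnorm(100, mean = 10, sd = 0.1), ncol = 2))
fit <- my_kmeans(X, K = 2)
table(cluster = fit$cluster, group = rep(1:2, each = 50))
```

The same kind of cross-tabulation, with the true digit labels in place of group, is one way to compare cluster labels with the truth. Cluster numbers are arbitrary, so only the pattern of the table matters, and a single random start can settle in a poor local solution.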
Use the kmeans() function to cluster the training data. Use nstart \(= 20\). Compare your result to the one you got in part b. Are they the same (they could be slightly different)? Can you explain the result if they are not completely the same?
golub data
Take the golub data from HW3 and 4, and let's perform some clustering on it. Remember that we will use only the gene expression part of the data for clustering, and later compare our clustering with the true class labels (golub.cl).
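The workflow described above (cluster on expression values only, then cross-tabulate against the class labels) can be sketched as follows, using hierarchical clustering as the example method. The data are a random stand-in, not the real golub values. The layout with genes in rows and samples in columns, the 27/11 class split, the cut at two clusters, and the names golub_x and golub_cl are assumptions for illustration.

```r
set.seed(1)
# Stand-in data (assumption): 200 genes in rows, 38 samples in columns, and
# a 0/1 class vector. Replace with the real expression matrix and golub.cl.
golub_cl <- rep(c(0, 1), times = c(27, 11))
golub_x <- matrix(rnorm(200 * 38), nrow = 200)
golub_x[1:20, golub_cl == 1] <- golub_x[1:20, golub_cl == 1] + 3

# Cluster the samples, so transpose to put samples in rows
d <- dist(t(golub_x))

for (m in c("single", "complete", "average")) {
  hc <- hclust(d, method = m)
  cat("linkage:", m, "\n")
  # Cut the tree into two groups and compare with the class labels
  print(table(cluster = cutree(hc, k = 2), truth = golub_cl))
}
```

The class labels enter only in the final table, never in the distance computation or the clustering itself.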
[30 pts] Perform hierarchical clustering on the data using the hclust() function.
Consider the linkage methods single, complete, and average.
[20 pts] Perform spectral clustering using the following steps:
FNN function.
heatmap(), what information do you see?
[10 pts] Perform UMAP on the data and experiment with the number of nearest neighbors \(k = 4\). Provide the necessary plots and tables to show your results, and comment on your findings.
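A minimal UMAP sketch with four nearest neighbors is given below. It assumes the CRAN umap package, which the assignment does not name, and again uses random stand-in data with samples in columns. The names golub_x, golub_cl, and cfg, and the fixed random_state, are illustrative choices.

```r
library(umap)  # assumption: the CRAN 'umap' package is installed

set.seed(1)
# Stand-in data (assumption): genes in rows, 38 samples in columns
golub_cl <- rep(c(0, 1), times = c(27, 11))
golub_x <- matrix(rnorm(200 * 38), nrow = 200)
golub_x[1:20, golub_cl == 1] <- golub_x[1:20, golub_cl == 1] + 3

# Start from the package defaults and change only what is needed
cfg <- umap.defaults
cfg$n_neighbors <- 4    # number of nearest neighbors k
cfg$random_state <- 1   # fixed seed so the layout is reproducible

fit <- umap(t(golub_x), config = cfg)

# Two-dimensional embedding, colored by the class labels
plot(fit$layout, col = golub_cl + 1, pch = 19,
     xlab = "UMAP 1", ylab = "UMAP 2")
```

Rerunning with other values of n_neighbors is a simple way to see how sensitive the embedding is to this setting.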