Please remove this section when submitting your homework.
Students are encouraged to work together on homework and/or utilize advanced AI tools. However, there are two basic rules:
Final submissions must be uploaded to Gradescope. No email or hard copy will be accepted. Please refer to the course website for late submission policy and grading rubrics.
Name your submission HWx_yourNetID.pdf, for example
HW01_rqzhu.pdf. Please note that this must be a
.pdf file. The .html format will not be
accepted because it is often not readable on Gradescope.
Make all of your R code chunks visible for grading.
Make sure your version of R is \(\geq 4.0.0\). This
will ensure your random seed generation is the same as everyone
else.
If you use this .Rmd file
as a template, be sure to remove this instruction
section.
In our lecture, we demonstrated an example of clustering pixels in an image. In this question, you will replicate that procedure using your favorite image. To complete this question, perform the following steps:
Load the mtcars dataset. We will perform hierarchical
clustering on this dataset using a distance metric called the Manhattan
distance (also known as the \(L_1\)
norm). You should construct a distance matrix \(D_{n \times n}\), where the \((i, j)\)th element represents the distance
between observations \(i\) and \(j\), defined as
\[ d(\mathbf{x}_i, \mathbf{x}_j) = \lVert \mathbf{x}_i - \mathbf{x}_j \rVert_1. \]
This can be done using the dist() function, but you need
to read the documentation carefully [link].
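As a minimal sketch of the distance computation described above: `dist()` defaults to Euclidean distance, so the `method` argument has to be set explicitly. The object names `X`, `D`, `D_mat`, and `manual` are illustrative choices, not part of the assignment.

```r
# mtcars ships with base R: 32 cars, 11 numeric variables
data(mtcars)
X <- as.matrix(mtcars)

# Pairwise L1 (Manhattan) distances between rows;
# without `method`, dist() would compute Euclidean distances
D <- dist(X, method = "manhattan")

# Expand the compact "dist" object into a full n x n matrix
D_mat <- as.matrix(D)
dim(D_mat)

# Sanity check: one entry against the definition sum(|x_i - x_j|)
i <- 1
j <- 2
manual <- sum(abs(X[i, ] - X[j, ]))
all.equal(unname(D_mat[i, j]), manual)
```

Checking a single entry by hand like this is a quick way to confirm that the intended metric was actually used.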
Try the complete and ward.D2 linkage methods.
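One possible way to run both linkage methods on the same Manhattan distance object is sketched below. The variable names and plotting options are assumptions made for illustration. Note that Ward's criterion is formulated for Euclidean distances, so with \(L_1\) distances the `ward.D2` result is best read as a heuristic.

```r
# Manhattan distance matrix on the raw mtcars variables
D <- dist(mtcars, method = "manhattan")

# Fit the two linkage methods on the same distances
hc_complete <- hclust(D, method = "complete")
hc_ward     <- hclust(D, method = "ward.D2")

# Side-by-side dendrograms for visual comparison
old_par <- par(mfrow = c(1, 2))
plot(hc_complete, main = "Complete linkage", cex = 0.6, xlab = "", sub = "")
plot(hc_ward, main = "ward.D2 linkage", cex = 0.6, xlab = "", sub = "")
par(old_par)
```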
Select one final clustering result. Since you know the meaning of these
variables, you can make the judgment based on how you think these
clusters represent the data. If you believe some variables are more/less
important than others, you can consider rescaling them and redoing the
analysis. This is an open-ended question, but you should provide a
reasonable explanation for your choice. Provide both numerical and
graphical results to support your conclusion.
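The sketch below shows one way to produce numerical and graphical summaries for a chosen clustering. The number of clusters `k = 3`, the use of `scale()` to standardize every variable, and the choice of complete linkage are all hypothetical, made only to keep the example concrete; a different rescaling (for example, multiplying a column you consider more important by a weight) would slot in at the same place.

```r
# Hypothetical choices for illustration only
k <- 3

# Standardize variables so that large-scale columns (e.g. disp, hp)
# do not dominate the L1 distance
X_scaled <- scale(mtcars)
D_scaled <- dist(X_scaled, method = "manhattan")
hc <- hclust(D_scaled, method = "complete")

# Cut the tree into k clusters
clusters <- cutree(hc, k = k)

# Numerical summaries: cluster sizes and per-cluster variable means
table(clusters)
aggregate(mtcars, by = list(cluster = clusters), FUN = mean)

# Graphical summary: dendrogram with the k clusters outlined
plot(hc, cex = 0.6, main = "Complete linkage, scaled data", xlab = "", sub = "")
rect.hclust(hc, k = k, border = 2:4)
```

Comparing the per-cluster means against what the variables represent (for instance, cylinders, horsepower, and fuel economy) is one way to argue whether the grouping is meaningful.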
Load our MNIST handwritten digit data from the previous homework. Take 200 observations each from the digits 0, 2, and 5 (600 observations in total). Perform spectral clustering on this dataset by following these steps:
Use the umap package in R to perform UMAP
on the first 1000 observations of the MNIST handwritten digit
dataset. Use the default settings of the umap function.
Plot the two-dimensional representation of the data points and color
them based on their true digit labels. Briefly comment on whether the
clustering results are good. What is your criterion for “good”
clustering here? Can you tune the parameters to improve the results? Ask
ChatGPT for tuning ideas if you do not know where to start.
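A rough sketch of the UMAP workflow follows. Because the MNIST file from the previous homework is not shown here, the built-in `iris` data stands in for it (an assumption): `X` would be replaced by the first 1000 rows of pixel values and `y` by the corresponding digit labels as a factor. The tuned values of `n_neighbors` and `min_dist` are arbitrary illustrations, not recommendations.

```r
library(umap)

# Stand-in data (assumption): swap in the MNIST pixels and digit labels
X <- as.matrix(iris[, 1:4])
y <- iris$Species

# Default settings, as stored in umap.defaults
set.seed(1)
fit_default <- umap(X)

# The 2-D embedding is in $layout; color points by their true labels
plot(fit_default$layout, col = as.integer(y), pch = 19, cex = 0.6,
     xlab = "UMAP 1", ylab = "UMAP 2", main = "UMAP, default settings")
legend("topright", legend = levels(y),
       col = seq_along(levels(y)), pch = 19, cex = 0.7)

# Tuning: copy the default configuration and change selected fields
config <- umap.defaults
config$n_neighbors <- 30   # larger values emphasize global structure
config$min_dist <- 0.05    # smaller values allow tighter clumps
fit_tuned <- umap(X, config = config)
```

One possible criterion for "good" here is how well points sharing a true label form compact groups that are separated from the other labels in the two-dimensional plot.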