Instruction

Please remove this section when submitting your homework.

Students are encouraged to work together on homework and/or utilize advanced AI tools. However, there are two basic rules:

Final submissions must be uploaded to Gradescope. No email or hard copy will be accepted. Please refer to the course website for late submission policy and grading rubrics.

Question 1: K-Means [25 pts]

In our lecture, we demonstrated an example of clustering pixels in an image. In this question, you will replicate that procedure using your favorite image. To complete this question, perform the following steps:

Question 2: Hierarchical Clustering [25 pts]

Load the mtcars dataset. We will perform hierarchical clustering on this dataset using a distance metric called the Manhattan distance (also known as the \(L_1\) norm). You should construct a distance matrix \(D_{n \times n}\), where the \((i, j)\)th element represents the distance between observations \(i\) and \(j\), defined as

\[ d(\mathbf{x}_i, \mathbf{x}_j) = \lVert \mathbf{x}_i - \mathbf{x}_j \rVert_1. \]

This can be done using the dist() function, but you need to read the documentation carefully [link]. Try the complete and ward.D2 linkage methods, then select one final clustering result. Since you know the meaning of these variables, you can base your judgment on how well you think the clusters represent the data. If you believe some variables are more or less important than others, consider rescaling them and redoing the analysis. This is an open-ended question, but you should provide a reasonable explanation for your choice. Provide both numerical and graphical results to support your conclusion.
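As a starting point, the workflow might be sketched as follows. This is a minimal sketch, not a model answer: standardizing the columns with scale() and cutting the tree at k = 3 are illustrative assumptions, and both should be revisited as part of your own analysis.

```r
# mtcars ships with base R (datasets package)
data(mtcars)

# Assumption: standardize every variable so that large-scale columns
# (e.g. disp, hp) do not dominate the L1 distance. Other rescalings are possible.
x <- scale(mtcars)

# dist() uses Euclidean distance by default, so the method is set explicitly
d <- dist(x, method = "manhattan")

# Sanity check: entry (1, 2) equals the sum of absolute coordinate differences
manual <- sum(abs(x[1, ] - x[2, ]))
stopifnot(isTRUE(all.equal(as.matrix(d)[1, 2], manual)))

# The two linkage methods named in the question
hc_complete <- hclust(d, method = "complete")
hc_ward     <- hclust(d, method = "ward.D2")

# Assumption: k = 3 is a hypothetical choice used only for illustration
k <- 3
cl_complete <- cutree(hc_complete, k = k)
cl_ward     <- cutree(hc_ward, k = k)

# Numerical summaries: agreement between linkages, and per-cluster means
print(table(complete = cl_complete, ward.D2 = cl_ward))
print(aggregate(mtcars, by = list(cluster = cl_ward), FUN = mean))

# Graphical summary: dendrogram with the k clusters outlined
plot(hc_ward, main = "ward.D2 linkage, Manhattan distance", xlab = "", sub = "")
rect.hclust(hc_ward, k = k)
```

Comparing the cross-tabulation and the per-cluster means against what you know about the cars (for example, engine size and fuel economy) is one way to justify the final choice.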

Question 3: Spectral Clustering [25 pts]

Load our MNIST handwritten digit data from the previous homework. Take 200 observations each from the digits 0, 2, and 5 (600 observations in total). Perform spectral clustering on this dataset by following these steps:

Question 4: UMAP [25 pts]

Use the umap package in R to perform UMAP on the first 1000 observations of the MNIST handwritten digit dataset. Use the default settings of the umap function. Plot the two-dimensional representation of the data points and color them based on their true digit labels. Briefly comment on whether the clustering results are good. What is your criterion for “good” clustering here? Can you tune the parameters to improve the results? Ask ChatGPT for ideas on tuning if you do not know where to start.
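The calls involved might look like the sketch below. The MNIST objects from the earlier homework are not shown here, so x and y are simulated placeholders (hypothetical names) standing in for the pixel matrix and the digit labels. The tuned values of n_neighbors and min_dist are arbitrary illustrations, not recommendations.

```r
# Placeholder data: three well-separated Gaussian groups in 10 dimensions.
# Replace x with the first 1000 MNIST rows and y with their digit labels.
set.seed(1)
n_per   <- 100
centers <- rbind(rep(0, 10), rep(4, 10), rep(-4, 10))
y <- factor(rep(1:3, each = n_per))
x <- centers[as.integer(y), ] + matrix(rnorm(3 * n_per * 10), ncol = 10)

if (requireNamespace("umap", quietly = TRUE)) {
  # Default settings, taken from umap::umap.defaults
  fit_default <- umap::umap(x)

  # One way to tune: copy the defaults and change selected fields.
  # Assumption: these two values are arbitrary examples.
  cfg <- umap::umap.defaults
  cfg$n_neighbors <- 30
  cfg$min_dist    <- 0.3
  fit_tuned <- umap::umap(x, config = cfg)

  # The 2-D embedding is stored in $layout; color points by true label
  plot(fit_default$layout, col = as.integer(y), pch = 19, cex = 0.6,
       xlab = "UMAP 1", ylab = "UMAP 2", main = "Default UMAP settings")
  legend("topright", legend = levels(y),
         col = seq_along(levels(y)), pch = 19)
}
```

Plotting fit_tuned$layout in the same way gives a side-by-side comparison of the default and tuned embeddings.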