Instruction

Please remove this section when submitting your homework.

Students are encouraged to work together on homework and/or utilize advanced AI tools. However, sharing, copying, or providing any part of a homework solution or code to others is an infraction of the University’s rules on Academic Integrity. Any violation will be punished as severely as possible. Final submissions must be uploaded to Gradescope. No email or hard copy will be accepted. For late submission policy and grading rubrics, please refer to the course website.

Question 1: \(K\) Means Clustering

Load the MNIST data and take just digits 1, 6, and 7 in the first 1000 observations as the training data. Do the same for the second 1000 observations as the testing data.

  # Load the data
  load("mnist_first2000.RData")
  dim(mnist)
## [1] 2000  785
  1. [5 pts] Different from HW9, we will only perform PCA on the training data, and then apply the learned rotation matrix to the testing data. Carry this out using the prcomp() function and the associated predict() function, and extract the first 20 PCs for both the training and testing data. Use centering but not scaling when you perform the PCA. Comment on why this is OK for this dataset.
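The train/test PCA workflow can be sketched as follows. This is a minimal illustration on simulated stand-in data; in the actual homework you would replace `X_train` and `X_test` (hypothetical names) with the digit-1/6/7 subsets of the MNIST matrix.

```r
# Simulated stand-in data (the real assignment uses the mnist matrix)
set.seed(1)
X_train <- matrix(rnorm(100 * 30), 100, 30)
X_test  <- matrix(rnorm(100 * 30), 100, 30)

# Fit PCA on the training data only: center but do not scale
pca_fit <- prcomp(X_train, center = TRUE, scale. = FALSE)

# Training scores come from the fit; test scores come from predict(),
# which applies the training-set centering and rotation to new data
train_pc <- pca_fit$x[, 1:20]
test_pc  <- predict(pca_fit, newdata = X_test)[, 1:20]
dim(train_pc)
dim(test_pc)
```

The key point is that `predict()` reuses the centers and rotation learned from the training data, so no information from the test set leaks into the fitted PCA.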

  2. [25 pts] Now let's write our own version of the \(K\) Means algorithm using the training data. Keep in mind that the idea of \(K\) Means is

    1. Randomly assign each observation to one of the K clusters
    2. Compute the centroid of each cluster
    3. Assign each observation to the cluster with the closest centroid (using Euclidean distance)
    4. Repeat steps 2 and 3 until the clusters stop changing (or until some maximum number of iterations is reached)
    5. Return the final cluster assignments

After finishing the algorithm, compare your clustering result (labels) to the true digit labels (which you didn’t observe). Are they similar? Can you explain what you see?
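The five steps above can be sketched as a short function. This is only an illustrative skeleton on simulated two-cluster data (the assignment runs it on the 20 training PCs); the function and variable names are my own, and empty-cluster handling is omitted for brevity.

```r
# Simulated stand-in data: two well-separated Gaussian clusters
set.seed(42)
X <- rbind(matrix(rnorm(50 * 2), 50, 2),
           matrix(rnorm(50 * 2, mean = 4), 50, 2))

my_kmeans <- function(X, k, max_iter = 100) {
  n <- nrow(X)
  # Step 1: randomly assign each observation to one of the k clusters
  labels <- sample(k, n, replace = TRUE)
  for (iter in seq_len(max_iter)) {
    # Step 2: compute the centroid of each cluster
    # (empty-cluster handling omitted for brevity)
    centers <- t(sapply(seq_len(k), function(j)
      colMeans(X[labels == j, , drop = FALSE])))
    # Step 3: reassign each observation to the nearest centroid (Euclidean)
    d <- as.matrix(dist(rbind(centers, X)))[-(1:k), 1:k]
    new_labels <- apply(d, 1, which.min)
    # Step 4: stop when the assignments no longer change
    if (all(new_labels == labels)) break
    labels <- new_labels
  }
  labels  # Step 5: return the final cluster assignments
}

cl <- my_kmeans(X, k = 2)
table(cl)
```

Because the initial assignment is random, different runs can converge to different local optima, which is worth keeping in mind when comparing against the true digit labels.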

  3. [10 pts] Use the default kmeans() function to cluster the training data. Use nstart \(= 20\). Compare your result to the one you got in part 2. Are they the same (they could be slightly different)? Can you explain the result if they are not completely the same?
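A sketch of the built-in call, again on simulated stand-in data rather than the training PCs:

```r
# Simulated stand-in data
set.seed(7)
X <- rbind(matrix(rnorm(50 * 2), 50, 2),
           matrix(rnorm(50 * 2, mean = 4), 50, 2))

# nstart = 20 runs 20 random initializations and keeps the best
# (lowest total within-cluster sum of squares) solution
fit <- kmeans(X, centers = 2, nstart = 20)
table(fit$cluster)
fit$tot.withinss
```

Multiple restarts are why `kmeans()` with `nstart = 20` can beat a single run of a hand-written implementation: each restart is a chance to escape a poor local optimum.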

Question 2: Clustering of golub data

Take the golub data from HW3 and HW4 and perform some clustering on it. Remember that we will only use the gene expression part of the data, but later on we will compare our clustering with the true class labels (golub.cl).

  1. [30 pts] Perform hierarchical clustering on the data using the hclust() function.

    • Try three different methods: single, complete, and average.
    • Plot the dendrograms, choose cut-off points, and obtain the cluster labels (not necessarily 2 clusters) that you think are reasonable. Explain how you chose the cut-off points.
    • Comment on the difference between the three methods.
    • Which one would you prefer given that you know the true class labels?
    • If you are asked to single out an observation that does not seem to fit the patterns of the rest of the data, which observation would you pick? After removing this observation, how does your clustering result change?
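The linkage comparison can be set up as below. This is a sketch on simulated stand-in data (the assignment uses the golub expression matrix); the choice of cut is shown with `k = 2` for brevity, but `cutree(fit, h = ...)` cuts at a height read off the dendrogram.

```r
# Simulated stand-in data: two groups of observations
set.seed(3)
X <- rbind(matrix(rnorm(20 * 5), 20, 5),
           matrix(rnorm(20 * 5, mean = 3), 20, 5))
d <- dist(X)  # Euclidean distances between observations

# Fit all three linkage methods on the same distance matrix
fits <- lapply(c("single", "complete", "average"),
               function(m) hclust(d, method = m))

# plot(fits[[2]])  # inspect the dendrogram to pick a cut height
labels_complete <- cutree(fits[[2]], k = 2)  # or cutree(fit, h = cut_height)
table(labels_complete)
```

Single linkage tends to chain and is sensitive to outliers, complete linkage produces compact balanced clusters, and average linkage sits in between, which is the pattern to look for when comparing the three dendrograms.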
  2. [20 pts] Perform spectral clustering using the following steps:

    • Compute the similarity matrix using the 4 nearest neighbors. You can do this using the get.knn() function from the FNN package.
    • Plot the adjacency matrix using heatmap(). What information do you see?
    • Compute the Laplacian matrix and perform an eigen-decomposition to extract the top two embedded features.
    • Perform \(K\)-Means clustering on the embedded features with \(k = 2\).
    • Provide necessary plots and tables to show your results, and comment on your findings.
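The spectral pipeline above can be sketched end to end. To keep this self-contained it uses base R in place of `FNN::get.knn` and simulated stand-in data rather than golub; it builds a symmetrized 4-NN adjacency matrix, forms the unnormalized Laplacian, and clusters the eigenvectors of the two smallest eigenvalues.

```r
# Simulated stand-in data: two tight, well-separated clusters
set.seed(5)
X <- rbind(matrix(rnorm(30 * 2, sd = 0.3), 30, 2),
           matrix(rnorm(30 * 2, mean = 2, sd = 0.3), 30, 2))
n <- nrow(X); k_nn <- 4

# 4-nearest-neighbor adjacency matrix (base R in place of FNN::get.knn)
D <- as.matrix(dist(X))
A <- matrix(0, n, n)
for (i in seq_len(n)) {
  nbrs <- order(D[i, ])[2:(k_nn + 1)]  # skip self at position 1
  A[i, nbrs] <- 1
}
A <- pmax(A, t(A))  # symmetrize: keep an edge if either point names the other
# heatmap(A, Rowv = NA, Colv = NA)  # block structure reveals the clusters

# Unnormalized graph Laplacian: L = Degree - Adjacency
L <- diag(rowSums(A)) - A

# Embedded features: eigenvectors of the two smallest eigenvalues
# (eigen() sorts eigenvalues in decreasing order, so take the last columns)
ev <- eigen(L, symmetric = TRUE)
embed <- ev$vectors[, (n - 1):n]

cl <- kmeans(embed, centers = 2, nstart = 20)$cluster
table(cl)
```

If the 4-NN graph splits into two connected components, the two smallest eigenvalues are both (near) zero and their eigenvectors are component indicators, which is why K-Means on the embedding recovers the clusters.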
  3. [10 pts] Perform UMAP on the data and experiment with the number of nearest neighbors \(k = 4\). Provide necessary plots and tables to show your results, and comment on your findings.