Instruction

Please remove this section when submitting your homework.

Students are encouraged to work together on homework and/or utilize advanced AI tools. However, sharing, copying, or providing any part of a homework solution or code to others is an infraction of the University’s rules on Academic Integrity. Any violation will be punished as severely as possible. Final submissions must be uploaded to Gradescope. No email or hard copy will be accepted. For late submission policy and grading rubrics, please refer to the course website.

Question 1: \(K\) Means Clustering

Load the MNIST data and take just digits 1, 6, and 7 in the first 1000 observations as the training data. Do the same for the second 1000 observations as the testing data.

  # Load the data
  load("mnist_first2000.RData")
  dim(mnist)
## [1] 2000  785
  a. Different from HW9, we will only perform PCA on the training data, and then apply the learned rotation matrix to the testing data. Carry this out using the prcomp() function and the associated predict() function, and extract the first 20 PCs for both the training and testing data. Use centering but not scaling when you perform the PCA. Comment on why this is OK for this dataset.
  # training data: first 1000 observations; testing data: second 1000 observations
  train = mnist[1:1000, ]
  test = mnist[1001:2000, ]

  # keep only digits 1, 6, and 7
  train = train[train$Digit %in% c(1, 6, 7), ]
  test = test[test$Digit %in% c(1, 6, 7), ]

  # perform PCA on the training data
  pca = prcomp(train[, -1], center = TRUE, scale. = FALSE)
  train_pca = data.matrix(pca$x[, 1:20])
  test_pca = data.matrix(predict(pca, test[, -1])[, 1:20])

It is OK not to scale the data because all pixel variables are already measured on the same scale (pixel intensities), so no variable dominates the PCA simply because of its units.
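
As a quick check of this claim (a minimal sketch, not required by the assignment; it assumes every column of train except Digit is a pixel intensity), we can verify that all pixel columns live on one common scale:

  # all pixel values share the same intensity scale, so no column can
  # dominate the PCA purely because of its units
  range(data.matrix(train[, -1]))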

  b. Now let's write our own version of the \(K\) Means algorithm using the training data. Keep in mind that the idea of \(K\) Means is:

    1. Randomly assign each observation to one of the K clusters
    2. Compute the centroid of each cluster
    3. Assign each observation to the cluster with the closest centroid (using Euclidean distance)
    4. Repeat steps 2 and 3 until the clusters stop changing (or until some maximum number of iterations is reached)
    5. Return the final cluster assignments

After finishing the algorithm, compare your clustering result (labels) to the true digit labels (which you didn't observe). Are they similar? Can you explain what you see?

  # write your own K Means algorithm
  # randomly assign each observation to one of the K clusters
  k = 3
  clusters = sample(1:k, nrow(train_pca), replace = TRUE)

  # iterate until the clusters stop changing (or for at most 100 iterations)
  for (i in 1:100)
  {
    # compute the centroid of each cluster
    centroids = t(data.matrix(aggregate(train_pca, list(clusters), mean)[, -1]))

    # calculate the distance of each observation to each centroid
    alldist = matrix(NA, nrow = nrow(train_pca), ncol = k)

    for (j in 1:k)
    {
      alldist[, j] = rowSums( ( sweep(train_pca, 2, centroids[, j], "-") )^2 )
    }

    # assign each observation to the cluster with the closest centroid
    new_clusters = apply(alldist, 1, which.min)

    # check stopping condition
    if (all(clusters == new_clusters))
    {
      break
    } else {
      clusters = new_clusters
    }
  }

  # final cluster assignments
  clusters
##   [1] 2 2 2 1 2 3 1 2 1 2 1 1 3 1 2 3 3 2 1 1 2 2 3 2 1 2 2 3 1 3 1 3 1 3 2 3 2
##  [38] 3 2 2 1 2 2 3 2 1 2 1 3 3 2 1 3 1 2 1 3 1 2 2 2 2 3 1 3 3 2 2 1 2 2 2 2 3
##  [75] 1 1 3 3 3 2 1 1 3 2 2 2 3 1 3 2 2 2 2 1 3 2 1 3 1 3 3 2 2 1 2 1 3 1 3 1 2
## [112] 3 2 3 2 2 2 2 1 2 3 3 2 3 3 1 2 1 2 2 1 2 2 3 2 1 3 3 1 1 3 2 3 2 1 2 2 2
## [149] 1 2 3 3 2 2 3 3 2 2 2 2 3 3 1 3 1 2 2 2 2 3 3 3 2 1 2 1 2 1 2 2 2 2 3 1 3
## [186] 2 2 1 3 2 1 2 1 1 3 2 1 3 2 2 3 2 1 3 2 3 2 3 2 2 3 2 2 1 3 3 1 1 1 2 1 2
## [223] 2 1 3 3 2 2 1 3 2 2 1 3 3 2 3 1 1 2 1 2 3 3 1 1 2 3 3 2 2 2 1 3 1 3 3 1 3
## [260] 2 1 1 3 3 2 3 1 2 2 2 3 1 1 1 3 3 2 2 3 1 3 2 3 2 2 3 1 3 2 3 1 3 2 2 2 2
## [297] 3 3 3 2 2 2 3 1 2 3 2 3 1 1 2 2 3 3 3 3 1 1 2 3 2 1 2 3 3 1 1
  table(clusters, train$Digit)
##         
## clusters   1   6   7
##        1   1  85   1
##        2 113   7  15
##        3   2   2 101
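
The clustering matches the true digits quite well: cluster 1 is almost entirely 6s, cluster 2 is mostly 1s, and cluster 3 is mostly 7s, with the main confusion being some 7s landing in the 1-dominated cluster (handwritten 1s and 7s can look alike). To turn the table into a single number, one option (a small sketch, not required by the assignment; tab, majority, and matched are hypothetical helper names) is to map each cluster to its majority digit and compute the proportion of matching labels:

  # map each cluster to its most frequent digit and measure agreement
  tab = table(clusters, train$Digit)
  majority = colnames(tab)[apply(tab, 1, which.max)]
  matched = majority[clusters]
  mean(matched == as.character(train$Digit))
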
  c. Use the default kmeans() function to cluster the training data. Use nstart \(= 20\). Compare your result to the one you got in part b. Are they the same (they could be slightly different)? Can you explain the result if they are not completely the same?
  # use the default kmeans() function
  kmeans_result = kmeans(train_pca, centers = 3, nstart = 20)

  table(kmeans_result$cluster, clusters)
##    clusters
##       1   2   3
##   1   2 135   0
##   2  84   0   0
##   3   1   0 105
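
The two results describe essentially the same partition: our cluster 2 corresponds to kmeans cluster 1, our cluster 1 to kmeans cluster 2, and our cluster 3 to kmeans cluster 3, with only three observations assigned differently. The cluster numbers themselves differ because K Means labels are arbitrary, and small disagreements can arise because the two runs start from different random initializations and may settle in slightly different local optima. A label-invariant summary (a sketch only; it assumes the mclust package is installed, which the assignment does not require) is the adjusted Rand index, which equals 1 for identical partitions:

  # label-invariant agreement between our K Means and the built-in kmeans()
  library(mclust)
  adjustedRandIndex(kmeans_result$cluster, clusters)
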

Question 2: Clustering of the golub Data

Take the golub data from HW3 and HW4 and perform some clustering on it. Remember that we will only use the gene expression part of the data, but later on we will compare our clustering with the true class labels (golub.cl).

  a. Perform hierarchical clustering on the data using the hclust() function.
    • Try three different methods: single, complete, and average.
    • Plot the dendrograms, choose cut-off points, and obtain the cluster labels (not necessarily 2) that you think are reasonable. Provide an explanation of how you chose the cut-off points.
    • Comment on the difference between the three methods.
    • Which one would you prefer given that you know the true class labels?
    • If you are asked to single out some observations that do not seem to fit into the patterns of the rest of the data, which observation would you pick? After removing this observation, how does your clustering result change?
  library(multtest)
  data(golub)

  # Extract expression data and class label
  X = t(as.matrix(golub))
  y = as.factor(golub.cl)

  # perform hierarchical clustering
  hclust_single = hclust(dist(X), method = "single")
  hclust_complete = hclust(dist(X), method = "complete")
  hclust_average = hclust(dist(X), method = "average")

  # plot the dendrogram
  plot(hclust_single)

  clusters = cutree(hclust_average, k = 2)
  table(clusters, y)
##         y
## clusters  0  1
##        1 26 11
##        2  1  0
  plot(hclust_complete)

  clusters = cutree(hclust_complete, k = 2)
  table(clusters, y)
##         y
## clusters  0  1
##        1 27  5
##        2  0  6
  plot(hclust_average)

  clusters = cutree(hclust_average, k = 3)
  table(clusters, y)
##         y
## clusters  0  1
##        1 24  0
##        2  2 11
##        3  1  0
  # remove observation 21
  X_sub = X[-21, ]
  y_sub = y[-21]

  hclust_average = hclust(dist(X_sub), method = "average")
  plot(hclust_average)

  clusters = cutree(hclust_average, k = 2)
  table(clusters, y_sub)
##         y_sub
## clusters  0  1
##        1 24  0
##        2  2 11
  # perform and plot PCA on the original data 
  pca = prcomp(X, center = TRUE, scale. = FALSE)
  newcol = c("red", "blue")[as.numeric(y)]
  newcol[21] = "green"
  plot(pca$x[, 1:2], col = newcol, pch = 19)

I removed observation 21 because it appears to be an outlier in the dendrogram. After removing it, the clustering result improves substantially. Overall, the average linkage method seems to be the best at recovering the true class labels. However, in practice we would not know the true class labels. For example, if we perform PCA on the original data, we can see that the first PC separates the two classes very well, yet observation 21 does not stand out as an outlier there.
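
To make the choice of the outlier less purely visual, a small sketch (hc_avg_full and cut2 are hypothetical helper names; it refits average linkage on the full data because hclust_average was refit above on the reduced data) cuts the full average linkage tree into two clusters and reports which observation forms the singleton cluster:

  # refit average linkage on all 38 observations and locate the singleton cluster
  hc_avg_full = hclust(dist(X), method = "average")
  cut2 = cutree(hc_avg_full, k = 2)
  table(cut2)
  which(cut2 == names(which.min(table(cut2))))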

  b. Perform spectral clustering using the following steps:
  library(FNN)
  n = nrow(X)
  W = matrix(0, n, n)
  
  # get neighbor index for each observation
  nn = get.knn(X, k=4)

  # write into W
  for (i in 1:n)
    W[i, nn$nn.index[i, ]] = 1
  
  # W is not necessarily symmetric, so symmetrize it
  W = 0.5*(W + t(W))

  # plot the adjacency matrix
  heatmap(W, Rowv = NA, Colv=NA, symm = TRUE, revC = TRUE) 

  # compute the degree of each vertex
  d = colSums(W)
  
  # the (unnormalized) graph Laplacian L = D - W
  L = diag(d) - W
  
  # eigen-decomposition (eigen() returns eigenvalues in decreasing order)
  f = eigen(L, symmetric = TRUE)

  # embed using the eigenvectors of the 2nd and 3rd smallest eigenvalues;
  # the smallest corresponds to the trivial constant eigenvector
  xembed = f$vectors[, (length(f$values)-2) : (length(f$values)-1) ]

  scfit = kmeans(xembed, centers = 2, nstart = 20)
  plot(xembed, col = c("red", "blue")[as.numeric(y)], pch = 19)

  table(scfit$cluster, y)
##    y
##      0  1
##   1  2 11
##   2 25  0

From the heatmap, we can see that the observations fall into two groups that align closely with the true class labels. The spectral clustering result is likewise very similar to the true labels. We should note that all of these results are still based on Euclidean distance, which is likely why they agree with each other.
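
One small check on the graph construction (a sketch, reusing the eigen-decomposition f computed above): the number of (near-)zero eigenvalues of the Laplacian equals the number of connected components of the kNN graph, and for a single connected component the zero eigenvalue has a constant eigenvector, which is why the embedding above skips the very last eigenvector:

  # smallest eigenvalues of L; a single (near-)zero value means the graph is connected
  tail(f$values, 3)
  sum(abs(f$values) < 1e-8)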

  c. Perform UMAP on the data and experiment with the number of nearest neighbors \(k = 4\). Provide the necessary plots and tables to show your results, and comment on your findings.
  library(umap)
  myumap.tuning = umap.defaults
  myumap.tuning$n_neighbors = 4

  # pass the tuned configuration so the k = 4 neighbors setting is actually used
  golub.umap = umap(X, config = myumap.tuning)
  golub.umap
## umap embedding of 38 items in 2 dimensions
## object components: layout, data, knn, config
  plot(golub.umap$layout, col = c("red", "blue")[as.numeric(y)], pch = 19)

The UMAP result is similar to the others and is able to almost perfectly separate the two classes, even though it is based on a different embedding approach.
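
The question also asks for tables; one way to produce one (a sketch only; umap_km is a hypothetical name, and the exact counts depend on the UMAP random seed) is to run kmeans() on the 2-dimensional UMAP layout and cross-tabulate the resulting clusters against the true class labels:

  # cluster the 2-d UMAP embedding and compare with the true class labels
  umap_km = kmeans(golub.umap$layout, centers = 2, nstart = 20)
  table(umap_km$cluster, y)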