Question 1: Clustering of MNIST data
Load the MNIST data and take just digits 1, 6, and 7 in the first 1000 observations as the training data. Do the same for the second 1000 observations as the testing data.
# Load the data
load("mnist_first2000.RData")
dim(mnist)
## [1] 2000 785
Use the prcomp() function and the associated predict() function to extract the first 20 PCs for both the training and testing data. Use centering but not scaling when you perform the PCA. Comment on why this is OK for this dataset.
# get training data from the first 1000 observations
train = mnist[1:1000, ]
test = mnist[1001:2000, ]
train = train[train$Digit == 1 | train$Digit == 6 | train$Digit == 7, ]
test = test[test$Digit == 1 | test$Digit == 6 | test$Digit == 7, ]
# perform PCA on the training data
pca = prcomp(train[, -1], center = TRUE, scale. = FALSE)
train_pca = data.matrix(pca$x[, 1:20])
test_pca = data.matrix(predict(pca, test[, -1])[, 1:20])
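As a quick check that is not required by the question, we could look at how much of the total variance the first 20 PCs retain; this is a small sketch based only on the standard deviations returned by prcomp().
# proportion of the total (training) variance explained by the first 20 PCs
var_explained = cumsum(pca$sdev^2) / sum(pca$sdev^2)
var_explained[20]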
It is OK not to scale the data because all of the pixels are measured on the same scale (the same range of intensity values), so no pixel would dominate the PCA simply because of its units.
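A minimal sanity check of this claim, assuming (as in the PCA code above) that every column except Digit is a pixel, is to confirm that all pixel values share the same range:
# all pixel columns are measured on the same scale
range(data.matrix(train[, -1]))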
Now let's write our own version of the \(K\) Means algorithm using the training data. Keep in mind that the idea of \(K\) Means is to alternate between two steps until the cluster assignments stop changing: compute the centroid of each cluster as the mean of the observations currently assigned to it, and then reassign each observation to the cluster with the closest centroid.
After finishing the algorithm, compare your clustering result (labels) to the true digit labels (which you didn't observe during clustering). Are they similar? Can you explain what you see?
# write your own K Means algorithm
# randomly assign each observation to one of the K clusters
k = 3
clusters = sample(1:k, nrow(train_pca), replace = TRUE)
# iterate until the cluster assignments stop changing (at most 100 iterations)
for (i in 1:100)
{
# compute the centroid of each cluster
centroids = t(data.matrix(aggregate(train_pca, list(clusters), mean)[, -1]))
# calculate the distance of each observation to each centroid
alldist = matrix(NA, nrow = nrow(train_pca), ncol = k)
for (j in 1:k)
{
alldist[, j] = rowSums( ( sweep(train_pca, 2, centroids[, j], "-") )^2 )
}
# assign each observation to the cluster with the closest centroid
new_clusters = apply(alldist, 1, which.min)
# check stopping condition
if (all(clusters == new_clusters))
{
break
}else {
clusters = new_clusters
}
}
# print the final cluster assignments
clusters
## [1] 2 2 2 1 2 3 1 2 1 2 1 1 3 1 2 3 3 2 1 1 2 2 3 2 1 2 2 3 1 3 1 3 1 3 2 3 2
## [38] 3 2 2 1 2 2 3 2 1 2 1 3 3 2 1 3 1 2 1 3 1 2 2 2 2 3 1 3 3 2 2 1 2 2 2 2 3
## [75] 1 1 3 3 3 2 1 1 3 2 2 2 3 1 3 2 2 2 2 1 3 2 1 3 1 3 3 2 2 1 2 1 3 1 3 1 2
## [112] 3 2 3 2 2 2 2 1 2 3 3 2 3 3 1 2 1 2 2 1 2 2 3 2 1 3 3 1 1 3 2 3 2 1 2 2 2
## [149] 1 2 3 3 2 2 3 3 2 2 2 2 3 3 1 3 1 2 2 2 2 3 3 3 2 1 2 1 2 1 2 2 2 2 3 1 3
## [186] 2 2 1 3 2 1 2 1 1 3 2 1 3 2 2 3 2 1 3 2 3 2 3 2 2 3 2 2 1 3 3 1 1 1 2 1 2
## [223] 2 1 3 3 2 2 1 3 2 2 1 3 3 2 3 1 1 2 1 2 3 3 1 1 2 3 3 2 2 2 1 3 1 3 3 1 3
## [260] 2 1 1 3 3 2 3 1 2 2 2 3 1 1 1 3 3 2 2 3 1 3 2 3 2 2 3 1 3 2 3 1 3 2 2 2 2
## [297] 3 3 3 2 2 2 3 1 2 3 2 3 1 1 2 2 3 3 3 3 1 1 2 3 2 1 2 3 3 1 1
table(clusters, train$Digit)
##
## clusters 1 6 7
## 1 1 85 1
## 2 113 7 15
## 3 2 2 101
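To put a single number on this agreement, one optional sketch is to map each cluster to the digit it most often contains and compute the implied accuracy; this majority-label mapping is my own addition, not part of the assignment.
# map each cluster to its most common digit, then compute the implied accuracy
tab = table(clusters, train$Digit)
majority_digit = colnames(tab)[apply(tab, 1, which.max)]
mean(majority_digit[clusters] == train$Digit)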
Use the kmeans() function to cluster the training data. Use nstart \(= 20\). Compare your result to the one you got in part b. Are they the same (they could be slightly different)? Can you explain the result if they are not completely the same?
# use the default kmeans() function
kmeans_result = kmeans(train_pca, centers = 3, nstart = 20)
table(kmeans_result$cluster, clusters)
## clusters
## 1 2 3
## 1 2 135 0
## 2 84 0 0
## 3 1 0 105
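If we want a single summary of how well the two clusterings agree (ignoring the arbitrary labeling of the clusters), the adjusted Rand index is a common choice; this sketch assumes the mclust package is installed.
# adjusted Rand index between our K-means labels and the kmeans() labels
library(mclust)
adjustedRandIndex(clusters, kmeans_result$cluster)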
Question 2: Clustering of golub data
Take the golub data from HW3 and HW4 and let's perform some clustering on it. Remember that we will only use the gene expression part of the data, but later on, compare our clustering with the true class labels (golub.cl).
Perform hierarchical clustering on the gene expression data using the hclust() function, with the linkage methods single, complete, and average.
library(multtest)
data(golub)
# Extract expression data and class label
X = t(as.matrix(golub))
y = as.factor(golub.cl)
# perform hierarchical clustering
hclust_single = hclust(dist(X), method = "single")
hclust_complete = hclust(dist(X), method = "complete")
hclust_average = hclust(dist(X), method = "average")
# plot the dendrogram
plot(hclust_single)
# cut the average-linkage tree into 2 clusters and compare with the true labels
clusters = cutree(hclust_average, k = 2)
table(clusters, y)
## y
## clusters 0 1
## 1 26 11
## 2 1 0
plot(hclust_complete)
clusters = cutree(hclust_complete, k = 2)
table(clusters, y)
## y
## clusters 0 1
## 1 27 5
## 2 0 6
plot(hclust_average)
clusters = cutree(hclust_average, k = 3)
table(clusters, y)
## y
## clusters 0 1
## 1 24 0
## 2 2 11
## 3 1 0
# remove observation 21
X_sub = X[-21, ]
y_sub = y[-21]
hclust_average = hclust(dist(X_sub), method = "average")
plot(hclust_average)
clusters = cutree(hclust_average, k = 2)
table(clusters, y_sub)
## y_sub
## clusters 0 1
## 1 24 0
## 2 2 11
# perform and plot PCA on the original data
pca = prcomp(X, center = TRUE, scale. = FALSE)
newcol = c("red", "blue")[as.numeric(y)]
newcol[21] = "green"
plot(pca$x[, 1:2], col = newcol, pch = 19)
I removed observation 21 because it appears to be an outlier in the dendrogram. After removing it, the clustering result is much better. Overall, the average linkage method seems to be the best at recovering the true class labels. However, in practice we would not know the true class labels, so we could not use them to pick the best linkage. For example, if we perform PCA on the original data, we can see that the first PC separates the two classes very well, yet observation 21 (plotted in green) does not look like an outlier in the PCA view.
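As an optional double-check (not required by the question), we could compare each observation's average Euclidean distance to all other observations, since this is the distance hclust() used above; an unusually large value would support treating that observation as an outlier.
# average Euclidean distance from each observation to all others
D = as.matrix(dist(X))
avg_dist = rowMeans(D)
which.max(avg_dist)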
Now let's perform spectral clustering. Construct a \(k\)-nearest-neighbor adjacency matrix (with \(k = 4\)) using the FNN package. When you plot the adjacency matrix with heatmap(), what information do you see?
library(FNN)
n = nrow(X)
W = matrix(0, n, n)
# get neighbor index for each observation
nn = get.knn(X, k=4)
# write into W
for (i in 1:n)
W[i, nn$nn.index[i, ]] = 1
# W is not necessarily symmetric, so symmetrize it
W = 0.5*(W + t(W))
# plot the adjacency matrix
heatmap(W, Rowv = NA, Colv=NA, symm = TRUE, revC = TRUE)
# compute the degree of each vertex
d = colSums(W)
# the laplacian matrix
L = diag(d) - W
# eigen-decomposition
f = eigen(L, symmetric = TRUE)
# eigen() returns eigenvalues in decreasing order, so take the eigenvectors
# for the second- and third-smallest eigenvalues (skipping the smallest one,
# which corresponds to the constant eigenvector)
xembed = f$vectors[, (length(f$values)-2) : (length(f$values)-1) ]
scfit = kmeans(xembed, centers = 2, nstart = 20)
plot(xembed, col = c("red", "blue")[as.numeric(y)], pch = 19)
table(scfit$cluster, y)
## y
## 0 1
## 1 2 11
## 2 25 0
From the heatmap, we can see that the observations form two groups that align very well with the true class labels. The spectral clustering result is also very similar to the true labels. We should note that all of these results are still based on Euclidean distance, which is likely why they agree so closely with one another.
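Since the clustering structure is encoded in the small eigenvalues of the Laplacian, an optional sketch is to inspect the few smallest eigenvalues of L; a clear gap after the first couple of them would be consistent with two well-separated groups.
# the smallest eigenvalues of L (eigen() returns eigenvalues in decreasing order)
round(tail(f$values, 5), 4)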
library(umap)
# set n_neighbors = 4 in a copy of the default configuration and pass it to umap()
myumap.tuning = umap.defaults
myumap.tuning$n_neighbors = 4
golub.umap = umap(X, config = myumap.tuning)
golub.umap
## umap embedding of 38 items in 2 dimensions
## object components: layout, data, knn, config
plot(golub.umap$layout, col = c("red", "blue")[as.numeric(y)], pch = 19)
The UMAP result is similar to the others and is able to almost perfectly separate the two classes, even though it is based on a different embedding method.
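As an optional follow-up, we could also run K-means directly on the two-dimensional UMAP layout and compare with the true labels, analogous to what we did for spectral clustering.
# cluster the UMAP embedding and compare with the true class labels
umap_clusters = kmeans(golub.umap$layout, centers = 2, nstart = 20)
table(umap_clusters$cluster, y)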