Please remove this section when submitting your homework.
Students are encouraged to work together on homework and/or utilize advanced AI tools. However, sharing, copying, or providing any part of a homework solution or code to others is an infraction of the University’s rules on Academic Integrity. Any violation will be punished as severely as possible. Final submissions must be uploaded to Gradescope. No email or hard copy will be accepted. For late submission policy and grading rubrics, please refer to the course website.
Submit a single file named HWx_yourNetID.pdf. For example, HW01_rqzhu.pdf. Please note that this must be a .pdf file; .html format cannot be accepted. Make all of your R code chunks visible for grading.
If you use the .Rmd file as a template, be sure to remove this instruction section.
Make sure that your R version is \(\geq 4.0.0\). This will ensure your random seed generation is the same as everyone else's. Please note that updating the R version may require you to reinstall all of your packages.
In this question, we will code our own k-means clustering algorithm. The key requirement is that you cannot write the code yourself. You must write a proper prompt describing your intention for each function so that GPT (or whatever AI tool you are using) can understand your way of thinking clearly and provide you with the correct code. We will use the handwritten digits dataset from HW9 (2600 observations). Recall that the k-means algorithm iterates between two steps: updating each centroid as the mean of the observations currently assigned to it, and reassigning each observation to its closest centroid.
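One common way to write the two alternating k-means steps is given below. The notation is an assumption rather than the course's own: \(x_i\) is the \(i\)-th observation, \(m_k\) is the centroid of cluster \(k\), and \(C(i)\) is the cluster label of observation \(i\).

```latex
% Step 1 (centroid update): average the observations currently in cluster k
\[
m_k = \frac{1}{|\{i : C(i) = k\}|} \sum_{i : C(i) = k} x_i, \qquad k = 1, \ldots, K
\]
% Step 2 (assignment update): send each observation to its closest centroid
\[
C(i) = \arg\min_{1 \le k \le K} \lVert x_i - m_k \rVert^2, \qquad i = 1, \ldots, n
\]
```

Neither step can increase the within-cluster sum of squares, which is why the iteration settles down, although possibly at a local rather than a global optimum.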
You do not need to split the data into train and test. We will use the whole dataset as the training data. Restrict the data to just the digits 2, 4, and 8. Then perform marginal variance screening to reduce to the top 50 features. After this, complete the following tasks. Please read all sub-questions a, b, and c before you start, and think about how the different pieces of code should be structured and what the inputs and outputs should be so that they can be integrated. For each question, you need to document your prompt to GPT (or whatever AI tool you are using) to generate the code. You cannot write your own code or modify the code generated by the AI tool in any of the function definitions.
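The preprocessing described above (keeping three digits, then keeping the 50 highest-variance features) might look like the following sketch. It runs on synthetic data because the HW9 file is not reproduced here. The object names digits, X_all, X, and y, and the layout with the label in the first column, are assumptions. Marginal variance screening is interpreted here as ranking each feature by its own sample variance.

```r
set.seed(1)

# Synthetic stand-in for the digits data (assumed layout: label first, then pixels).
n <- 200
p <- 100
pixel_sd <- rep(seq(0.1, 2, length.out = p), each = n)  # column j gets its own spread
digits <- data.frame(
  label = sample(0:9, n, replace = TRUE),
  matrix(rnorm(n * p, sd = pixel_sd), nrow = n, ncol = p)
)

# Keep only the digits 2, 4, and 8.
sub <- digits[digits$label %in% c(2, 4, 8), ]
y <- sub$label
X_all <- as.matrix(sub[, -1])

# Marginal variance screening: rank features by variance and keep the top 50.
feature_var <- apply(X_all, 2, var)
top_idx <- order(feature_var, decreasing = TRUE)[1:50]
X <- X_all[, top_idx]

dim(X)
```

Screening is done after subsetting, so the variances reflect only the three digits being clustered.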
[20 pts] In this question, we want to ask GPT to write a function called cluster_mean_update() that takes in three arguments: the data \(X\), the number of clusters \(K\), and the cluster assignments. It outputs the updated centroids. Think about how you should describe the task to GPT (your specific requirements for how these arguments and the output should be structured) so that it can understand your intention. You need to request that the AI tool provide sufficient comments for each step of the function. After this, test your function with the training data, \(K = 3\), and a random cluster assignment.
[20 pts] Next, we want to ask GPT to write a function called cluster_assignments() that takes in two arguments: the data \(X\) and the centroids. It outputs the cluster assignments. Think about how you should describe the task to GPT so that this function is compatible with the previous function to achieve the k-means clustering. You need to request that the AI tool provide sufficient comments for each step of the function. After this, test your function with the training data and the centroids from the previous step.
[20 pts] Finally, we want to ask GPT to write a function called kmeans(). What arguments should you supply? And what outputs should be requested? Again, think about how you should describe the task to GPT. Test your function with the training data, \(K = 3\), and the maximum number of iterations set to 20. For this code, you can skip the multiple starting points strategy. However, keep in mind that your solution may be suboptimal.
[5 pts] After completing the above tasks, check your clustering results against the true labels in the training dataset. Is your code working as expected? What is the accuracy of the clustering? From this point on, you are no longer restricted to using the AI tool. Comment on whether you think the code generated by GPT can be improved in any way.
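Cluster labels are arbitrary (cluster 1 need not correspond to digit 2), so an accuracy figure only makes sense after matching clusters to true labels. The sketch below tries every one-to-one matching and reports the best agreement. The helper names all_perms() and matched_accuracy() are hypothetical, and the toy vectors are made up for illustration. This is one possible definition of clustering accuracy, not necessarily the one the grading rubric expects.

```r
# Hypothetical helper: all orderings of the elements of v, as a list of vectors.
all_perms <- function(v) {
  if (length(v) <= 1) return(list(v))
  out <- list()
  for (i in seq_along(v)) {
    for (rest in all_perms(v[-i])) {
      out[[length(out) + 1]] <- c(v[i], rest)
    }
  }
  out
}

# Hypothetical helper: best agreement over all one-to-one cluster/label matchings.
matched_accuracy <- function(clusters, truth) {
  tab <- table(factor(clusters), factor(truth))  # rows: clusters, columns: labels
  stopifnot(nrow(tab) == ncol(tab))
  K <- nrow(tab)
  best <- 0
  for (perm in all_perms(seq_len(K))) {
    # Cluster k is matched to label perm[k]; count the agreeing observations.
    hits <- sum(tab[cbind(seq_len(K), perm)])
    best <- max(best, hits)
  }
  best / length(truth)
}

# Toy example: one observation with true label 2 lands in the "wrong" cluster.
truth    <- c(2, 2, 2, 4, 4, 4, 8, 8, 8)
clusters <- c(3, 3, 1, 1, 1, 1, 2, 2, 2)
acc <- matched_accuracy(clusters, truth)
acc
```

Enumerating permutations is only practical for small \(K\); with \(K = 3\) there are just six matchings to check.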
In this question, we will use the hierarchical clustering algorithm to cluster the training data. We will use the same training data as in Question 1. Directly use the hclust() function in R to perform hierarchical clustering, but test different linkage methods (single, complete, and average) with Euclidean distance.
[10 pts] Plot the three dendrograms and compare them. What do you observe? Which linkage method do you think is the most appropriate for this dataset?
[10 pts] Choose your linkage method, cut the dendrogram to obtain 3 clusters and compare the clustering results with the true labels in the training dataset. What is the accuracy of the clustering? Comment on its performance.
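The workflow for this question (one distance matrix, three linkages, then a cut at three clusters) can be sketched as follows. The data are synthetic, the object names X_demo, fits, and hc_clusters are assumptions, and average linkage is used for the cut purely as an example, not as a recommended answer.

```r
set.seed(42)

# Synthetic stand-in: three well-separated groups of 20 points in two dimensions.
X_demo <- rbind(
  matrix(rnorm(40, mean = 0), ncol = 2),
  matrix(rnorm(40, mean = 5), ncol = 2),
  matrix(rnorm(40, mean = 10), ncol = 2)
)

# Euclidean distances are computed once and reused for every linkage.
d <- dist(X_demo, method = "euclidean")

linkages <- c("single", "complete", "average")
fits <- lapply(linkages, function(m) hclust(d, method = m))
names(fits) <- linkages

# Draw the three dendrograms side by side for comparison.
op <- par(mfrow = c(1, 3))
for (m in linkages) {
  plot(fits[[m]], main = paste(m, "linkage"), labels = FALSE, xlab = "", sub = "")
}
par(op)

# Cut one tree into three clusters and look at the cluster sizes.
hc_clusters <- cutree(fits[["average"]], k = 3)
table(hc_clusters)
```

On the real digits data, replacing X_demo with the screened feature matrix is the only change needed, and the resulting labels can be compared with the true digits in the same way as for k-means.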
For this question, let’s use the spectral clustering function specc() from the kernlab package. Let’s also consider all pixels instead of just the top 50 features. Specify your own choice of kernel and number of clusters. Report your results and compare them with the previous clustering methods.
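A minimal sketch of calling specc() is shown below. It uses the spirals dataset that ships with kernlab rather than the digits data. The Gaussian kernel ("rbfdot") and the two clusters are illustrative choices only, and the name spec_clusters is an assumption. Extracting the labels through the .Data slot relies on the returned object extending a plain vector, which is my understanding of the class but is worth confirming against the kernlab documentation.

```r
# Run only if kernlab is installed, so the sketch does not fail without it.
if (requireNamespace("kernlab", quietly = TRUE)) {
  # Example data shipped with kernlab: two intertwined spirals.
  data(spirals, package = "kernlab")

  # Spectral clustering with a Gaussian (RBF) kernel and two clusters.
  fit <- kernlab::specc(spirals, centers = 2, kernel = "rbfdot")

  # The fitted object extends an integer vector of cluster labels.
  spec_clusters <- fit@.Data
  print(table(spec_clusters))
}
```

For the digits, the matrix of all pixels would take the place of spirals, with centers set to your chosen number of clusters.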