Final Project: Fashion MNIST Classification
- Timeline
- Release date: Mar 31, 2022
- Due date: 11:59 PM, Thur, May 5th
- Data Information
- We will use the Kaggle Fashion MNIST data
- Download the training data, fashion-mnist_train.csv, and the testing data, fashion-mnist_test.csv.
- The data contains 70,000 28 \(\times\) 28 images (60,000 as training and
10,000 as testing)
- This is a multi-class classification problem. The goal is to predict
the class label in the testing data. The project has two parts: in the
first part, we use only two classes and perform binary classification;
in the second part, we use all classes.
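As a starting point, the loading and two-class subsetting steps can be sketched as below. This is a minimal sketch, not a required approach. It assumes the Kaggle CSV layout of one header row followed by rows of `label, pixel1, ..., pixel784`, and the standard Fashion-MNIST label codes (4 = Coat, 6 = Shirt); check both against your downloaded files. The helper names `load_fashion_csv` and `subset_classes` are hypothetical.

```python
import csv
import io

# Standard Fashion-MNIST label codes for the two classes in the binary part
# (assumption: the Kaggle CSV uses the same coding).
COAT, SHIRT = 4, 6

def load_fashion_csv(handle):
    """Read rows of `label, pixel1, ...` into (labels, pixels)."""
    reader = csv.reader(handle)
    next(reader)  # skip the header row (assumed to be present)
    labels, pixels = [], []
    for row in reader:
        labels.append(int(row[0]))
        pixels.append([int(v) / 255.0 for v in row[1:]])  # scale to [0, 1]
    return labels, pixels

def subset_classes(labels, pixels, keep=(COAT, SHIRT)):
    """Keep only the rows whose label is in `keep`."""
    kept = [(y, x) for y, x in zip(labels, pixels) if y in keep]
    return [y for y, _ in kept], [x for _, x in kept]

# Tiny in-memory stand-in for the real file (3 pixels instead of 784).
demo = io.StringIO(
    "label,pixel1,pixel2,pixel3\n"
    "4,0,255,51\n"
    "9,0,0,0\n"
    "6,255,0,102\n"
)
labels, pixels = load_fashion_csv(demo)
bin_labels, bin_pixels = subset_classes(labels, pixels)
print(bin_labels)  # [4, 6]
```

With the real data, the same function can be called on an open file, for example `with open("fashion-mnist_train.csv", newline="") as f: labels, pixels = load_fashion_csv(f)`.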
- Submission Guideline
- Pick one team member as the team lead to submit the full
report and supplementary files to Gradescope.
- The full report should be in .pdf format. It should
include a cover page and no more than 12 pages of
content. You are allowed to include supplementary files after
the main body of your report. However, they will not be used for
grading. Name your file the same way as the homework, e.g.,
Project_yourNetID.pdf.
- On the cover page of the report, you should clearly
indicate all of your team members (their names and NetIDs).
- Other team members should submit a one-page file (same as the cover
page), which indicates only the members of the team and their team lead.
DO NOT let multiple team members submit the
same/different final reports. Doing so will lead to a 5-point penalty
for all of your team members, and if those reports are different, the
lowest score will be used.
Project Report Requirement
- [5 Points, 1 page] Project description and summary.
This part should summarize your goal, approach, and conclusion.
- [10 Points, 1 page] Literature review. The
Fashion-MNIST dataset is widely used in the literature to
benchmark machine learning models. Search for the relevant literature
on this dataset. Then do the following:
- To the best of your knowledge after these searches, what is the best
accuracy on this dataset so far?
- Read two papers/reports among the ones you searched. Summarize their
approaches in language that can be easily understood. Provide
appropriate citations and links to these papers.
- [20 Points, 1-2 pages] Summary statistics, data processing
and unsupervised learning.
- Provide a frequency table of the outcome variable in both training
and testing data.
- Perform a PCA analysis of the data and mark the points with their
labels. Since the dataset is large, you need to present the results in a
way that is readable.
- Apply a clustering algorithm to the training data. How many
clusters did you decide to use? What is the dominant \(y\) label in each of these clusters? Do
your clusters help to separate these labels?
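The frequency table and the per-cluster label summary asked for above can be sketched with the standard library alone. This is only an illustration under stated assumptions: the cluster assignments below are made up, and in practice they would come from whichever clustering algorithm you choose. The function names `frequency_table` and `dominant_labels` are hypothetical.

```python
from collections import Counter

def frequency_table(labels):
    """Return (label, count, proportion) rows sorted by label."""
    counts = Counter(labels)
    n = len(labels)
    return [(lab, c, c / n) for lab, c in sorted(counts.items())]

def dominant_labels(cluster_ids, labels):
    """Map each cluster to (most frequent label, share of the cluster it covers)."""
    by_cluster = {}
    for cid, lab in zip(cluster_ids, labels):
        by_cluster.setdefault(cid, Counter())[lab] += 1
    summary = {}
    for cid, counts in sorted(by_cluster.items()):
        lab, c = counts.most_common(1)[0]
        summary[cid] = (lab, c / sum(counts.values()))
    return summary

# Toy labels and made-up cluster assignments (not real results).
labels = [4, 4, 6, 6, 6, 9, 9, 9]
clusters = [0, 0, 0, 1, 1, 2, 2, 2]

table = frequency_table(labels)
summary = dominant_labels(clusters, labels)
for cid, (lab, share) in summary.items():
    print(f"cluster {cid}: dominant label {lab} ({share:.0%} of cluster)")
```

A share close to 1 in every cluster suggests the clusters line up with the labels; shares near the overall class frequencies suggest they do not.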
- [25 Points, 2-3 pages] Binary Classification: Coat vs. Shirt
- Subset both your training and testing data to just Coat vs. Shirt.
- Use at least two different classification models (choosing from
random forests, boosting, logistic regression, SVM) to fit the training data.
- Give details for how you tuned the model. What are your evaluation
criteria?
- Evaluate your model on the testing data. What is the accuracy you
obtained?
- You need to provide sufficient information (table/figure and
descriptions) to demonstrate the model fitting results
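One compact way to present binary test results is a confusion matrix together with the overall accuracy. The sketch below shows the computation only; the predictions are toy values rather than output from any fitted model, and the helper names `confusion_matrix` and `accuracy` are hypothetical.

```python
def confusion_matrix(y_true, y_pred, classes):
    """Rows are true classes, columns are predicted classes, in `classes` order."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy test labels and toy predictions (not real model output).
y_true = ["Coat", "Coat", "Coat", "Shirt", "Shirt", "Shirt"]
y_pred = ["Coat", "Coat", "Shirt", "Shirt", "Shirt", "Coat"]

cm = confusion_matrix(y_true, y_pred, classes=["Coat", "Shirt"])
acc = accuracy(y_true, y_pred)
print(cm)   # [[2, 1], [1, 2]]
print(acc)
```

The off-diagonal cells show which class is mistaken for the other, which accuracy alone does not reveal.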
- [25 Points, 2-3 pages] Multi-class classification
- Use all the data for this question
- You should use at least two different approaches for this task and
one of them should be KNN.
- Report the training time for each model you used. If you used
cross-validation, do not report the cross-validation time. Report only
the time needed to fit the final model with the best tuning parameter.
- Again, give details for how you tuned the model. What are your
evaluation criteria?
- You need to provide sufficient information (table/figure and
descriptions) to demonstrate the model fitting results
- Report your classification accuracy on the testing data. This value
will be used in a competition. This part is worth 5 points.
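Timing only the final model, separately from any tuning, can be sketched as follows. This is a toy illustration: the KNN here is a plain-Python brute-force version on six 2-D points, `best_k` is assumed to have been chosen earlier by cross-validation, and the names `knn_predict` and `timed` are hypothetical. Note that KNN has essentially no fitting step, so its cost falls at prediction time; which time to report for KNN is a judgment call that you should state in your report.

```python
import time
from collections import Counter

def knn_predict(train_x, train_y, query, k):
    """Classify one query point by majority vote among its k nearest neighbours."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(x, query)), y)  # squared Euclidean distance
        for x, y in zip(train_x, train_y)
    )
    votes = Counter(y for _, y in dists[:k])
    return votes.most_common(1)[0][0]

def timed(fn, *args, **kwargs):
    """Run fn once and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Toy 2-D data standing in for the 784-pixel images.
train_x = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
train_y = [0, 0, 0, 1, 1, 1]
test_x = [[0.2, 0.1], [5.5, 5.2]]
test_y = [0, 1]

best_k = 3  # assumed to come from an earlier tuning step that is not timed here
preds, seconds = timed(
    lambda: [knn_predict(train_x, train_y, q, best_k) for q in test_x]
)
test_acc = sum(p == t for p, t in zip(preds, test_y)) / len(test_y)
print(preds, test_acc)
print(f"elapsed: {seconds:.6f} s")
```

The same `timed` wrapper can be placed around the single final fit of any other model, so that the cross-validation loop stays outside the measured interval.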
- [15 Points] General requirements
- Is your report easy to read with clear logic?
- For any model fitting with subjective decisions, provide reasoning
for your decisions.
- Is it written in a manner such that a reader does not need to be
very familiar with the data?
- Are plots, tables, etc. informative and correctly displayed and
compact enough? For example, you should not include a super long table
or a huge figure that takes a whole page.
- Is irrelevant/trivial code/output hidden? Overall, no more
than 1/4 of the space should be used for displaying
code.
Self-Proposed Project Presentations
We have two days of presentations: Apr 28 and May 3. Here is a list
of our teams and their topics.
- [Apr 28 Zoom Recording]
- “Used Car Price Prediction” by Nicholas Choi, Eunjeong Ro and Henu
Park
- “Species Distribution Modeling” by Kun Hu, Vishwadeepsinh Sarvaiya
and Zixuan Wang
- “Microsoft Malware Prediction” by Neha Jain, Chiranjeevi Konduru and
Pallaw Kumar
- “Telecom Churn Clustering Analysis” by Bo Yang, Xiaoying Yang and
Yin Tip Ho
- [May 3 Zoom Recording]
- “Airline Passenger Satisfaction Prediction and Analysis” by Mayank
Agarwal, Yash Bajaj
- “Li-ion Battery Cell Analysis” by Xiangrui Deng, Ke Shao and Danny
Song
- “Credit Card Default Prediction” by Zhaohong Wang and Sunbeom
Kwon
- “Credit Card Fraud Detection” by Yumeng Li and Xinkai Zhao
- “Predicting Career Length for NBA Rookies” by Jack Fletcher
- “Breast Cancer Prediction” by Rishab Kulkarni