Archived Final Project, Spring 2022

Final Project: Fashion MNIST Classification

Timeline
- Release date: Mar 31, 2022
- Due date: 11:59 PM, Thur, May 5th

Data Information
- We will use the Kaggle Fashion MNIST data
- Download the training data: fashion-mnist_train.csv and testing data: fashion-mnist_test.csv
- The data contain 70,000 images of size 28 \(\times\) 28 (60,000 for training and 10,000 for testing)
- This is a multi-class classification problem. The goal is to predict the class labels of the testing data. The project has two parts: the first part uses only two classes and performs binary classification, and the second part uses all classes.
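As a rough illustration of the data layout, the sketch below reads a CSV in which the first column is `label` and the remaining columns are pixel intensities. That layout is an assumption about the Kaggle files, and `load_fashion_csv` is a hypothetical helper, not part of any course material. The class names are the standard Fashion-MNIST label codes. A tiny in-memory stand-in replaces the real file so the example runs on its own.

```python
import csv
import io

# Standard Fashion-MNIST class names, indexed by integer label.
CLASS_NAMES = [
    "T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
    "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot",
]

def load_fashion_csv(handle):
    """Read (label, pixel...) rows from an open text file (hypothetical helper)."""
    reader = csv.reader(handle)
    header = next(reader)  # assumed layout: "label", then the pixel columns
    if header[0] != "label":
        raise ValueError("expected the first column to be 'label'")
    labels, images = [], []
    for row in reader:
        labels.append(int(row[0]))
        images.append([int(v) for v in row[1:]])
    return labels, images

# Tiny stand-in with 4 pixels per image instead of 784.
demo = io.StringIO(
    "label,pixel1,pixel2,pixel3,pixel4\n"
    "4,0,12,255,3\n"
    "6,7,0,0,90\n"
)
labels, images = load_fashion_csv(demo)
names = [CLASS_NAMES[y] for y in labels]
```

With the real data, the same function could be called on a file opened with `open("fashion-mnist_train.csv", newline="")`, provided the file follows the assumed layout.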

Submission Guideline
- Pick one team member as the team lead to submit the full report and supplementary files to Gradescope.
- The full report should be in .pdf format. It should include a cover page and no more than 12 pages of content. You may include supplementary files after the main body of your report; however, they will not be used for grading. Name your file the same way as for homework, e.g., Project_yourNetID.pdf.
- On the cover page of the report, you should clearly indicate all of your team members (their names and NetIDs).
- Other team members should submit a one-page file (same as the cover page), which indicates only the members of the team and their team lead. DO NOT let multiple team members submit the same/different final reports. Doing so will lead to a 5-point penalty for all of your team members, and if those reports are different, the lowest score will be used.

Project Report Requirement

[5 Points, 1 page] Project description and summary. This part should summarize your goal, approach, and conclusion.
[10 Points, 1 page] Literature review. The Fashion-MNIST dataset is widely used in the literature to benchmark machine learning models. Search for the relevant literature on this dataset. Then do the following:
- To the best of your knowledge after these searches, what is the best accuracy on this dataset so far?
- Read two papers/reports among the ones you searched. Summarize their approaches in language that can be easily understood. Provide appropriate citations and links to these papers.
[20 Points, 1-2 pages] Summary statistics, data processing and unsupervised learning.
- Provide a frequency table of the outcome variable in both training and testing data.
- Perform a PCA of the data and mark the points with their labels. Since the dataset is large, you need to present the results in a way that is readable.
- Apply a clustering algorithm to the training data. How many clusters did you decide to use? What is the dominant \(y\) label in each of these clusters? Do your clusters help to separate these labels?
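The frequency table and the dominant label per cluster can both be computed by simple counting. The sketch below is one possible way to do this, assuming the cluster assignments already come from whatever clustering algorithm was chosen. The function names and the toy labels are made up for illustration.

```python
from collections import Counter, defaultdict

def frequency_table(labels):
    """Return {label: count}, sorted by label."""
    return dict(sorted(Counter(labels).items()))

def dominant_label_per_cluster(cluster_ids, labels):
    """For each cluster, return (most common label, its share of the cluster)."""
    by_cluster = defaultdict(Counter)
    for c, y in zip(cluster_ids, labels):
        by_cluster[c][y] += 1
    summary = {}
    for c, counts in sorted(by_cluster.items()):
        label, n = counts.most_common(1)[0]
        summary[c] = (label, n / sum(counts.values()))
    return summary

# Made-up labels and cluster assignments, only to exercise the helpers.
labels = [4, 4, 6, 6, 6, 4]
cluster_ids = [0, 0, 0, 1, 1, 1]

freq = frequency_table(labels)
summary = dominant_label_per_cluster(cluster_ids, labels)
```

A share close to 1 in every cluster would suggest that the clusters separate the labels well; shares near the overall label frequencies would suggest they do not.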
[25 Points, 2-3 pages] Binary Classification: Coat vs. Shirt
- Subset both your training and testing data to just Coat vs. Shirt
- Use at least two different classification models (choosing from random forests, boosting, logistic, SVM) to fit the training data.
- Give details on how you tuned the model. What are your evaluation criteria?
- Evaluate your model on the testing data. What is the accuracy you obtained?
- You need to provide sufficient information (table/figure and descriptions) to demonstrate the model fitting results
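The subsetting and evaluation steps above can be sketched as follows. This assumes the standard Fashion-MNIST label codes (4 for Coat, 6 for Shirt). The helper names are hypothetical, and the predictions are made up rather than produced by a fitted model.

```python
from collections import Counter

COAT, SHIRT = 4, 6  # standard Fashion-MNIST label codes

def subset_two_classes(labels, images, keep=(COAT, SHIRT)):
    """Keep only the rows whose label is in `keep`."""
    kept = [(y, x) for y, x in zip(labels, images) if y in keep]
    return [y for y, _ in kept], [x for _, x in kept]

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def confusion_counts(y_true, y_pred):
    """Count each (true label, predicted label) pair."""
    return Counter(zip(y_true, y_pred))

# Made-up labels and 2-pixel "images", only to exercise the helpers.
labels = [4, 0, 6, 6, 9, 4]
images = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 0], [2, 2]]
y_sub, x_sub = subset_two_classes(labels, images)

# Hypothetical predictions standing in for a fitted classifier's output.
y_pred = [4, 6, 4, 4]
acc = accuracy(y_sub, y_pred)
table = confusion_counts(y_sub, y_pred)
```

Reporting the confusion counts alongside the accuracy shows which of the two classes is misclassified more often, which a single accuracy number hides.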
[25 Points, 2-3 pages] Multi-class classification
- Use all the data for this question
- You should use at least two different approaches for this task and one of them should be KNN.
- Report the training time for each model you used. If you used cross-validation, do not report the cross-validation time. Report only the time to fit the model with the best tuning parameter.
- Again, give details on how you tuned the model. What are your evaluation criteria?
- You need to provide sufficient information (table/figure and descriptions) to demonstrate the model fitting results
- Report your classification accuracy on the testing data. This value will be used for a competition among teams. This part is worth 5 points.
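To make the KNN and timing requirements concrete, here is a minimal pure-Python sketch of a majority-vote KNN classifier, timed only for the run with the chosen `k`. The toy data, the value `best_k = 3`, and the function name are assumptions for illustration. A plain loop like this would be far too slow on 60,000 images of 784 pixels, so a vectorized library implementation would be needed in practice.

```python
import time
from collections import Counter

def knn_predict(train_x, train_y, query, k=3):
    """Majority vote among the k nearest training points (squared Euclidean)."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(x, query)), y)
        for x, y in zip(train_x, train_y)
    )
    votes = Counter(y for _, y in dists[:k])
    return votes.most_common(1)[0][0]

# Made-up 2-dimensional data with two well-separated classes.
train_x = [[0, 0], [0, 1], [1, 0], [9, 9], [9, 8], [8, 9]]
train_y = [4, 4, 4, 6, 6, 6]
test_x = [[1, 1], [8, 8]]
test_y = [4, 6]

best_k = 3  # assumed to have been chosen earlier, e.g. by cross-validation

# Time only the final run with the chosen k, not the tuning that preceded it.
start = time.perf_counter()
preds = [knn_predict(train_x, train_y, q, k=best_k) for q in test_x]
elapsed = time.perf_counter() - start

test_accuracy = sum(p == t for p, t in zip(preds, test_y)) / len(test_y)
```

Note that KNN has essentially no fitting step, so most of the measured time is spent on prediction; it is worth stating in the report which of the two was timed.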
[15 Points] General requirements
- Is your report easy to read with clear logic?
- For any model fitting with subjective decisions, provide reasoning for your decisions.
- Is it written in a manner such that a reader does not need to be very familiar with the data?
- Are plots, tables, etc. informative, correctly displayed, and compact enough? For example, you should not include a very long table or a huge figure that takes a whole page.
- Are irrelevant/trivial code/output hidden? Overall, no more than 1/4 of the space should be used for displaying code.

Self-Proposed Project Presentations

We have two days of presentations: Apr 28 and May 3. Here is a list of our teams and their topics.

[Apr 28 Zoom Recording]
- “Used Car Price Prediction” by Nicholas Choi, Eunjeong Ro and Henu Park
- “Species Distribution Modeling” by Kun Hu, Vishwadeepsinh Sarvaiya and Zixuan Wang
- “Microsoft Malware Prediction” by Neha Jain, Chiranjeevi Konduru and Pallaw Kumar
- “Telecom Churn Clustering Analysis” by Bo Yang, Xiaoying Yang and Yin Tip Ho
[May 3 Zoom Recording]
- “Airline Passenger Satisfaction Prediction and Analysis” by Mayank Agarwal and Yash Bajaj
- “Li-ion Battery Cell Analysis” by Xiangrui Deng, Ke Shao and Danny Song
- “Credit Card Default Prediction” by Zhaohong Wang and Sunbeom Kwon
- “Credit Card Fraud Detection” by Yumeng Li and Xinkai Zhao
- “Predicting Career Length for NBA Rookies” by Jack Fletcher
- “Breast Cancer Prediction” by Rishab Kulkarni

Ruoqing Zhu
