Final Project Overview
You have two options to complete the final project.
- [Option 1]: I will provide you with a dataset (usually large with complex structures) and you will be asked to perform the following:
- A literature review
- A series of unsupervised learning, classification and regression tasks
- Present and interpret your results
- [Option 2]: You can also propose a data analytic problem yourself
- The data can come from any public data repository, but the problem needs to be sufficiently complex and the data should not be too small
- You need my approval. To do this, you and your team need to set up a meeting with me no later than Mar 31st to discuss your analysis goal and plan before you start to work on the project
- Meet/communicate with me at least one more time to report your progress
- Do a 15 min presentation of your results (most likely during the last two weeks).
- You still need to complete a report similar to [Option 1] to summarize your results.
For both options, each team can have up to three team members. You should consider forming a team early. Here are the previous final projects, with some presentations. Please note that some of the policies in previous years may not apply to this semester. Make sure that you read the current semester’s requirements carefully.
Final Project
- Timeline
- Release date: Apr 2, 2022
- Due date: 11:59 PM, Thur, May 5th
- Data Information
- We will use the Kaggle Fashion MNIST data
- Download the training data: fashion-mnist_train.csv and testing data: fashion-mnist_test.csv
- All model training should be done on the training data, and results should be reported on the testing data.
- The data contains 70,000 28 \(\times\) 28 images (60,000 as training and 10,000 as testing)
- This is a multi-class classification problem. The goal is to predict the class label in the testing data.
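As a minimal sketch of reading one of these files and recovering the 28 \(\times\) 28 images, the Python snippet below may help. It assumes (this is not stated above) that each CSV has a header row, the class label in the first column, and 784 pixel columns after it. The function name `load_split` and the in-memory demo data are hypothetical; replace the demo with the path to a real file.

```python
import io
import numpy as np

def load_split(source):
    """Read one CSV split.

    ASSUMPTION: header row, label in column 0, then 784 pixel columns.
    `source` can be a file path or an open file-like object.
    """
    raw = np.loadtxt(source, delimiter=",", skiprows=1, ndmin=2)
    y = raw[:, 0].astype(int)   # class labels
    X = raw[:, 1:]              # one flattened image per row
    assert X.shape[1] == 28 * 28, "expected 784 pixel columns"
    return X, y

# Tiny in-memory stand-in for a real file such as fashion-mnist_train.csv.
header = "label," + ",".join(f"pixel{i}" for i in range(1, 785))
rows = [",".join([str(k)] + ["0"] * 784) for k in (3, 7)]
X_demo, y_demo = load_split(io.StringIO("\n".join([header] + rows)))

# Reshape each 784-vector back into a 28 x 28 image.
images = X_demo.reshape(-1, 28, 28)
print(images.shape, y_demo)
```

Keeping the training and testing splits in separate variables from the start makes it harder to accidentally tune a model on the testing data.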
- Submission Guideline
- Pick one team member as the team lead to submit the full report and supplementary files to Gradescope.
- The full report should be in .pdf format. It should include a cover page and no more than 12 pages of content. You are allowed to include supplementary files after the main body of your report. However, they will not be used for grading. Name your file the same way as homework, e.g., Project_yourNetID.pdf.
- On the cover page of the report, you should clearly indicate all of your team members (their names and NetID).
- Other team members should submit a one-page file (same as the cover page), which indicates only the members of the team and their team lead. DO NOT let multiple team members submit the same/different final reports. Doing so will lead to a 5-point penalty for all of your team members, and if those reports are different, the lowest score will be used.
Project Report Requirement
- [5 Points, 1 page] Project description and summary. This part should summarize your goal, approach, and conclusion.
- [10 Points, 1 page] Literature review. The Fashion-MNIST dataset is a popular dataset used in the literature to benchmark machine learning models. Search (e.g., with Google) for the relevant literature on this dataset. Then do the following:
- To the best of your knowledge after these searches, what is the best accuracy on this dataset so far?
- Read two papers/reports among the ones you searched. Summarize their approaches in language that can be easily understood. Provide appropriate citations and links to these papers.
- [20 Points, 1-2 pages] Summary statistics, data processing and unsupervised learning.
- Provide a frequency table of the outcome variable in both training and testing data.
- Apply two clustering algorithms to the training data. How many clusters did you decide to use? What is the dominating \(y\) label in each of these clusters? Do your clusters help to separate these labels?
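The frequency table and the "dominating label per cluster" summary can be sketched in a few lines of Python. This is only an illustration on toy data: the function names and the toy lists are hypothetical, and the cluster assignments would in practice come from whichever clustering algorithms you choose.

```python
from collections import Counter

def frequency_table(labels):
    """Return {label: count}, sorted by label."""
    return dict(sorted(Counter(labels).items()))

def dominant_label_per_cluster(labels, clusters):
    """For each cluster id, return (dominant label, share of that label in the cluster)."""
    by_cluster = {}
    for y, c in zip(labels, clusters):
        by_cluster.setdefault(c, Counter())[y] += 1
    summary = {}
    for c, counts in sorted(by_cluster.items()):
        label, n = counts.most_common(1)[0]
        summary[c] = (label, n / sum(counts.values()))
    return summary

# Toy data standing in for the real labels and cluster assignments.
y_toy = [0, 0, 0, 1, 1, 2, 2, 2]
cluster_toy = [0, 0, 1, 1, 1, 2, 2, 2]

print(frequency_table(y_toy))
print(dominant_label_per_cluster(y_toy, cluster_toy))
```

A share close to 1 in every cluster suggests the clusters separate the labels well; shares near 1 / (number of classes) suggest they do not.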
- [20 Points, 1-3 pages] Multi-class Classification Model
- We have learned several different models that can perform multi-class classification. In this question, choose two of them and properly tune these models. Report the overall classification error and provide sufficient information (table/figure and descriptions) to demonstrate the model fitting results.
- Many models we learned are only for binary classification. Try to extend one of them to handle multi-class problems. You can either search the existing literature to implement the method (or use their packages) or write your own code based on an idea you have. Clearly describe your method framework, demonstrate the results and compare them with the two previous models.
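One common way to extend a binary classifier is one-vs-rest: fit one binary model per class (that class against all others) and predict the class whose model gives the highest score. The Python sketch below illustrates this with a plain gradient-descent logistic regression on toy data. It is only one possible approach, not the required one; all function names, the learning rate, the iteration count and the toy blobs are assumptions made for illustration.

```python
import numpy as np

def add_intercept(X):
    """Prepend a column of ones."""
    return np.hstack([np.ones((X.shape[0], 1)), X])

def fit_binary_logistic(X, y01, lr=0.1, n_iter=500):
    """Binary logistic regression (labels in {0, 1}) fit by gradient descent."""
    Xb = add_intercept(X)
    w = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(Xb @ w)))      # predicted P(y = 1)
        w -= lr * (Xb.T @ (p - y01)) / len(y01)  # gradient step on the mean log-loss
    return w

def fit_one_vs_rest(X, y):
    """Fit one 'class k vs. everything else' model per class."""
    classes = np.unique(y)
    W = np.column_stack(
        [fit_binary_logistic(X, (y == k).astype(float)) for k in classes]
    )
    return classes, W

def predict_one_vs_rest(X, classes, W):
    """Predict the class whose binary model gives the largest linear score."""
    return classes[np.argmax(add_intercept(X) @ W, axis=1)]

def toy_blobs(n_per_class=50, seed=0):
    """Three well-separated 2-D Gaussian blobs (toy stand-in for image features)."""
    rng = np.random.default_rng(seed)
    centers = np.array([[0.0, 0.0], [5.0, 0.0], [0.0, 5.0]])
    X = np.vstack([c + 0.5 * rng.standard_normal((n_per_class, 2)) for c in centers])
    y = np.repeat(np.arange(3), n_per_class)
    return X, y

X_train, y_train = toy_blobs(seed=0)
X_test, y_test = toy_blobs(seed=1)
classes, W = fit_one_vs_rest(X_train, y_train)
test_error = np.mean(predict_one_vs_rest(X_test, classes, W) != y_test)
print("test classification error:", test_error)
```

Other extensions, such as one-vs-one voting, follow the same pattern of combining several binary fits; whichever you pick, describe how the binary outputs are combined into a single label.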
- [30 Points, 2-4 pages] Ensemble Model and Feature Engineering
- The models we learned in this course could all be restrictive and limited to certain types of data structure. Hence it could be beneficial to build multiple models and combine the information together in a joint model. The basic idea is to perform this in two stages:
- In the first stage, fit several different models to the data, and record their outputs (this could be the predicted label, scalar outputs, or anything that you think would be useful).
- In the second stage, build a new model that uses the outputs of the previous stage as input features and predicts the label.
- As you can see, there is a great amount of flexibility in this model building process. And the idea is to utilize and possibly combine advantages from different models. Make your own choice on what models to use in different stages. But your model must satisfy the following:
- The inputs in your second stage (i.e., outputs from your first stage) should come from no more than four different models in the first stage (these models can be either supervised or unsupervised)
- The number of inputs in the second stage should be no more than 20
- You need to clearly describe how the models are built. For example, how you extract the outputs from the first stage models, how these models were tuned, etc. You may also need to consider the computational cost involved in this process. Provide sufficient information (table/figure and descriptions) to demonstrate the model fitting results.
- Report the overall classification error on the testing data. The accuracy will be used as a competition among all groups. This part accounts for 5 points of the total score.
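The two-stage idea above can be sketched as follows in Python. This is a toy illustration under several assumptions, not a recommended design: the two first-stage models (nearest-centroid distances and ridge regression on one-hot labels), the ridge second stage, the use of out-of-fold first-stage outputs, and all function names are choices made here for brevity. With 10 classes, two first-stage models producing one score per class would give exactly 20 second-stage inputs.

```python
import numpy as np

def centroid_features(X_fit, y_fit, X_new, classes):
    """Stage-1 model A: negative distance from each row to each class centroid."""
    centroids = [X_fit[y_fit == k].mean(axis=0) for k in classes]
    return np.column_stack([-np.linalg.norm(X_new - c, axis=1) for c in centroids])

def ridge_features(X_fit, y_fit, X_new, classes, lam=1.0):
    """Stage-1 model B: ridge regression on one-hot labels, one score per class.

    For simplicity the intercept is penalized as well.
    """
    add1 = lambda X: np.hstack([np.ones((X.shape[0], 1)), X])
    A = add1(X_fit)
    Y = (y_fit[:, None] == classes[None, :]).astype(float)
    B = np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ Y)
    return add1(X_new) @ B

def out_of_fold(feature_fn, X, y, classes, n_folds=5, seed=0):
    """Stage-1 outputs for the training rows, each computed by a model
    that did not see that row (assumed here to limit overfitting in stage 2)."""
    folds = np.random.default_rng(seed).permutation(len(y)) % n_folds
    Z = np.zeros((len(y), len(classes)))
    for f in range(n_folds):
        held = folds == f
        Z[held] = feature_fn(X[~held], y[~held], X[held], classes)
    return Z

def stacked_predict(X_train, y_train, X_test):
    classes = np.unique(y_train)
    stage1 = (centroid_features, ridge_features)  # at most four models allowed
    Z_train = np.hstack([out_of_fold(fn, X_train, y_train, classes) for fn in stage1])
    Z_test = np.hstack([fn(X_train, y_train, X_test, classes) for fn in stage1])
    assert Z_train.shape[1] <= 20, "second stage may use at most 20 inputs"
    # Stage 2: a ridge classifier on the stacked stage-1 outputs.
    scores = ridge_features(Z_train, y_train, Z_test, classes)
    return classes[np.argmax(scores, axis=1)]

def toy_blobs(n_per_class=50, seed=0):
    """Three well-separated 2-D Gaussian blobs (toy stand-in for image features)."""
    rng = np.random.default_rng(seed)
    centers = np.array([[0.0, 0.0], [5.0, 0.0], [0.0, 5.0]])
    X = np.vstack([c + 0.5 * rng.standard_normal((n_per_class, 2)) for c in centers])
    y = np.repeat(np.arange(3), n_per_class)
    return X, y

X_tr, y_tr = toy_blobs(seed=0)
X_te, y_te = toy_blobs(seed=1)
stack_error = np.mean(stacked_predict(X_tr, y_tr, X_te) != y_te)
print("stacked test classification error:", stack_error)
```

Note that the testing data is only passed through models fit on the training data; it is never used to fit or tune either stage.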
- [15 Points] General requirements
- Is your report easy to read with clear logic?
- For any model fitting with subjective decisions, provide reasoning for your decisions.
- Is it written in a manner such that a reader does not need to be very familiar with the data?
- Are plots, tables, etc. informative and correctly displayed and compact enough? For example, you should not include a super long table or a huge figure that takes a whole page.
- Are irrelevant/trivial code/output hidden? Overall, no more than 1/4 of the space should be used to display code.