Final Project Overview
You have two options to complete the final project. Graduate students must choose
[Option 2]. The project is due by 11:59 PM, Thursday, Dec 11th.
- [Option 1] I will provide you with a dataset
(usually large with complex structures). You will be asked to perform
the following:
- A brief literature review
- A series of unsupervised learning, classification and regression
tasks
- Present and interpret your results
- You are encouraged to schedule a meeting with me before the end of
Nov to review and discuss the progress of your project
- You can volunteer to present your project by the end of the
semester. However, this will not affect your final score. You must
decide before 5 PM, Nov 1st.
- [Option 2] You can also propose a data analytic
problem yourself
- Mandatory for graduate students (4 Credit Hours), but undergraduates
can also use this option
- The data can come from a public data repository, but it should be
sufficiently large and complex
- The goal of this project cannot be a simple classification or
regression problem
- You need my approval. To do this, you and your team
need to set up a meeting with me no
later than Oct 24th to discuss your analysis goal and
the plan before you start working on the project.
- Once approved, use this [google spreadsheet]
- Project update: Your team needs to report its progress to me
at least one more time before the final presentation. You can also
schedule meetings with me during this process.
- In addition to submitting the final project report,
which should have a format similar to that of [Option 1], your
team must also give a 15-minute in-class presentation of your
project. The presentation will be scheduled during the last two
weeks.
For both options, each team can have up to three team
members. You should think about forming a team early. Here are
several previous final projects, with some presentations. Please note
that some of the policies in previous years may not apply to
this semester. Make sure that you read the current semester’s
requirements carefully.
Final Project [Option 1]: Housing Prices Prediction
- Timeline
- Release Date: October 29th, 2025
- Due Date: 11:59 PM, Thursday, December 11th,
2025
- No late submissions will be
accepted
- Data Information
- Submission Guidelines
- Designate one team member as the team lead to
submit the complete report via Gradescope. Use the
group submission feature and list all team
members.
- The full report must be submitted in .pdf format. It should consist
of a cover page, a main body not exceeding 12 pages, and possibly
supplementary sections if needed.
- Clearly indicate the names and NetIDs of all team members on the
cover page of the report.
- Supplementary sections can extend the report to a maximum of 20
pages to display additional figures, tables, or results supporting your
findings.
Project Report Requirement
- [5 Points, 1 page] Project Description and Abstract
- Provide a concise summary of your project, including its goal,
approach, and key findings, similar to an abstract of a paper.
- Highlight the main idea behind your feature engineering
strategy — what new feature(s) did you design, and what
intuition motivated it?
- Briefly mention any specialized methods or learning algorithms used
in your analysis, especially if they extend beyond the techniques
covered in class.
- Include the names of all team members and their NetIDs.
- Provide a brief statement of AI usage in your project (e.g., using
AI tools for code or grammar checking). Follow Elsevier’s generative AI
guidelines.
- [10 Points, 1 page] Literature Review
- Use Google Scholar to find at least two research papers related to
housing price prediction, feature engineering
in tabular data, or anything relevant to your project.
- Summarize the main findings and methodologies of these studies.
Provide proper citations and URLs. Make sure that you use proper
academic citation format.
- Comment on whether their approaches to feature engineering or model
interpretation inspired your project.
- [20 Points, 2–3 pages] Data Processing and Summary
Statistics
- The dataset contains many categorical variables, text variables, and
missing values, and some variables probably contain extreme values.
Describe how you processed these variables.
- Select some representative examples of your data processing steps
(e.g., handling missing values, encoding categorical variables) and
explain your rationale. You do not need to document every single
variable.
- Provide summary statistics (tables, frequency plots, histograms) for
a few variables that you think are important for predicting
SalePrice.
- If your feature design or data transformation draws on prior
studies, cite them properly.
- You may revisit this step later to refine features based on model or
clustering results; document any changes clearly.
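The processing steps listed above can be sketched on a toy example. This is a minimal illustration using only the Python standard library, not a prescribed pipeline: the rows, the column names other than SalePrice, and the capping thresholds are made up, and in practice a data-frame library would usually handle these steps.

```python
import math
from statistics import median

# Toy records; every column except SalePrice is a hypothetical stand-in.
rows = [
    {"LotArea": 8450.0, "Quality": "Good", "SalePrice": 208500.0},
    {"LotArea": 9600.0, "Quality": "Fair", "SalePrice": 181500.0},
    {"LotArea": None, "Quality": None, "SalePrice": 223500.0},
    {"LotArea": 11250.0, "Quality": "Good", "SalePrice": 140000.0},
    {"LotArea": 215000.0, "Quality": "Excellent", "SalePrice": 375000.0},
]


def impute_median(values):
    """Replace None with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def cap(values, lower, upper):
    """Clip values into [lower, upper] to limit the influence of extremes."""
    return [min(max(v, lower), upper) for v in values]


def one_hot(values, missing_label="Missing"):
    """One-hot encode a categorical column, treating None as its own level."""
    filled = [missing_label if v is None else v for v in values]
    levels = sorted(set(filled))
    return levels, [[1 if v == level else 0 for level in levels] for v in filled]


lot_area = impute_median([r["LotArea"] for r in rows])
# Illustrative thresholds (an assumption); percentiles of the data are a common choice.
lot_area = cap(lot_area, lower=1000.0, upper=50000.0)
levels, quality_dummies = one_hot([r["Quality"] for r in rows])
# A log transform is one common way to reduce right skew in prices.
log_price = [math.log(r["SalePrice"]) for r in rows]
```

Treating a missing category as its own level is only one option; whether it is appropriate depends on why the value is missing, which is worth stating in the report.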
- [15 Points, 2–3 pages] Unsupervised Learning
- Apply at least three clustering algorithms to explore patterns in
the data. You may choose to use all or a subset of the features.
- Justify your choice of algorithms and any preprocessing steps (e.g.,
scaling, dimensionality reduction), including parameter settings.
- Discuss how the categorical variables are used or mixed with others
(including the continuous ones) when you do this step.
- Present visualizations of your clustering results and discuss the
interpretability of clusters — for example, do they separate
neighborhoods, house styles, or other meaningful subgroups?
- Examine the relationship between your clusters and the target
variable (SalePrice).
- Summarize how insights from unsupervised learning could guide
further feature engineering or model building.
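As one possible starting point for the steps above, the sketch below standardizes two numeric features, runs a plain k-means (Lloyd's algorithm), and compares the mean SalePrice across clusters. The data and the feature names living_area and house_age are invented for illustration, k-means is only one of the algorithms you might choose, and a library implementation would normally replace the hand-written loop.

```python
import random
from statistics import mean, pstdev


def standardize(column):
    """Scale one numeric column to mean 0 and standard deviation 1."""
    mu, sd = mean(column), pstdev(column)
    return [(v - mu) / sd for v in column]


def kmeans(points, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm; returns a cluster label for each point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: nearest center by squared Euclidean distance.
        labels = [
            min(range(k), key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster empties out
                centers[j] = tuple(mean(dim) for dim in zip(*members))
    return labels


# Hypothetical toy data: two visibly different groups of houses.
living_area = [900.0, 1000.0, 1100.0, 2400.0, 2500.0, 2600.0]
house_age = [50.0, 45.0, 55.0, 5.0, 10.0, 8.0]
sale_price = [120000.0, 130000.0, 125000.0, 310000.0, 330000.0, 320000.0]

# Scaling first keeps the feature with the larger units from dominating distances.
points = list(zip(standardize(living_area), standardize(house_age)))
labels = kmeans(points, k=2)

# Relate clusters to the target: mean SalePrice within each cluster.
price_by_cluster = {
    lab: mean([p for p, l in zip(sale_price, labels) if l == lab])
    for lab in sorted(set(labels))
}
```

Euclidean distance on standardized columns only makes sense for numeric features; categorical variables need an explicit encoding or a different dissimilarity, which is the choice the report is asked to discuss.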
- [30 Points, 3–4 pages] Prediction Models
- Implement at least five regression models to predict
SalePrice, tuning their parameters appropriately.
- At least four models must be ones introduced in class. You may
include one new model if you describe it in detail.
- All linear/penalized linear models are treated as the same type of
model; KNN and Nadaraya–Watson are likewise treated as one type.
- For all non-linear models (e.g., tree-based or kernel methods),
explain how you control model complexity.
- Report performance using cross-validation and clearly state
evaluation metrics (e.g., RMSE, \(R^2\)). Tune the models to optimize
performance, and state clearly which metrics you use for tuning and
which for final evaluation.
- Highlight which features, especially your constructed
interpretable feature(s), contribute most strongly to
predictive performance.
- Discuss their real-world meaning — how do these features explain
variation in housing prices?
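The tuning and evaluation workflow described above can be illustrated with a small sketch: a one-feature KNN regressor whose number of neighbors is chosen by 5-fold cross-validated RMSE, with \(R^2\) computed as a second metric. Everything here is an assumption made for illustration (the synthetic data, the fold assignment by index, the candidate values of k); it is not the required implementation, and a library would normally provide these pieces.

```python
import math
from statistics import mean


def knn_predict(train_x, train_y, x, k):
    """Predict y at x as the mean response of the k nearest training points."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:k]
    return mean([train_y[i] for i in nearest])


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(mean([(t - p) ** 2 for t, p in zip(y_true, y_pred)]))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_bar = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def cv_predictions(x, y, k, n_folds=5):
    """Out-of-fold predictions: each point is predicted without using its own fold."""
    preds = [None] * len(x)
    for fold in range(n_folds):
        train_idx = [i for i in range(len(x)) if i % n_folds != fold]
        test_idx = [i for i in range(len(x)) if i % n_folds == fold]
        tx = [x[i] for i in train_idx]
        ty = [y[i] for i in train_idx]
        for i in test_idx:
            preds[i] = knn_predict(tx, ty, x[i], k)
    return preds


# Synthetic stand-in data: a hypothetical living area and a noisy linear price.
x = [float(a) for a in range(800, 2800, 100)]
y = [50000 + 100 * a + (3000 if i % 2 else -3000) for i, a in enumerate(x)]

# Tuning metric: cross-validated RMSE over a small grid of neighbor counts.
cv_rmse = {k: rmse(y, cv_predictions(x, y, k)) for k in (1, 3, 5, 9)}
best_k = min(cv_rmse, key=cv_rmse.get)

# Second metric, computed on the same out-of-fold predictions.
best_r2 = r_squared(y, cv_predictions(x, y, best_k))
```

Here the number of neighbors is the complexity control: small k gives a flexible, high-variance fit and large k a smoother one. Note that the cross-validated score of the selected k tends to be slightly optimistic, since the same folds were used to pick it.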
- [10 Points, 1–2 pages] Feature Engineering Challenge
(Open-Ended Question)
- Design and implement one interpretable, data-driven
feature that you believe captures an important hidden aspect of
housing value. This feature can be derived by using any supervised or
unsupervised learning techniques.
- Justify your method and interpretation: What does this feature
represent? How can it be understood in the context of housing
markets?
- Demonstrate how including this feature affects model performance.
The feature does not need to significantly improve prediction accuracy,
but it should provide meaningful insights.
- Discuss advantages, limitations, and potential extensions of your
constructed feature. Provide citations if applicable.
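To make the "with versus without the feature" comparison concrete, here is a hedged sketch of one candidate feature: an out-of-fold neighborhood price level (a form of target encoding). The neighborhood labels, the prices, and the fold scheme are all invented, and this is only an example of the kind of feature one might build, not the expected answer. Computing the feature out of fold matters because a row's own SalePrice would otherwise leak into its feature value.

```python
import math
from statistics import mean

# Toy data; the neighborhood labels and prices are made up for illustration.
neighborhood = ["A", "A", "A", "A", "B", "B", "B", "B", "C", "C", "C", "C"]
sale_price = [100.0, 110.0, 105.0, 95.0, 200.0, 210.0, 190.0, 205.0,
              300.0, 310.0, 290.0, 305.0]


def out_of_fold_group_mean(groups, target, n_folds=4):
    """For each row, average the target over same-group rows in the other folds."""
    feature = []
    for i, g in enumerate(groups):
        others = [t for j, (h, t) in enumerate(zip(groups, target))
                  if h == g and j % n_folds != i % n_folds]
        # Fall back to the out-of-fold overall mean when the group is unseen.
        fallback = [t for j, t in enumerate(target) if j % n_folds != i % n_folds]
        feature.append(mean(others) if others else mean(fallback))
    return feature


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(mean([(t - p) ** 2 for t, p in zip(y_true, y_pred)]))


# Constructed feature: typical price of the row's neighborhood, excluding its own fold.
nbhd_price = out_of_fold_group_mean(neighborhood, sale_price)

# Baseline without the feature: the out-of-fold overall mean (a single group).
baseline = out_of_fold_group_mean(["all"] * len(sale_price), sale_price)

# Simplest possible comparison: use each quantity directly as the prediction.
rmse_without = rmse(sale_price, baseline)
rmse_with = rmse(sale_price, nbhd_price)
```

In a real analysis the feature would be added to the full set of predictors and the comparison made with the same cross-validation scheme used for the prediction models, rather than by using the feature alone as the prediction.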
- [1-5 Bonus Points] Bonus: Submit to Kaggle
- Apply your best model to the test set provided by Kaggle and submit
your predictions to the competition.
- Include your Kaggle username and the achieved score (with a screenshot)
in your report.
- If your team ranks in the top 20 on the leaderboard, each team
member will receive 5 bonus points. Teams in the top 100 will receive 3
bonus points, and all others will receive 1 bonus point.
- [10 Points] General Requirements
- Clarity, organization, and logical flow.
- All modeling decisions are justified and reproducible.
- Report should be understandable to someone unfamiliar with the
dataset.
- Figures and tables are clear, concise, readable, and appropriately
sized. Pay attention to captions, axis labels, legends, and color
schemes.
- Proper citations for all external sources, including papers, posts
from Kaggle, code snippets, and datasets.
- Code displayed in the report should be minimized to only the essential
parts. You can provide a complete version of your code as a
supplementary file (append it after your final report) if needed.
Self-Proposed Project [Option 2] Presentations
Other Announcements