Final Project Overview

You have two options to complete the final project. Graduate students must choose [Option 2]. The project is due by 11:59 PM on Thursday, Dec 11th.

For both options, each team can have up to three members, so think about forming a team early. Here are several previous final projects, with some presentations. Please note that some policies from previous years may not apply this semester. Make sure that you read the current semester’s requirements carefully.


Final Project [Option 1]: Housing Prices Prediction

Project Report Requirement

  • [5 Points, 1 page] Project Description and Abstract
    • Provide a concise summary of your project, including its goal, approach, and key findings, similar to an abstract of a paper.
    • Highlight the main idea behind your feature engineering strategy — what new feature(s) did you design, and what intuition motivated it?
    • Briefly mention any specialized methods or learning algorithms used in your analysis, especially if they extend beyond the techniques covered in class.
    • Include the names of all team members and their NetIDs.
    • Provide a brief statement of AI usage in your project (e.g., using AI tools for code, grammar checking, etc.). Follow Elsevier’s generative AI guidelines.
  • [10 Points, 1 page] Literature Review
    • Use Google Scholar to find at least two research papers related to housing price prediction, feature engineering in tabular data, or anything relevant to your project.
    • Summarize the main findings and methodologies of these studies. Provide proper citations and URLs. Make sure that you use proper academic citation format.
    • Comment on whether their approaches to feature engineering or model interpretation inspired your project, and if so, how.
  • [20 Points, 2–3 pages] Data Processing and Summary Statistics
    • The dataset contains many categorical variables, text variables, and missing values, and some variables probably contain extreme values. Describe how you processed these variables.
    • Select some representative examples of your data processing steps (e.g., handling missing values, encoding categorical variables) and explain your rationale. You do not need to document every single variable.
    • Provide summary statistics (tables, frequency plots, histograms) for a few variables that you think are important for predicting SalePrice.
    • If your feature design or data transformation draws on prior studies, cite them properly.
    • You may revisit this step later to refine features based on model or clustering results; document any changes clearly.
  • [15 Points, 2–3 pages] Unsupervised Learning
    • Apply at least three clustering algorithms to explore patterns in the data. You may choose to use all or a subset of the features.
    • Justify your choice of algorithms and any preprocessing steps (e.g., scaling, dimensionality reduction), including parameter settings.
    • Discuss how the categorical variables are used in this step and how they are combined with the other variables, including the continuous ones.
    • Present visualizations of your clustering results and discuss the interpretability of clusters — for example, do they separate neighborhoods, house styles, or other meaningful subgroups?
    • Examine the relationship between your clusters and the target variable (SalePrice).
    • Summarize how insights from unsupervised learning could guide further feature engineering or model building.
  • [30 Points, 3–4 pages] Prediction Models
    • Implement at least five regression models to predict SalePrice, tuning their parameters appropriately.
    • At least four models must be ones introduced in class. You may include one new model if you describe it in detail.
    • All linear and penalized linear models count as one type of model; KNN and Nadaraya–Watson also count as one type.
    • For all non-linear models (e.g., tree-based or kernel methods), explain how you control model complexity.
    • Report performance using cross-validation and clearly state your evaluation metrics (e.g., RMSE, \(R^2\)). Tune the models to optimize performance, and state which metrics you use for tuning and which for final evaluation.
    • Highlight which features, especially your constructed interpretable feature(s), contribute most strongly to predictive performance.
    • Discuss their real-world meaning — how do these features explain variation in housing prices?
  • [10 Points, 1–2 pages] Feature Engineering Challenge (Open-Ended Question)
    • Design and implement one interpretable, data-driven feature that you believe captures an important hidden aspect of housing value. This feature can be derived by using any supervised or unsupervised learning techniques.
    • Justify your method and interpretation: What does this feature represent? How can it be understood in the context of housing markets?
    • Demonstrate how including this feature affects model performance. It does not need to significantly improve prediction accuracy, but it should provide meaningful insights.
    • Discuss advantages, limitations, and potential extensions of your constructed feature. Provide citations if applicable.
  • [1-5 Bonus Points] Bonus: Submit to Kaggle
    • Apply your best model to the test set provided by Kaggle and submit your predictions to the competition.
    • Include your Kaggle username and the achieved score (with a screenshot) in your report.
    • If your team ranks in the top 20 on the leaderboard, each team member will receive 5 bonus points. Teams in the top 100 will receive 3 bonus points. All other teams that submit will receive 1 bonus point.
  • [10 Points] General Requirements
    • Clarity, organization, and logical flow.
    • All modeling decisions are justified and reproducible.
    • Report should be understandable to someone unfamiliar with the dataset.
    • Figures and tables are clear, concise, readable, and appropriately sized. Pay attention to details such as captions, axis labels, legends, and color schemes.
    • Proper citations for all external sources, including papers, posts from Kaggle, code snippets, and datasets.
    • Code displayed in the report should be minimized to only the essential parts. If needed, you can provide the complete code as a supplementary file appended after your final report.
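
The Prediction Models item asks you to tune by cross-validation and to state which metric is used for tuning and which for final evaluation. The sketch below illustrates one possible way to keep the two apart. It is not a required workflow. Everything in it is an assumption made for illustration: the data are synthetic (not the housing dataset), ridge regression stands in for whichever model you tune, the penalty grid is arbitrary, and the helper names (`ridge_fit`, `cv_rmse`, and so on) are hypothetical. The penalty is chosen by 5-fold cross-validated RMSE on the training portion only, and RMSE and \(R^2\) are then reported on a hold-out set that played no role in tuning.

```python
import numpy as np


def ridge_fit(X, y, lam):
    """Closed-form ridge fit; the intercept column is left unpenalized."""
    Xb = np.column_stack([np.ones(len(X)), X])
    penalty = lam * np.eye(Xb.shape[1])
    penalty[0, 0] = 0.0
    return np.linalg.solve(Xb.T @ Xb + penalty, Xb.T @ y)


def ridge_predict(X, beta):
    return np.column_stack([np.ones(len(X)), X]) @ beta


def rmse(y, yhat):
    return float(np.sqrt(np.mean((y - yhat) ** 2)))


def r2(y, yhat):
    return float(1.0 - np.sum((y - yhat) ** 2) / np.sum((y - y.mean()) ** 2))


def cv_rmse(X, y, lam, k=5, seed=0):
    """Mean validation RMSE over k folds (this is the tuning metric)."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, k)
    scores = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        beta = ridge_fit(X[train], y[train], lam)
        scores.append(rmse(y[val], ridge_predict(X[val], beta)))
    return float(np.mean(scores))


# Synthetic stand-in data (assumption): 8 features, only 3 of them informative.
rng = np.random.default_rng(42)
n, p = 300, 8
X = rng.normal(size=(n, p))
true_beta = np.array([3.0, -2.0, 1.5, 0.0, 0.0, 0.0, 0.0, 0.0])
y = 10.0 + X @ true_beta + rng.normal(scale=1.0, size=n)

# Hold out the last 60 rows; they are never used for tuning.
X_train, y_train = X[:240], y[:240]
X_test, y_test = X[240:], y[240:]

# Tuning: pick the penalty with the lowest cross-validated RMSE.
lambdas = [0.01, 0.1, 1.0, 10.0, 100.0]
cv_scores = {lam: cv_rmse(X_train, y_train, lam) for lam in lambdas}
best_lam = min(cv_scores, key=cv_scores.get)

# Final evaluation: refit on all training rows, score once on the hold-out set.
beta = ridge_fit(X_train, y_train, best_lam)
test_pred = ridge_predict(X_test, beta)
test_rmse = rmse(y_test, test_pred)
test_r2 = r2(y_test, test_pred)

print(f"selected lambda: {best_lam}")
print(f"hold-out RMSE: {test_rmse:.3f}, hold-out R^2: {test_r2:.3f}")
```

With real data you would typically standardize the features before applying a ridge penalty, fitting the scaler inside each training fold so that no information from the validation fold leaks into tuning.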

Self-Proposed Project [Option 2] Presentations


Other Announcements