Final Project Overview
You have two options to complete the final project. Graduate students must choose
[Option 2]. The project is due by 11:59 PM, Thursday, Dec 12th.
- [Option 1] I will provide you with a dataset
(usually large with complex structures). You will be asked to perform
the following:
- A brief literature review
- A series of unsupervised learning, classification and regression
tasks
- Present and interpret your results
- You are encouraged to schedule a meeting with me before the end of
Nov to review and discuss the progress of your project
- You can volunteer to present your project by the end of the
semester. However, this will not affect your final score. The decision
needs to be made before 5 PM, Nov 1st.
- [Option 2] You can also propose a data analytic
problem yourself
- Mandatory for graduate students (4 Credit Hours), but undergraduates
can also use this option
- The data can come from a public data repository, but it should be
sufficiently large and complex
- The goal of this project cannot be a simple classification or
regression problem
- You need my approval. To obtain it, you and your team need to set up
a meeting with me no later than Oct 25th to discuss your analysis goal
and plan before you start working on the project.
- Project update: Your team needs to report your progress to me at
least twice before the final presentation. You can also schedule
meetings with me during this process.
- In addition to submitting the final project report,
which should have a format similar to that of [Option 1], your
team must also give a 15-minute in-class presentation of your
project. The presentations will be scheduled during the last two
weeks.
For both options, each team can have up to three team
members. You should think about forming a team early. Here are
several previous final projects, with some presentations. Please note
that some of the policies in previous years may not apply to
this semester. Make sure that you read the current semester’s
requirements carefully.
Final Project [Option 1]: Used Car Price Prediction
- Timeline
- Release Date: October 28th, 2024
- Due Date: 11:59 PM, Thursday, December 12th, 2024
- No late submissions will be accepted
- Data Information
- Submission Guidelines
- Designate one team member as the team lead to
submit the complete report via Gradescope. Use the
group submission feature and list all team
members.
- The full report must be submitted in .pdf format. It
should consist of a cover page, a main body not exceeding 12
pages, and possibly supplementary sections if needed.
- Clearly indicate the names and NetIDs of all team members on the
cover page of the report.
- Supplementary sections can extend the report to a maximum of 20
pages to display additional figures, tables, or results supporting your
findings.
Project Report Requirement
- [5 Points, 1 page] Project Description and
Abstract:
- Provide a concise summary of your project, including its goal,
approach, and key findings, similar to an abstract.
- Briefly mention any unique or specialized methods used in your
analysis, especially if they extend beyond the techniques covered in
this course. Explain how these methods differ from those reviewed in the
Literature Review section.
- Include the names of all team members and their NetIDs.
- Provide a brief statement of AI usage
in your project. For example, was any AI tool used in completing the
code, correcting grammar errors, etc.? You should be aware that some
behaviors are prohibited by the university’s academic integrity policy.
For a general guideline, I would refer to this
statement by Elsevier.
- [10 Points, 1 page] Literature Review
- Use Google Scholar to find research papers on used car price
prediction, noting that these studies may use different datasets.
- Summarize the main findings and approaches from at least two
selected papers. Provide full citations and URLs for each paper.
- [20 Points, 2-3 pages] Data Processing and Summary
Statistics:
- Many variables in this dataset are in text format. For example, the
engine variable may contain information such as horsepower,
the number of cylinders, and fuel type. Process these variables to
extract relevant information and create new variables for analysis.
- Describe each data processing step, including your reasoning and any
software packages used.
- Some variables also contain missing values. Discuss how you handled
these missing values and why you chose that approach.
- After constructing your final dataset, provide a table of summary
statistics (search online if you’re unfamiliar with this concept) for
the key variables. You can also consider using frequency tables,
histograms, etc. to summarize the variables efficiently.
- If your approach to creating new variables was inspired by the
literature, include proper citations.
- You may revisit this step after the unsupervised and supervised
learning phase to refine your data processing. If so, clearly explain
the changes made and why they were necessary.
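As an illustration only, the text-processing and missing-value steps above might be sketched as follows in Python. The engine string layout (for example `300.0HP 3.7L V6 Cylinder Engine Gasoline Fuel`), the helper names `parse_engine` and `impute_median`, and the choice of median imputation are all assumptions made for this sketch. They are not a description of the provided dataset or a required approach.

```python
import re
from statistics import median

def parse_engine(text):
    """Pull horsepower, displacement, cylinder count and fuel type out of a free-text engine field."""
    fields = {"hp": None, "liters": None, "cylinders": None, "fuel": None}
    if not text:
        return fields
    hp = re.search(r"(\d+(?:\.\d+)?)\s*HP", text, flags=re.IGNORECASE)
    liters = re.search(r"(\d+(?:\.\d+)?)\s*L\b", text, flags=re.IGNORECASE)
    cyl = re.search(r"\bV(\d+)\b|\b(\d+)\s*Cylinder", text, flags=re.IGNORECASE)
    fuel = re.search(r"Gasoline|Diesel|Hybrid|Electric", text, flags=re.IGNORECASE)
    if hp:
        fields["hp"] = float(hp.group(1))
    if liters:
        fields["liters"] = float(liters.group(1))
    if cyl:
        fields["cylinders"] = int(cyl.group(1) or cyl.group(2))
    if fuel:
        fields["fuel"] = fuel.group(0).lower()
    return fields

def impute_median(rows, key):
    """Fill missing values of `key` with the observed median and record a missingness flag."""
    fill = median(r[key] for r in rows if r[key] is not None)
    for r in rows:
        r[key + "_missing"] = r[key] is None
        if r[key] is None:
            r[key] = fill
    return fill

# Hypothetical engine strings, made up for this example.
samples = [
    "300.0HP 3.7L V6 Cylinder Engine Gasoline Fuel",
    "2.0L I4 16V GDI DOHC Turbo",
    "Electric Motor",
    "172.0HP 1.6L 4 Cylinder Engine Gasoline Fuel",
]
rows = [parse_engine(s) for s in samples]
hp_fill = impute_median(rows, "hp")
for r in rows:
    print(r)
```

Keeping a missingness flag next to the imputed value is one way to let later models use the fact that a value was absent. Whether median imputation is sensible for a given variable is a judgment call that the report should justify.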
- [15 Points, 2-3 pages] Unsupervised Learning
- Apply at least three clustering algorithms to the processed
dataset.
- Determine the appropriate number of clusters and discuss the
interpretability of these clusters. Do they hold any meaningful
distinctions?
- Examine whether the clustering results are associated with your
outcome variable.
- Summarize insights from the clustering results. How could they be
useful for your supervised learning steps?
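One possible way to choose the number of clusters is to compare the average silhouette width across candidate values of k. The sketch below does this for a plain k-means on toy two-dimensional data. The data, the farthest-first initialisation, and all function names are assumptions for illustration. A real analysis would use the processed dataset and compare several clustering algorithms, as required above.

```python
import math
import random

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def farthest_first(points, k):
    """Deterministic start: take points[0], then repeatedly add the point farthest from the chosen centers."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist(p, c) for c in centers)))
    return centers

def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm; returns one cluster label per point."""
    centers = farthest_first(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: dist(p, centers[j])) for p in points]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels

def silhouette(points, labels):
    """Average silhouette width; points in singleton clusters score 0."""
    scores = []
    for i, p in enumerate(points):
        by_cluster = {}
        for j, q in enumerate(points):
            if i != j:
                by_cluster.setdefault(labels[j], []).append(dist(p, q))
        own = by_cluster.pop(labels[i], None)
        if not own or not by_cluster:
            scores.append(0.0)
            continue
        a = sum(own) / len(own)                                 # mean distance within own cluster
        b = min(sum(d) / len(d) for d in by_cluster.values())   # nearest other cluster
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Toy data: three well-separated groups of 20 points each (made up for this example).
rng = random.Random(0)
blob_centers = [(0.0, 0.0), (10.0, 0.0), (5.0, 9.0)]
points = [(cx + rng.gauss(0, 0.5), cy + rng.gauss(0, 0.5))
          for cx, cy in blob_centers for _ in range(20)]

scores = {k: silhouette(points, kmeans(points, k)) for k in range(2, 7)}
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

The silhouette is only one criterion. A number of clusters that scores well may still lack a meaningful interpretation, so it should be checked against the variables and the outcome.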
- [30 Points, 3-4 pages] Prediction Models
- Implement at least five different regression models to predict the
price and tune their parameters appropriately.
- You may use models beyond those covered in this course. However, at
least four models should be ones introduced in class. If you incorporate
a new model, provide a detailed description of it.
- All linear models (e.g., OLS) and penalized linear models (Lasso,
Ridge, Elastic Net) will be treated as the same model. KNN and
Nadaraya-Watson kernel estimators are treated as the same model. If you
use KNN or NW Kernel Estimators, you should discuss what distance metric
is used, especially for the categorical variables.
- Tune each model carefully, and clearly explain the tuning process
for each. State your evaluation criteria and, if multiple criteria are
used, discuss their advantages and disadvantages.
- Provide sufficient information, including tables, figures, and
explanations, to illustrate the model-fitting results. Which model
appears to perform best?
- Identify the variable(s) you constructed that seem to be most
predictive, and provide interpretations of their impact on the
model.
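The distance-metric and tuning points above can be made concrete with a small sketch. It uses a Gower-style distance (range-scaled absolute difference for numeric variables, 0/1 mismatch for categorical ones) inside a KNN regressor, and picks k by 5-fold cross-validated RMSE. The toy data, variable names (`age`, `fuel`), and helper functions are assumptions for illustration, not the course dataset or a prescribed method.

```python
import math
import random

def numeric_ranges(rows, numeric_keys):
    """Range (max - min) of each numeric variable, used to scale differences."""
    ranges = {}
    for key in numeric_keys:
        values = [r[key] for r in rows]
        ranges[key] = (max(values) - min(values)) or 1.0  # avoid dividing by zero
    return ranges

def gower_distance(a, b, ranges):
    """Average of per-variable dissimilarities over all variables in `a`."""
    total = 0.0
    for key, va in a.items():
        if key in ranges:
            total += abs(va - b[key]) / ranges[key]    # numeric variable
        else:
            total += 0.0 if va == b[key] else 1.0      # categorical variable
    return total / len(a)

def knn_predict(train_X, train_y, x, k, ranges):
    """Mean outcome of the k nearest training rows."""
    order = sorted(range(len(train_X)),
                   key=lambda i: gower_distance(train_X[i], x, ranges))
    return sum(train_y[i] for i in order[:k]) / k

def cv_rmse(X, y, k, numeric_keys, n_folds=5):
    """K-fold cross-validated RMSE; ranges are recomputed on each training fold."""
    sq_err = 0.0
    for fold in range(n_folds):
        test_idx = set(range(fold, len(X), n_folds))
        tr_X = [X[i] for i in range(len(X)) if i not in test_idx]
        tr_y = [y[i] for i in range(len(X)) if i not in test_idx]
        ranges = numeric_ranges(tr_X, numeric_keys)
        for i in test_idx:
            sq_err += (knn_predict(tr_X, tr_y, X[i], k, ranges) - y[i]) ** 2
    return math.sqrt(sq_err / len(X))

# Toy data, made up for this example: price falls with age, hybrids cost more.
rng = random.Random(0)
X, y = [], []
for _ in range(60):
    age = rng.randint(0, 15)
    fuel = rng.choice(["gasoline", "diesel", "hybrid"])
    X.append({"age": age, "fuel": fuel})
    y.append(30000 - 1500 * age + (3000 if fuel == "hybrid" else 0) + rng.gauss(0, 1000))

candidates = [1, 3, 5, 9, 15]
rmse_by_k = {k: cv_rmse(X, y, k, ["age"]) for k in candidates}
best_k = min(rmse_by_k, key=rmse_by_k.get)
print(rmse_by_k, best_k)
```

RMSE penalizes large errors heavily, which matters for expensive cars. A criterion such as mean absolute error or error on the log-price scale would weight the observations differently, and the report should say which is used and why.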
- [10 Points, 1 page] Open-Ended Question
- A researcher is interested in estimating the original price of the
cars in your dataset as if they were brand new. How would you approach
this problem?
- Since your dataset lacks information on new car prices, some form of
extrapolation may be necessary (but feel free to explore alternative
ideas). Discuss the challenges and limitations your approach may
face.
- Perform this prediction (using just one model is sufficient) by
selecting three cars from your dataset and estimating their price as if
they were new. Search online for the original release prices of these
cars and compare these with your predictions. Discuss any discrepancies
and potential reasons for these differences.
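As one hedged example of the extrapolation idea, a log-linear depreciation curve log(price) = a + b * age can be fitted by least squares and evaluated at age 0. The sketch below uses a single predictor and noiseless toy prices so that the mechanics are easy to check. The variable names, the toy numbers, and the assumption of a constant depreciation rate are illustrative only.

```python
import math

def fit_log_linear(ages, prices):
    """Least-squares fit of log(price) = a + b * age; returns (a, b)."""
    logs = [math.log(p) for p in prices]
    n = len(ages)
    mean_x = sum(ages) / n
    mean_y = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in ages)
    sxy = sum((x - mean_x) * (yv - mean_y) for x, yv in zip(ages, logs))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

def predict_price(a, b, age):
    """Back-transform the fitted line to the price scale."""
    return math.exp(a + b * age)

# Toy data, made up for this example: a 30,000 car losing about 12% of its value per year.
ages = [1, 2, 3, 5, 8, 10]
prices = [30000 * math.exp(-0.12 * t) for t in ages]

a, b = fit_log_linear(ages, prices)
as_new = predict_price(a, b, 0)  # extrapolation: age 0 lies outside the observed ages
print(round(as_new, 2), round(b, 4))
```

Because no car in the toy data has age 0, the prediction relies entirely on the assumed functional form. If depreciation in the first year is steeper than in later years, as is often reported, an extrapolation of this kind would tend to understate the new price.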
- [10 Points] General requirements
- Is your report easy to read with clear logic?
- For any model fitting with subjective decisions, provide reasoning
for your decisions.
- Is it written in a manner such that a reader does not need to be
very familiar with the data?
- Are plots, tables, etc. informative, correctly displayed, and
compact enough? For example, you should not include a super long table
or a huge figure that takes a whole page.
- Are irrelevant/trivial code/output hidden? Overall, no more
than 2 pages of the space should be used for displaying
code.
Self-Proposed Project (Option 2) Presentations
- Section: Dec 3, 11:00 - 12:20
- Predicting the level of problematic internet usage exhibited
by children and adolescents based on their physical activity by
Nidhi Baheti, Jyot Buch and Bhavana Sanghi. [Data
Source]
- A blend of tunes and analytics, reflecting the combination
of music and data analysis by Aysu Maharramli, Aadya Ranjan and
Shruti Umakant Gharate. [Data
Source]
- Drug recommendation via efficacy and adverse event
modeling by Amitabh Swain and Aniket Tathe. [Data Source 1] [Data
Source 2]
- How Do Mental Health and Healthcare Access Shape Diabetes
Risk? A Study of Understudied Populations Using Latent Profiles and
Factor Analysis by Ash Sharma, Joe Li and Samyak Pokharna. [Data
Source]
- Dec 3, 12:30 - 1:50
- Predicting Tweet Locations: A Machine Learning Approach to
Text Analysis by Ramanan Srirajan, Bochuan Zhang and Vladislav
Fedorov. [Data
Source]
- Effect of Early Vasopressor on mortality of sepsis shock
patients by Anshika Pradhan and Alarsh Tiwari. [Data Source]
- The Impact of California’s Proposition 47 on Misdemeanor
Crime in Los Angeles by Zirui Pang, Qiyang Wang and Wan Wen.
[Data Source 1] [Data Source 2]
- Enhancing Targeted Marketing Predictions with Imbalance
Adjustment Methods by Ying-Han Kao and Hanqi Tang. [Data
Source]
- Dec 5, 11:00 - 12:20
- Improving Popularity of Los Angeles AirBnb
Properties by Nay Petrucelli and Mehmet Korkmaz. [Data
Source]
- Building violations and risk preventing Chicago by
Sreeman Etikyala and Viswanath Vadlamani
- Jet engine maintenance and optimization by Tanmay
Shikhare and Mohini Nath. [Data
Source]
- Title TBD by Zijian Wang and Jingyi Chen
- Dec 5, 12:30 - 1:50
- Assessing the Causal Impact of various features to predict
Loan Defaulters by Mubeen Hasan and Sarthak Morj. [Data
Source]
- Predicting Corporate Longevity and Bankruptcy Risk: A
Survival Analysis Approach by Napaton Prasertthum, Riki Komano
and Leo Yu. [Data
Source]
- Music popularity prediction and the effect to mental
health by Ziwei Li and Jinghang Zhou. [Data
Source 1] [Data
Source 2]
- A Comparative Study of Imputation Methods and Model Stacking
Strategies on Credit Risk Prediction by Weifeng Liu and Mingjun
Kong. [Data
Source]
Other Announcements