Final Project

Final Project Overview

You have two options for completing the final project. Graduate students must choose [Option 2]. The project is due by 11:59 PM, Thursday, Dec 12th.

[Option 1] I will provide you with a dataset (usually large with complex structures). You will be asked to perform the following:
- A brief literature review
- A series of unsupervised learning, classification and regression tasks
- Present and interpret your results
- You are encouraged to schedule a meeting with me before the end of November to review and discuss the progress of your project
- You can volunteer to present your project by the end of the semester; this will not affect your final score. You need to make this decision before 5 PM, Nov 1st.
[Option 2] You can also propose a data analytic problem yourself
- Mandatory for graduate students (4 credit hours), but undergraduates can also use this option
- The data can come from a public data repository, but it should be sufficiently large and complex
- The goal of this project cannot be a simple classification or regression problem
- You need my approval. To do this, you and your team need to set up a meeting with me no later than Oct 25th to discuss your analysis goal and the plan before you start working on the project.
- Project update: your team needs to report your progress to me at least twice before the final presentation. You can also schedule meetings with me during this process.
- In addition to submitting the final project report, which should follow a format similar to that of [Option 1], your team must also give a 15-minute in-class presentation of your project. Presentations will be scheduled during the last two weeks.

For both options, each team can have up to three members. You should think about forming a team early. Here are several previous final projects, with some presentations. Please note that some of the policies from previous years may not apply to this semester. Make sure that you read the current semester’s requirements carefully.

Final Project [Option 1]: Used Car Price Prediction

Timeline
- Release Date: October 28th, 2024
- Due Date: 11:59 PM, Thursday, December 12th, 2024
- No late submission will be accepted
Data Information
- We will utilize the Used Car Price Prediction Dataset from Kaggle
- The dataset consists of 4,009 records of used car information, with nine features
Submission Guidelines
- Designate one team member as the team lead to submit the complete report via Gradescope. Use the group submission feature and list all team members.
- The full report must be submitted in .pdf format. It should consist of a cover page, a main body not exceeding 12 pages, and possibly supplementary sections if needed.
- Clearly indicate the names and NetIDs of all team members on the cover page of the report.
- Supplementary sections can extend the report to a maximum of 20 pages to display additional figures, tables, or results supporting your findings.

Project Report Requirement

[5 Points, 1 page] Project Description and Abstract:
- Provide a concise summary of your project, including its goal, approach, and key findings, similar to an abstract.
- Briefly mention any unique or specialized methods used in your analysis, especially if they extend beyond the techniques covered in this course. Explain how these methods differ from those reviewed in the Literature Review section.
- Include the names of all team members and their NetIDs.
- Provide a brief statement of AI usage in your project. For example, was any AI tool used in completing the code, correcting grammar errors, etc.? You should be aware that some behaviors are prohibited by the university’s academic integrity policy. For general guidance, refer to this statement by Elsevier.
[10 Points, 1 page] Literature Review
- Use Google Scholar to find research papers on used car price prediction, noting that these studies may use different datasets.
- Summarize the main findings and approaches from at least two selected papers. Provide full citations and URLs for each paper.
[20 Points, 2-3 pages] Data Processing and Summary Statistics:
- Many variables in this dataset are in text format. For example, the engine variable may contain information such as horsepower, the number of cylinders, and fuel type. Process these variables to extract relevant information and create new variables for analysis.
- Describe each data processing step, including your reasoning and any software packages used.
- Some variables also contain missing values. Discuss how you handled these missing values and why you chose that approach.
- After constructing your final dataset, provide a table of summary statistics (search online if you’re unfamiliar with this concept) for the key variables. You can also consider using frequency tables, histograms, etc. to summarize the variables efficiently.
- If your approach to creating new variables was inspired by the literature, include proper citations.
- You may revisit this step after the unsupervised and supervised learning phase to refine your data processing. If so, clearly explain the changes made and why they were necessary.
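
As an illustration of the text-processing step, here is a minimal sketch of pulling numeric fields out of an engine description with regular expressions. The function name `parse_engine`, the output keys, and the example string layout ("300.0HP 3.7L V6 Cylinder Engine ...") are assumptions made for illustration; inspect the actual column before relying on these patterns, and treat this as one possible approach rather than the expected solution.

```python
import re
from typing import Dict, Optional

# Patterns assume strings shaped like "300.0HP 3.7L V6 Cylinder Engine ...".
# The real column may contain other layouts; check it before using these.
HP_RE = re.compile(r"(\d+(?:\.\d+)?)\s*HP", re.IGNORECASE)
LITER_RE = re.compile(r"(\d+(?:\.\d+)?)\s*(?:L|Liter)\b", re.IGNORECASE)
CYL_RE = re.compile(r"\bV(\d+)\b|(\d+)\s*Cylinder", re.IGNORECASE)


def parse_engine(text: str) -> Dict[str, Optional[float]]:
    """Extract horsepower, displacement (liters), and cylinder count.

    A field that cannot be found is returned as None, so missing values
    stay explicit and can be handled in a later, documented step.
    """
    hp = HP_RE.search(text)
    liters = LITER_RE.search(text)
    cyl = CYL_RE.search(text)
    return {
        "horsepower": float(hp.group(1)) if hp else None,
        "displacement_l": float(liters.group(1)) if liters else None,
        # group(1) is the "V6" form, group(2) is the "4 Cylinder" form
        "cylinders": int(cyl.group(1) or cyl.group(2)) if cyl else None,
    }


# Hypothetical example strings, not rows copied from the dataset.
example = parse_engine("300.0HP 3.7L V6 Cylinder Engine Flex Fuel Capability")
no_match = parse_engine("Electric Motor")
```

Returning `None` instead of a default value keeps the missing-value decision separate from the parsing step, which makes it easier to describe and justify in the report.
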
[15 Points, 2-3 pages] Unsupervised Learning
- Apply at least three clustering algorithms to the processed dataset.
- Determine the appropriate number of clusters and discuss the interpretability of these clusters. Do they hold any meaningful distinctions?
- Examine whether the clustering results are associated with your outcome variable.
- Summarize insights from the clustering results. How could they be useful for your supervised learning steps?
[30 Points, 3-4 pages] Prediction Models
- Implement at least five different regression models to predict the price and tune their parameters appropriately.
- You may use models beyond those covered in this course. However, at least four models should be ones introduced in class. If you incorporate a new model, provide a detailed description of it.
- All linear models (e.g., OLS) and penalized linear models (Lasso, Ridge, Elastic Net) will be treated as the same model. KNN and Nadaraya-Watson kernel estimators are treated as the same model. If you use KNN or NW Kernel Estimators, you should discuss what distance metric is used, especially for the categorical variables.
- Tune each model carefully, and clearly explain the tuning process for each. State your evaluation criteria and, if multiple criteria are used, discuss their advantages and disadvantages.
- Provide sufficient information, including tables, figures, and explanations, to illustrate the model-fitting results. Which model appears to perform best?
- Identify the variable(s) you constructed that seem to be most predictive, and provide interpretations of their impact on the model.
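
For the KNN distance-metric discussion, one common choice for mixed numeric and categorical features is a Gower-style distance: numeric differences are scaled by the feature's range, and categorical features contribute 0 for a match and 1 for a mismatch. The sketch below is only an illustration of that idea; the function names, feature names, and toy values are hypothetical, and other metrics (for example one-hot encoding with Euclidean distance) are equally defensible if justified.

```python
from typing import Dict, List, Mapping, Tuple, Union

Value = Union[float, str]
Car = Mapping[str, Value]


def gower_distance(a: Car, b: Car, numeric_ranges: Mapping[str, float]) -> float:
    """Average per-feature dissimilarity between two records.

    Features listed in numeric_ranges are treated as numeric and scaled by
    their range; all other features are treated as categorical (0/1 mismatch).
    """
    total = 0.0
    for key in a:
        if key in numeric_ranges:
            rng = numeric_ranges[key]
            total += abs(float(a[key]) - float(b[key])) / rng if rng > 0 else 0.0
        else:
            total += 0.0 if a[key] == b[key] else 1.0
    return total / len(a)


def knn_predict(query: Car, train: List[Tuple[Car, float]],
                numeric_ranges: Mapping[str, float], k: int) -> float:
    """Predict the price as the mean price of the k nearest training records."""
    ranked = sorted(train, key=lambda row: gower_distance(query, row[0], numeric_ranges))
    nearest = ranked[:k]
    return sum(price for _, price in nearest) / len(nearest)


# Toy records with made-up values, for illustration only.
ranges: Dict[str, float] = {"mileage": 200000.0}
train = [
    ({"mileage": 40000.0, "fuel": "Gasoline"}, 30000.0),
    ({"mileage": 60000.0, "fuel": "Gasoline"}, 26000.0),
    ({"mileage": 150000.0, "fuel": "Hybrid"}, 12000.0),
]
query = {"mileage": 50000.0, "fuel": "Gasoline"}
prediction = knn_predict(query, train, ranges, k=2)
```

In a real analysis, the ranges would be computed from the training data only, and `k` would be tuned by cross-validation along with the other model parameters.
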
[10 Points, 1 page] Open-Ended Question
- A researcher is interested in estimating the original price of the cars in your dataset as if they were brand new. How would you approach this problem?
- Since your dataset lacks information on new car prices, some form of extrapolation may be necessary (but feel free to explore alternative ideas). Discuss the challenges and limitations your approach may face.
- Perform this prediction (using just one model is sufficient) by selecting three cars from your dataset and estimating their price as if they were new. Search online for the original release prices of these cars and compare these with your predictions. Discuss any discrepancies and potential reasons for these differences.
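
One possible way to frame the extrapolation, shown here only as a hedged sketch, is to assume roughly exponential depreciation, fit log(price) as a linear function of age, and read off the fitted value at age zero. The function name `fit_log_linear` and the synthetic data are assumptions for illustration; a real analysis would also need to consider mileage, model-specific effects, inflation, and the fact that age zero lies outside the observed data.

```python
import math
from typing import Sequence, Tuple


def fit_log_linear(ages: Sequence[float], prices: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least squares fit of log(price) = intercept + slope * age."""
    n = len(ages)
    logs = [math.log(p) for p in prices]
    mean_x = sum(ages) / n
    mean_y = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in ages)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(ages, logs))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Synthetic data: a car that costs 40000 when new and keeps 85% of its value
# each year. These numbers are invented to check the code, not real prices.
ages = [1.0, 2.0, 3.0, 5.0, 8.0]
prices = [40000.0 * 0.85 ** age for age in ages]

intercept, slope = fit_log_linear(ages, prices)
estimated_new_price = math.exp(intercept)   # fitted price at age 0
yearly_retention = math.exp(slope)          # fraction of value kept per year
```

On noise-free synthetic data the fit recovers the generating values; on real listings the estimate at age zero depends heavily on the depreciation assumption, which is exactly the kind of limitation the discussion should address.
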
[10 Points] General requirements
- Is your report easy to read with clear logic?
- For any model fitting with subjective decisions, provide reasoning for your decisions.
- Is it written in a manner such that a reader does not need to be very familiar with the data?
- Are plots, tables, etc. informative, correctly displayed, and compact? For example, you should not include a very long table or a huge figure that takes up a whole page.
- Is irrelevant or trivial code and output hidden? Overall, no more than 2 pages should be used to display code.

Self-Proposed Project (Option 2) Presentations

Dec 3, 11:00 - 12:20
- Predicting the level of problematic internet usage exhibited by children and adolescents based on their physical activity by Nidhi Baheti, Jyot Buch and Bhavana Sanghi. [Data Source]
- A blend of tunes and analytics, reflecting the combination of music and data analysis by Aysu Maharramli, Aadya Ranjan and Shruti Umakant Gharate. [Data Source]
- Drug recommendation via efficacy and adverse event modeling by Amitabh Swain and Aniket Tathe. [Data Source 1] [Data Source 2]
- How Do Mental Health and Healthcare Access Shape Diabetes Risk? A Study of Understudied Populations Using Latent Profiles and Factor Analysis by Ash Sharma, Joe Li and Samyak Pokharna. [Data Source]
Dec 3, 12:30 - 1:50
- Predicting Tweet Locations: A Machine Learning Approach to Text Analysis by Ramanan Srirajan, Bochuan Zhang and Vladislav Fedorov. [Data Source]
- Effect of Early Vasopressor on mortality of sepsis shock patients by Anshika Pradhan and Alarsh Tiwari. [Data Source]
- The Impact of California’s Proposition 47 on Misdemeanor Crime in Los Angeles by Zirui Pang, Qiyang Wang and Wan Wen. [Data Source 1] [Data Source 2]
- Enhancing Targeted Marketing Predictions with Imbalance Adjustment Methods by Ying-Han Kao and Hanqi Tang. [Data Source]
Dec 5, 11:00 - 12:20
- Improving Popularity of Los Angeles AirBnb Properties by Nay Petrucelli and Mehmet Korkmaz. [Data Source]
- Building violations and risk preventing Chicago by Sreeman Etikyala and Viswanath Vadlamani
- Jet engine maintenance and optimization by Tanmay Shikhare and Mohini Nath. [Data Source]
- Title TBD by Zijian Wang and Jingyi Chen
Dec 5, 12:30 - 1:50
- Assessing the Causal Impact of various features to predict Loan Defaulters by Mubeen Hasan and Sarthak Morj. [Data Source]
- Predicting Corporate Longevity and Bankruptcy Risk: A Survival Analysis Approach by Napaton Prasertthum, Riki Komano and Leo Yu. [Data Source]
- Music popularity prediction and the effect to mental health by Ziwei Li and Jinghang Zhou. [Data Source 1] [Data Source 2]
- A Comparative Study of Imputation Methods and Model Stacking Strategies on Credit Risk Prediction by Weifeng Liu and Mingjun Kong. [Data Source]

Last Updated: December 2, 2024
