Final Project: Spaceship Titanic
- Timeline
- Release date: Oct 10th, 2022
- Due date: 11:59 PM, Thur, Dec 8th
- Data Information
- We will use the data from the Kaggle Spaceship Titanic Prediction Competition.
- Have at least one of your team members register a Kaggle account and
download the data. Before downloading the data, Kaggle will ask you to
read and accept the rules of this competition.
- There are three files: train.csv, test.csv, and sample_submission.csv.
We will mainly use the training
data file. However, you may also choose to submit your results to Kaggle
to participate in the competition and earn extra credit. Details are
provided in the report requirement section.
- Submission Guideline
- Pick one team member as the team lead to submit the full
report and supplementary files to Gradescope.
- The full report should be in .pdf format. It should
include a cover page and no more than 12 pages of
content. You are allowed to leave some analysis results in the
supplementary files. However, they will not be used for grading. Name
your file the same way as homework, e.g., Project_yourNetID.pdf.
- On the cover page of the report, you should clearly
indicate all of your team members (their names and NetIDs).
- Other team members should submit a one-page file (same as the cover
page), which indicates only the members of the team and their team lead.
DO NOT let multiple team members submit the
same/different final reports. Doing so will lead to a 5-point penalty
for all of your team members, and if those reports are different, the
lowest score will be used.
Project Report Requirement
- [5 Points, 1 page] Project description and summary.
This part should summarize your goal, approach, and conclusion.
Highlight any special approaches that you designed or adopted (including any
approaches not covered in our course), and how they differ from
the ones you reviewed (see next part).
- [10 Points, 1-2 pages] Literature review. This
Spaceship Titanic dataset has been posted for quite some time, and many
people have analyzed it. There must be some interesting
ideas, and you may be able to borrow them in your project. For example,
in this discussion post, people share their own solutions and ideas. You may also
search Google for relevant keywords to find other ideas. For example, this
one. Do the following:
- What is the best accuracy on this dataset so far with a known
analysis approach? Note that some teams on the Kaggle
leaderboard have very high accuracy, but they may not reveal their
approach. Consider only the ones with a published approach, and summarize that
approach.
- Read at least 2 other posts/solutions and understand their ideas.
Summarize their approach in a short paragraph and comment on their
advantages and disadvantages.
- For all sources you read, provide appropriate citations and/or
links.
- Discuss how you could utilize their approaches in your own
project.
- [20 Points, 2-3 pages] Data Processing and Summary Statistics
- Some of the variables have special structures. For example, the
cabin variable is a combination of two letters and a number. Process
such variables into a format/version that your analysis algorithms could
use. Clearly describe your approach in a written paragraph.
- The original data contains missing values. Decide on an approach to
address them. If you use any approach not covered in this course,
provide a brief introduction to it, with proper
citations.
- After data processing, provide a frequency table or histogram
for every variable. Are there any outliers you want to address? Any
transformations you want to use? Discuss these issues and clearly
describe your approach if you choose to act on them.
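The processing steps above can be sketched in Python with pandas. This is only an illustration under stated assumptions, not a required approach: the toy rows are made up, the cabin string is assumed to follow the `deck/num/side` layout described on the Kaggle data page, and the helper names `split_cabin` and `impute_simple` are hypothetical. Median and most-frequent-level imputation is just one simple option among many.

```python
import numpy as np
import pandas as pd

# Toy rows standing in for train.csv; the values are made up for illustration.
toy = pd.DataFrame({
    "Cabin": ["B/0/P", "F/12/S", None, "G/7/S"],
    "Age": [39.0, np.nan, 24.0, 31.0],
    "HomePlanet": ["Europa", "Earth", None, "Earth"],
})

def split_cabin(df):
    """Split 'deck/num/side' strings (assumed layout) into three columns.

    Rows with a missing cabin stay missing in all three new columns.
    """
    parts = df["Cabin"].str.split("/", expand=True)
    out = df.drop(columns="Cabin").copy()
    out["Deck"] = parts[0]
    out["CabinNum"] = pd.to_numeric(parts[1])
    out["Side"] = parts[2]
    return out

def impute_simple(df):
    """Fill numeric columns with the median and other columns with the mode."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out

processed = split_cabin(toy)
clean = impute_simple(processed)

# Frequency table for one categorical variable after processing.
freq = clean["HomePlanet"].value_counts()
print(clean)
print(freq)
```

If you later evaluate models with cross-validation, computing the imputation values inside each training fold avoids leaking information from the held-out fold.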
- [15 Points, 2-3 pages] Unsupervised Learning
- Apply at least three clustering algorithms to the training data.
How many clusters do you decide to use?
- Are these clustering results associated with your outcome
variable?
- What can you learn from the clustering results? Are they useful for
your supervised classification models?
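One possible workflow for the questions above, sketched with scikit-learn on synthetic data. The three algorithms (k-means, Ward hierarchical clustering, a Gaussian mixture), the silhouette score as the criterion for the number of clusters, and the helper name `fit_labels` are all choices made for this illustration; the blobs and the `Transported` outcome here are artificial stand-ins for your processed features and label.

```python
import pandas as pd
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the processed numeric features (three separated groups).
X_raw, blob_id = make_blobs(
    n_samples=300, centers=[[-6, -6], [0, 6], [6, -6]], cluster_std=1.0, random_state=0
)
# Artificial binary outcome, used only to show the association check.
y = pd.Series(blob_id == 0, name="Transported")
X = StandardScaler().fit_transform(X_raw)

def fit_labels(name, k):
    """Return cluster labels from one of three algorithms with k clusters."""
    if name == "kmeans":
        return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    if name == "hierarchical":
        return AgglomerativeClustering(n_clusters=k, linkage="ward").fit_predict(X)
    if name == "gmm":
        return GaussianMixture(n_components=k, random_state=0).fit_predict(X)
    raise ValueError(f"unknown algorithm: {name}")

# Silhouette score for each algorithm and each candidate number of clusters.
scores = {
    name: {k: silhouette_score(X, fit_labels(name, k)) for k in range(2, 7)}
    for name in ("kmeans", "hierarchical", "gmm")
}
best_k = {name: max(by_k, key=by_k.get) for name, by_k in scores.items()}

# Cross-tabulate cluster membership against the outcome variable.
labels = fit_labels("kmeans", best_k["kmeans"])
table = pd.crosstab(pd.Series(labels, name="cluster"), y)
print(best_k)
print(table)
```

A cross-tabulation like `table` is one way to judge whether clusters are associated with the outcome. If they are, the cluster label could be tried as an extra feature in the classification models.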
- [35 Points, 3-4 pages] Classification Models
- Fit at least five different classification models to the
training dataset. Note that all linear and penalized linear logistic
models are considered the same model.
- Your approaches are not limited to the ones we learned in this
course. However, at least four models should be the ones we covered. If
you decide to use/propose a new approach, make sure to properly describe
your model.
- You need to tune all methods properly and clearly describe your
tuning process for each model. What are your evaluation criteria? If you
use multiple ones, discuss their pros and cons.
- You need to provide sufficient information (tables/figures and
descriptions) to demonstrate the model fitting results. Which model seems
to perform the best?
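A tuning process of the kind asked for above might look like the following sketch, which uses scikit-learn's `GridSearchCV` with 5-fold stratified cross-validation and accuracy as the criterion. Everything specific here is an assumption for illustration: the data are synthetic, only three of the required five models are shown, and the parameter grids are arbitrary starting points rather than recommended values.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the processed training features and binary outcome.
X, y = make_classification(n_samples=400, n_features=10, n_informative=5, random_state=0)

# Each entry pairs a model with the grid of tuning parameters to search.
# Scaling sits inside the pipeline so it is refit on every training fold.
candidates = {
    "logistic": (
        Pipeline([("scale", StandardScaler()), ("clf", LogisticRegression(max_iter=1000))]),
        {"clf__C": [0.01, 0.1, 1.0, 10.0]},
    ),
    "knn": (
        Pipeline([("scale", StandardScaler()), ("clf", KNeighborsClassifier())]),
        {"clf__n_neighbors": [3, 7, 15, 31]},
    ),
    "random_forest": (
        RandomForestClassifier(n_estimators=50, random_state=0),
        {"max_features": [2, 4, 8], "min_samples_leaf": [1, 5]},
    ),
}

# The same folds are used for every model so the comparison is like-for-like.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
results = {}
for name, (model, grid) in candidates.items():
    search = GridSearchCV(model, grid, cv=cv, scoring="accuracy")
    search.fit(X, y)
    results[name] = {"best_params": search.best_params_, "cv_accuracy": search.best_score_}

for name, res in results.items():
    print(name, res["best_params"], round(res["cv_accuracy"], 3))
```

Changing `scoring` (for example to `"roc_auc"`) lets you compare evaluation criteria. Note that the best cross-validated score found by a grid search is slightly optimistic as an estimate of test performance.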
- [15 Points] General requirements
- Is your report easy to read with clear logic?
- For any model fitting with subjective decisions, provide reasoning
for your decisions.
- Is it written in a manner such that a reader does not need to be
very familiar with the data?
- Are plots, tables, etc. informative and correctly displayed and
compact enough? For example, you should not include a super long table
or a huge figure that takes a whole page.
- Are irrelevant/trivial code/output hidden? Overall, no more
than 1/4 of the space should be used for displaying
code.
- [1-20 Extra Points] Kaggle Competition
- You can submit your predictions on the testing data to Kaggle.
- Providing a screenshot of your ranking on the Kaggle
leaderboard will earn you 1 extra point.
- If your result is ever ranked among the top 50 on the leaderboard (anytime before the
final project due date), your report will receive 10
extra points.
- If your result is ever ranked among the top 5 on the leaderboard (anytime before the
final project due date), your project will receive 20
extra points.
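For the optional Kaggle submission, a prediction file can be assembled along these lines. This is a minimal sketch: the column names `PassengerId` and `Transported` are assumptions that you should check against sample_submission.csv, and the IDs and predictions are made up. The CSV is written to an in-memory buffer here; in practice you would pass a file path to `to_csv`.

```python
import io

import pandas as pd

# Made-up IDs and predictions; in practice these come from test.csv and your model.
passenger_ids = ["0013_01", "0018_01", "0019_01"]
predictions = [True, False, True]

# Assumed column layout; verify it against sample_submission.csv.
submission = pd.DataFrame({"PassengerId": passenger_ids, "Transported": predictions})

buffer = io.StringIO()
submission.to_csv(buffer, index=False)  # index=False keeps only the two columns
csv_text = buffer.getvalue()
print(csv_text)
```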