Final Project: Fetal Health Classification
- Timeline
- Data
- Kaggle: Fetal Health Classification
- Keywords: cardiotocograms, fetal death, fetal heart rate. Searching for these keywords online may help.
- There are two main goals of this analysis.
- [Problem 1] Clustering analysis of cardiotocogram data
- The data contains 21 features extracted from cardiotocogram exams. The goal is to detect potential clusters and properly present/display them to your collaborator. This may involve understanding relevant medical terminology for fetal death and cardiotocograms.
- [Problem 2] Supervised learning to predict fetal_health
- This is a variable with three different classes. Hence you need to choose appropriate models, or adapt existing models, to properly handle the multi-class problem.
- Important Notes
- Understand the data, the background and review literature. This is a very important step before your data analysis.
- Present/interpret the results. You should present and interpret the results in an intuitive way.
- Submission Guideline
- Pick one team member as the team lead to submit the full report and supplementary files to their submission portal (compass2g). On the cover page of the report, you should clearly indicate all of your team members (their names and NetIDs).
- The full report should be in .pdf format. It should include a cover page and no more than 12 pages of content. Name your file the same way as the homework, e.g., Project_yourNetID.pdf.
- Other team members should submit a one-page file (the cover page), which indicates only the members of the team and their team lead. DO NOT let multiple team members submit the same or different final reports. Doing so will lead to a 5-point penalty for all of your team members, and if those reports are different, the lowest score will be used.
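For Problem 2 above, one common way to adapt a binary model to a three-class target such as fetal_health is a one-vs-rest decomposition: fit one "this class vs. everything else" scorer per class and predict the class whose scorer is most confident. The sketch below is only an illustration in plain Python. The nearest-centroid scorer is a deliberately simple stand-in for a real binary model, and the toy data, the labels 1, 2, 3, and all function names are assumptions made for this example, not part of the project data or a recommended method.

```python
from math import dist  # Euclidean distance, available in Python 3.8+


def centroid(rows):
    """Component-wise mean of a non-empty list of equal-length feature vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def fit_one_vs_rest(X, y):
    """Fit one binary 'class c vs. the rest' scorer per class.

    Each toy scorer just stores two centroids: one for class c and one
    for all remaining observations.
    """
    models = {}
    for c in sorted(set(y)):
        pos = [x for x, label in zip(X, y) if label == c]
        neg = [x for x, label in zip(X, y) if label != c]
        models[c] = (centroid(pos), centroid(neg))
    return models


def predict_one_vs_rest(models, x):
    """Return the class whose binary scorer gives x the highest score."""
    def score(c):
        pos_center, neg_center = models[c]
        # Positive when x is closer to class c than to the rest.
        return dist(x, neg_center) - dist(x, pos_center)
    return max(models, key=score)


# Hypothetical toy data: two features, three classes.
X = [[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.8], [0.0, 5.0], [0.3, 5.2]]
y = [1, 1, 2, 2, 3, 3]

models = fit_one_vs_rest(X, y)
print(predict_one_vs_rest(models, [0.1, 0.2]))
print(predict_one_vs_rest(models, [4.9, 5.1]))
```

Many models handle multiple classes natively, in which case no such wrapper is needed; whichever route you take, state it in the report.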
Project Report Requirement
- [5 Points, 1/2 page] Project description and summary. This part should summarize your goal, approach, and conclusion.
- [10 Points, 1 page] Literature review. Read two relevant papers related to cardiotocograms and fetal death. Briefly summarize their approach and findings. Highlight any approach/idea that you borrowed from them.
- [10 Points, 1-2 pages] Summary Statistics. Provide a summary of your data using univariate analysis. For example, for continuous variables, what are the mean, median, etc., and are there any outliers? For categorical variables, what is the proportion of each category? How do you plan to handle them in your models? Will you remove any observation or variable, and for what reason? You need to provide tables and/or figures to properly display the information, and clearly document your data processing.
- [30 Points, 3-4 pages] Unsupervised learning. The goal of unsupervised learning is usually to understand the data and, if possible, discover potential clusters.
- Apply at least 3 different clustering algorithms to the data. You may even choose to use different subsets of variables for different algorithms. The decision is completely yours.
- Compare your findings from these three algorithms by matching your clusters to the true class fetal_health. However, you are not required to fit exactly 3 clusters for each algorithm.
- Please note that there is not necessarily a “best” approach. Your objective is to demonstrate your findings to the readers. This should involve presenting the results using tables/figures and interpreting them in words.
- [30 Points, 3-4 pages] Multi-class classification. Predict fetal_health status using the cardiotocogram data. fetal_health contains three categories, hence you need to consider how to model them properly.
- You need to use three different models: random forests, SVM and another one of your choice.
- Make sure that you consider tuning parameters properly and extensively.
- You should separate your data randomly into two parts: a training set that contains 75% of the data, and a validation set that contains the remaining 25%. Your training and parameter tuning should all be done on the training data. You should never look at the validation data before the final step, which is evaluating the performance. Make sure you set a random seed when you do this split so that it is reproducible.
- Display, interpret, and compare the performances. Make sure that you clearly document your model fitting procedure and tuning in words so that other readers can understand your approach.
- [15 Points] General requirements
- Is your report easy to read with clear logic?
- Is it written in a manner such that a reader does not need to be very familiar with the data?
- Are plots, tables, etc. informative, correctly displayed, and compact enough? For example, you should not include a super long table that takes a whole page.
- Is irrelevant code/output hidden? Overall, no more than 1/4 of the space should be used to display code.
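Two mechanical steps in the requirements above can be sketched in a few lines: the seeded random 75%/25% split, and a cross-tabulation that can be used both to match clusters to the true fetal_health class and to build a confusion matrix of true vs. predicted classes. The following is a minimal sketch in plain Python, not a required implementation. The function names, the seed value, and the toy label vectors are all hypothetical and chosen only for illustration.

```python
import random
from collections import Counter


def seeded_split(n_rows, train_fraction=0.75, seed=1):
    """Return (train_idx, valid_idx), a reproducible random partition of row indices."""
    rng = random.Random(seed)      # local generator, so the seed is explicit
    indices = list(range(n_rows))
    rng.shuffle(indices)
    n_train = round(train_fraction * n_rows)
    return sorted(indices[:n_train]), sorted(indices[n_train:])


def cross_tab(row_labels, col_labels):
    """Count co-occurrences of two label vectors of equal length.

    Works for cluster vs. true class as well as true vs. predicted class.
    """
    counts = Counter(zip(row_labels, col_labels))
    rows = sorted(set(row_labels))
    cols = sorted(set(col_labels))
    table = [[counts[(r, c)] for c in cols] for r in rows]
    return rows, cols, table


# Hypothetical labels for eight observations (not real project data).
true_class = [1, 1, 1, 2, 2, 3, 3, 3]
cluster_id = [0, 0, 1, 1, 1, 2, 2, 0]

train_idx, valid_idx = seeded_split(len(true_class))
print(len(train_idx), len(valid_idx))  # 6 2

rows, cols, table = cross_tab(cluster_id, true_class)
print("cluster \\ class", cols)
for r, row_counts in zip(rows, table):
    print(r, row_counts)
```

Calling seeded_split again with the same seed returns the same partition, which is what makes the validation set fixed across tuning runs. The validation indices should be set aside until the final evaluation.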