Project Overview
- Timeline
- Data
- Download the county-level COVID19 data (posted on Apr 19) [link removed]. The data provide COVID19 information up to Apr 15.
- New (final) version of the data (posted on Apr 23) [here]. It contains information up to Apr 22.
- This is part of the GitHub data processed and constructed by Prof. Bin Yu’s group at Berkeley. It contains both county-level and hospital-level data. However, we only use the county-level data. You should carefully read their documentation, especially this readme file and their paper.
- Project Goal
- Understand the data and the background, and review the literature. Sometimes this is the most important step in data analysis.
- Data analysis. Apply the models we have learned to understand the data and predict the number of infected people and the death count. Please note that these are longitudinal data. However, you are not allowed to use models outside our course; you are required to use simple strategies to model the time-dependent outcomes and predict them at a future time point.
- What should we do? As your collaborator, I am not interested in your prediction accuracy. Instead, I am interested in two realistic questions: 1) What population (the definition is up to you) is the most vulnerable to this virus? 2) What could we do to reduce mortality? Both answers need to be supported by your analysis results. To convince your collaborator, who knows little about machine learning methods, you need to focus on interpretation, presentation, science, and logic.
Submission Guideline
- Pick one team member as the team lead to submit the full report and supplementary files to his/her submission portal (compass2g). On the cover page of the report, you should clearly indicate all of your team members (their names and NetIDs).
- The full report should be in .pdf format. It should include a cover page and no more than 12 pages of content. Name your file the same way as homework, e.g., Project_yourNetID.pdf.
- Other team members should submit a one-page file (the cover page), which indicates only the members of the team and their team lead. DO NOT let multiple team members submit the same or different final reports. Doing so will lead to a 5-point penalty for all of your team members, and if those reports are different, the lowest score will be used.
Detailed Requirement
Your report should contain the following:
- [10 Points, 1/2 page] Project description and summary. This part should summarize your goal, approach, and conclusion.
- [5 Points, 1/2 page] Literature review. Read the paper by Prof. Bin Yu’s group [link] and possibly related papers. Briefly summarize their research goal and approach. Please note that you are not required to implement their approach, but you can always borrow their ideas if they are relevant. Also, keep in mind that we are only using a subset of their data.
- [20 Points, 2-3 pages] Unsupervised learning. The goal of unsupervised learning is usually understanding the data.
- You need to first address potential missing data problems before performing unsupervised learning. Please make sure to document your approach.
- You can treat any information contained in this dataset as covariates when performing this task. This includes both the demographics and health-related information, and the COVID19-related death and case counts.
- Questions you could consider: 1) Are there any underlying clusters of counties based on the demographics and health-related information? 2) Is there an underlying pattern (at the county level) in terms of how the COVID19 counts are growing? 3) Is there any potential association between the two? 4) You can also define your own research question to answer.
- You need to perform at least three different types of clustering methods overall while trying to answer these questions.
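The preprocessing and clustering steps above could be sketched in base R as follows. This is only an illustration under stated assumptions: the data are simulated, median imputation is one simple choice among many, and only two candidate methods (k-means and hierarchical clustering) are shown. The object names `x`, `x_imp`, `x_std`, `km`, `hc`, and `hc_labels` are hypothetical.

```r
set.seed(1)

# Simulated covariates standing in for county-level features (not real data).
x <- matrix(rnorm(200), ncol = 4)
x[sample(length(x), 10)] <- NA  # inject some missing values

# Median imputation, column by column (an assumed, simple strategy).
x_imp <- apply(x, 2, function(col) {
  col[is.na(col)] <- median(col, na.rm = TRUE)
  col
})

# Standardize so that no single covariate dominates the distance.
x_std <- scale(x_imp)

# Method 1: k-means with k = 3 (k is a tuning parameter; 3 is arbitrary here).
km <- kmeans(x_std, centers = 3, nstart = 20)

# Method 2: hierarchical clustering with complete linkage, cut into 3 groups.
hc <- hclust(dist(x_std), method = "complete")
hc_labels <- cutree(hc, k = 3)

# Cross-tabulate the two clusterings to see how far they agree.
agreement <- table(kmeans = km$cluster, hclust = hc_labels)
print(agreement)
```

Whatever imputation rule and number of clusters you choose, the report should document them and justify the choice.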
- [50 Points] Supervised Learning.
- [15 Points, 1-2 pages] Classification problem. Define a class variable: Death per 100,000 population > 1. This can be done using tot_deaths/PopulationEstimate2018. Fit at least two different classification models to this outcome using the demographics and health-related information. Grading of this question will be based on:
- Whether the methods are implemented correctly
- Whether tuning parameters are considered and properly tuned
- Whether the results are presented properly
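As an illustration, the class variable for the classification problem might be constructed as below. This is a minimal sketch on a made-up three-county data frame: only the column names `tot_deaths` and `PopulationEstimate2018` come from the dataset, while `county_demo`, `death_per_100k`, and `high_death` are hypothetical names.

```r
# Toy stand-in for the county-level data (values are made up for illustration).
county_demo <- data.frame(
  tot_deaths = c(0, 3, 50),
  PopulationEstimate2018 = c(20000, 150000, 1000000)
)

# Deaths per 100,000 population.
death_per_100k <- county_demo$tot_deaths /
  county_demo$PopulationEstimate2018 * 1e5

# Binary class label: 1 if the death rate exceeds 1 per 100,000, 0 otherwise.
county_demo$high_death <- as.integer(death_per_100k > 1)
print(county_demo)
```

In the real data, counties with a missing or zero population estimate would need to be handled before this step.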
- [35 Points, 4-5 pages] Regression problem. You need to figure out an approach (whether to borrow ideas from their paper or to come up with your own) to predict the number of deaths one week from Apr 22 (the final version of this data). Please note that by the time this report is due, we should already have information on the death and infection counts of Apr 29. Since this is a developing situation, you are allowed to use additional covariate information to assist your prediction; however, this is not required. Specific cases should be discussed on Piazza or during office hours with me. However, the true prediction accuracy on the Apr 29 data will not be the main judging criterion for your report. You should consider at least three different regression models in this question. Grading will be based on:
- Whether you presented your approach clearly and implemented it accurately
- Whether tuning parameters are considered and properly tuned
- Whether the results are presented properly
- [5 Bonus Points] Acquire updated information for Apr 29 (if available). Validate your model and discuss whether any improvement can be made. If you choose to do so, your approach and results should be clearly stated and presented.
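One simple strategy for the one-week-ahead regression is to train on a pair of dates that are seven days apart and then shift the window forward. The sketch below is not the approach of the referenced paper and not a required method. It uses simulated counts and hypothetical names (`cum_deaths`, `train`, `fit`, `new_x`, `pred_day22`); with the real data, the training pair could be Apr 15 and Apr 22, and the prediction target Apr 29.

```r
set.seed(1)

# Simulated cumulative death counts: 100 hypothetical counties x 15 days.
n_county <- 100
n_day <- 15
daily <- matrix(rpois(n_county * n_day, lambda = 2), nrow = n_county)
cum_deaths <- t(apply(daily, 1, cumsum))  # rows = counties, columns = days

# Training pair: covariates observed on day 8, outcome 7 days later (day 15).
# log1p is an assumed transformation to tame the skewness of counts.
train <- data.frame(
  y = log1p(cum_deaths[, 15]),
  current = log1p(cum_deaths[, 8]),
  past_week_growth = log1p(cum_deaths[, 8] - cum_deaths[, 1])
)

# A linear model; any regression method from the course could be swapped in.
fit <- lm(y ~ current + past_week_growth, data = train)

# Shift the window: use day-15 covariates to predict day 22 (unobserved).
new_x <- data.frame(
  current = log1p(cum_deaths[, 15]),
  past_week_growth = log1p(cum_deaths[, 15] - cum_deaths[, 8])
)
pred_day22 <- expm1(predict(fit, newdata = new_x))
summary(pred_day22)
```

Demographic and health-related covariates can be added to the same data frame, and the bonus validation amounts to comparing such predictions with the observed counts once they are available.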
- [15 Points, 1-2 pages] Your collaborator’s question. Answer the collaborator’s two questions based on your analysis results. Your answer should be based mainly on your previous analysis results, or additional analysis if you choose to do so. You can also borrow results from the literature to support your arguments. Grading will be based on:
- What is your approach to identifying this sub-population? Did you present your approach clearly and implement it accurately?
- Whether the results are presented properly, and the argument is logical.
- General requirements. Please note that general organization, neatness, readability, and use of R account for 20% of your total score (same as the homework).
- Is your report easy to read with clear logic?
- Is it written in a manner such that a reader does not need to be very familiar with the data?
- Are plots, tables, etc. informative and correctly displayed?
- Are you properly utilizing R markdown?
- Is irrelevant code/output hidden?
- Overall, I would expect no more than 1/4 of the space to be used for displaying code.
Further clarifications and Important Updates
Other questions, if any, will be clarified here:
- One week from the final version of the data is Apr 29. There was a typo when stating the date of the final version.
- We provided several possible “questions” you could consider; however, you are not required to answer all of them. You need to find some questions to answer, but as long as you perform a total of 3 clustering methods, that is fine. The key requirement is to present results that answer the questions you picked.
- Our data has been updated. Currently, the final version contains information up to Apr 22.
- Our initial prediction target was 2 weeks ahead; however, that is probably not realistic, so we changed it to 1 week.