Timeline
There are three assignments for the project. Their due dates are:
- Group choice: Nov 9, 11:59 PM
- Analysis Proposal: You can submit your proposal anytime between Nov 12 - 23, 11:59 PM (New deadline Nov 27, 11:59 PM)
- Final report: Dec 15, 11:59 PM
Group choice
You may form a team of no more than 3 members. You can choose to work with students from a different section (8AM/9AM). If you need help finding a team or teammates, this Google Spreadsheet might help. Post your or your team’s information there to find additional members. Once all of your teammates have agreed to this collaboration, you should do the following before Nov 9, 11:59 PM:
- Decide on a team name (be creative) and choose one contact person from your team.
- The contact person will be responsible for all communications and for submitting the proposal and the final report to Compass.
- Each member should go to this Google form and submit their information. The form collects:
- Your email address (must be Illinois email)
- Your name
- Your team name
- Team size
- Team contact person email (must be Illinois email)
- You should receive a confirmation email.
- You can only submit this form once, so check your info carefully.
Data Selection
You may use any dataset of your choice, so long as it contains a minimum of 500 observations and 50 variables (see FAQ for details). This dataset might be relevant to research outside of this course, another field, or some other interest of yours. If you have any questions about whether your data is appropriate, do not hesitate to ask. If you plan to use data from another endeavor of yours, such as a research project, be sure to gain permission from the controlling authority first.
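One quick way to check a candidate dataset against the size requirement is sketched below. The function name `meets_size_requirement` and the simulated data frame are hypothetical and used only for illustration; the thresholds of 500 observations and 50 variables come from the requirement above. The FAQ exception for very rich variables is not captured by this check.

```r
# Hypothetical helper: checks the minimum size stated above
# (500 observations and 50 variables by default).
meets_size_requirement <- function(df, min_obs = 500, min_vars = 50) {
  nrow(df) >= min_obs && ncol(df) >= min_vars
}

# Simulated stand-in for a real dataset: 600 rows, 55 numeric columns.
set.seed(1)
candidate <- as.data.frame(matrix(rnorm(600 * 55), nrow = 600, ncol = 55))

dim(candidate)                      # number of observations, number of variables
meets_size_requirement(candidate)   # TRUE when both minimums are met
meets_size_requirement(mtcars)      # the built-in mtcars data (32 x 11) is too small
```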
The two most common sources of data used by students:
This is a repository for more advanced data, provided by Nature. However, most datasets here are large and will require an extensive data cleaning process.
Some other interesting datasets
- This MinneMUDAC Challenge dataset is an ongoing challenge that has a lot of rich data. You may consider defining your own research question.
Analysis Proposal
A proposal of your intended project is due between Nov 12 - Nov 23, 11:59 PM (extended to Nov 27, 11:59 PM; see Timeline). It should be submitted online via Compass by your group contact member. The earlier you submit your proposal, the earlier you may receive feedback and start working on the final report.
A proposal of your intended project should include the following:
- The names and NetIDs of the students who will be contributing to the group project.
- A tentative title for the project.
- Description of the dataset. You need to list all the variables (input and outcome) and mention how they are relevant to the analysis goal.
- Load the data into R and print the first several observations of the data.
- Background information on the dataset, including specific citation of its source.
- The statistical learning task that the dataset will be used to accomplish (regression or classification). You don’t need to actually perform the analysis; you only need to lay out the plan.
- What are the challenges? For example, a plan for a straightforward linear regression is not suitable for this project because it does not pose any challenge. Possible interesting directions may include: very large dataset (> 500MB), very heavy computational burden, extensive data visualization, data integration of multiple sources, scientific discoveries, a complicated model that is not covered in this course, etc. There is no limit on what you can do.
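The proposal step of loading the data into R and printing the first several observations could look like the minimal sketch below. The temporary CSV built from the built-in `iris` data is only a stand-in so that the code runs as-is; the names `csv_path` and `project_data` are illustrative, and your own file path and reader function may differ.

```r
# Stand-in for a real project file (assumption): write the built-in
# iris data to a temporary CSV so that this sketch runs as-is.
csv_path <- tempfile(fileext = ".csv")
write.csv(iris, csv_path, row.names = FALSE)

# Load the data into R; replace csv_path with the path to your own file.
project_data <- read.csv(csv_path)

head(project_data)   # first six observations
str(project_data)    # variable names and types, useful when listing variables
```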
After review of the proposal (within several days of your submission), it will be evaluated in one of three ways:
- Approved - Your group may proceed with your plans for the data and project.
- Pending - We will provide suggestions, concerns, or needed information that must be addressed before the proposal will be approved.
- Rejected - The dataset is not appropriate or is too trivial for the final project.
Final Report
The final report of your analysis is due by Saturday, December 15, 11:59 PM. It should be submitted online via Compass by your group contact member.
As a group, you will submit files as you would for homework, which include a .pdf/.html file and an .Rmd file. If your data is less than 10MB, you should submit the data too. Otherwise, you should provide a shared link to your actual cleaned R data (using, for example, Google Drive). Your report must contain the following:
- Project title
- Group Member Info
- Introduction and literature review (1 - 2 pages)
- Data source information (provide a link to your data source)
- You must have a comprehensive introduction of the data (no analysis) and the scientific goal
- You must review the literature (or related sources) and report existing (possibly online) analysis results on this dataset
- Summary statistics and data visualization (less than 5 pages)
- You should provide a comprehensive summary of the data using tools we introduced
- Visualization is necessary for presenting the results
- Your proposed analysis (less than 10 pages)
- You should describe your approach and present your analysis results in a very comprehensive way
- Conclusion and discussion (1 - 2 pages)
- Summarize your scientific findings
- Address any potential pitfalls of your analysis, and discuss potential improvements (if you don’t have enough time to implement them)
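To decide whether the cleaned data can be submitted directly or needs a shared link, the file size can be checked in R. This is a minimal sketch: the temporary `.RData` file is a stand-in for your own cleaned data, the variable names are illustrative, and treating 1 MB as 1024^2 bytes is an assumption, since the 10MB limit above does not specify the convention.

```r
# Stand-in for your cleaned data (assumption): save a built-in dataset
# to a temporary .RData file so that this sketch runs as-is.
cleaned_data <- mtcars
data_path <- tempfile(fileext = ".RData")
save(cleaned_data, file = data_path)

# file.size() returns bytes; convert to megabytes
# (1 MB taken as 1024^2 bytes, which is an assumption).
size_mb <- file.size(data_path) / 1024^2
submit_directly <- size_mb < 10

size_mb
submit_directly   # TRUE here: this small file is under the 10MB limit
```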
Grading
Group Choice
Grading for the group choice is all-or-nothing based on submitting the Google form before the deadline.
- Percent of final grade: 1%
Proposal
You will be graded on formatting, motivation, appropriateness of data, etc.
- Percent of final grade: 4%
Final Report
- Percent of final grade: 25%
A breakdown of the points for the final report (with total points of 100). Please note that these are only suggestions and minimal requirements. The instructor and TAs reserve the right to interpret the rubric.
- (5 points) Introduction and literature review:
- Provide enough background to the reader such that they can understand your goal without seeing the data
- (20 points) Summary statistics and data visualization:
- Are your summary statistics correct and informative?
- Is your visualization of the data correct and informative?
- (20 points) Use of statistical learning methodology:
- Have you used the appropriate methods for your dataset?
- Have you applied them correctly?
- (20 points) Interpretation of statistical learning methodology:
- Do you arrive at the correct conclusions from the analyses you perform?
- Do you correctly interpret the analysis results in terms of the original scientific problem?
- (5 points) Conclusion and discussion:
- Objectively summarize your findings and analysis experience
- (10 points) Use of R:
- Does your code perform the desired task?
- Is your code readable?
- (10 points) Use of R markdown:
- Are you properly utilizing R markdown to have a clean report?
- Is irrelevant code/output hidden?
- Are plots, tables, etc. properly displayed?
- (10 points) General Organization, Neatness, Readability:
- Is your report easy to read, with clear logic?
- Is it written in a manner such that a reader does not already need to be familiar with the data?
- Bonus points (1 - 10)
- These may be awarded to a project that analyzes an extremely complicated dataset.
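For the R markdown items in the rubric above, one common way to keep irrelevant code and output out of the knitted report is to set default chunk options once. The sketch below assumes the knitr package is installed; the particular defaults shown are a suggestion, not a course requirement.

```r
# Typically placed in a setup chunk near the top of the .Rmd file.
# Assumes the knitr package is installed (it is used to knit R Markdown).
knitr::opts_chunk$set(
  echo = FALSE,      # hide code by default
  message = FALSE,   # hide package startup and other messages
  warning = FALSE    # hide warnings
)
```

Individual chunks can still override these defaults, for example with `echo = TRUE` on code that the reader should see.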
FAQ
This section will likely be updated as we progress through the remainder of the semester.
- On the number of variables: the requirement on the number of variables can be lowered if your data is rich enough. For example, a dataset that contains text may have a sentence or paragraph as one variable; however, the information in this variable is very rich. In that case, a single variable like this will satisfy the criteria.