Timeline

There are three assignments for the project. Their due dates are:

Group choice

You may form a team of no more than 3 members. You can choose to work with students from a different section (8AM/9AM). If you need help finding a team or teammate, this Google Spreadsheet might help. Post your own or your team’s information here to find additional members. Once all of your teammates have agreed to the collaboration, you should do the following before Nov 9, 11:59 PM:

Data Selection

You may use any dataset of your choice, so long as it contains a minimum of 500 observations and 50 variables (See FAQ for details). This dataset might be relevant to research outside of this course, another field, or some other interest of yours. If you have any questions about whether your data is appropriate, do not hesitate to ask. If you plan to use data from another endeavor of yours, such as a research project, be sure to gain permission from the controlling authority first.
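The size requirement above can be checked quickly in R before you commit to a dataset. The following is only a minimal sketch: `project_data` is simulated stand-in data and `meets_size_requirement` is a hypothetical helper, neither of which is provided by the course. The FAQ may refine how observations and variables are counted, so treat the result as a first check rather than a final answer.

```r
# Hypothetical stand-in for your own data; in practice, load your file instead,
# for example with read.csv().
set.seed(1)
project_data <- as.data.frame(matrix(rnorm(600 * 55), nrow = 600, ncol = 55))

# Minimum size stated in the project description
min_obs <- 500
min_vars <- 50

# Returns TRUE when the data has at least n_min rows and p_min columns
meets_size_requirement <- function(data, n_min = min_obs, p_min = min_vars) {
  nrow(data) >= n_min && ncol(data) >= p_min
}

dim(project_data)                       # number of rows and columns
meets_size_requirement(project_data)    # TRUE or FALSE
```

Rows are treated as observations and columns as variables here, which is an assumption about how your data is laid out.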

The two most common sources of data used by students:

This is a repository for more advanced data, provided by Nature. However, most datasets here are large and will require an extensive data cleaning process.

Some other interesting datasets

Analysis Proposal

A proposal of your intended project is due between Nov 12 and Nov 23, 11:59 PM. It should be submitted online via Compass by your group contact member. The earlier you submit your proposal, the earlier you can receive feedback and start working on the final report.

A proposal of your intended project should include the following:

After review of the proposal (within several days of your submission), it will be evaluated in one of two ways:

Final Report

The final report of your analysis is due by Saturday, December 15, 11:59 PM. It should be submitted online via Compass by your group contact member.

As a group, you will submit files as you would for homework: a .pdf or .html report and the .Rmd file that produces it. If your data is less than 10MB, you should submit the data too. Otherwise, you should provide a shared link to your cleaned R data (using, for example, Google Drive). Your report must contain the following:

  1. Project title
  2. Group Member Info
  3. Introduction and literature review (1 - 2 pages)
    • Data source information (provide a link to your data source)
    • You must have a comprehensive introduction of the data (no analysis) and the scientific goal
    • You must review the literature (or related sources) and report existing (possibly online) analysis results on this dataset
  4. Summary statistics and data visualization (less than 5 pages)
    • You should provide a comprehensive summary of the data using tools we introduced
    • Visualization is necessary for presenting the results
  5. Your proposed analysis (less than 10 pages)
    • You should describe your approach and present your analysis results in a very comprehensive way
  6. Conclusion and discussion (1 - 2 pages)
    • Summarize your scientific findings
    • Address any potential pitfalls of your analysis, and discuss potential improvements (if you don’t have enough time to implement them)
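
To decide whether the cleaned data can be submitted directly or needs a shared link (the 10MB rule stated above the report outline), you can save it and check the file size in R. This is a rough sketch under stated assumptions: `cleaned_data` is made-up example data, the file is written to a temporary folder, and 1 MB is taken to be 1e6 bytes, which the project description does not specify.

```r
# Hypothetical cleaned data set; replace with your own object
set.seed(1)
cleaned_data <- data.frame(id = 1:1000, value = rnorm(1000))

# Saved to a temporary folder here; in practice use a path in your project folder
data_path <- file.path(tempdir(), "cleaned_data.rds")
saveRDS(cleaned_data, data_path)

# file.size() returns bytes; convert to megabytes (assuming 1 MB = 1e6 bytes)
size_mb <- file.size(data_path) / 1e6

# TRUE: submit the file itself; FALSE: provide a shared link instead
submit_directly <- size_mb < 10
submit_directly
```

The file can be read back with `readRDS(data_path)` inside the .Rmd file, so the report runs on the same cleaned data.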

Grading

Group Choice

Grading for the group choice is all-or-nothing, based on submitting the Google form before the deadline.

  • Percent of final grade: 1%

Proposal

You will be graded on formatting, motivation, appropriateness of data, etc.

  • Percent of final grade: 4%

Final Report

  • Percent of final grade: 25%

The points for the final report (100 in total) break down as follows. Please note that these are only suggestions and minimal requirements. The instructor and TAs reserve the right to interpret the rubric.

  • (5 points) Introduction and literature review:
    • Provide enough background to the reader such that they can understand your goal without seeing the data
  • (20 points) Summary statistics and data visualization:
    • Are your summary statistics correct and informative?
    • Is your visualization of the data correct and informative?
  • (20 points) Use of statistical learning methodology:
    • Have you used the appropriate methods for your dataset?
    • Have you applied them correctly?
  • (20 points) Interpretation of statistical learning methodology:
    • Do you arrive at the correct conclusions from the analyses you perform?
    • Do you correctly interpret the analysis results in terms of the original scientific problem?
  • (5 points) Conclusion and discussion:
    • Objectively summarize your findings and analysis experience
  • (10 points) Use of R:
    • Does your code perform the desired task?
    • Is your code readable?
  • (10 points) Use of R markdown:
    • Are you properly utilizing R markdown to have a clean report?
    • Is irrelevant code/output hidden?
    • Are plots, tables, etc. properly displayed?
  • (10 points) General Organization, Neatness, Readability:
    • Is your report easy to read, with clear logic?
    • Is it written in a manner such that a reader does not already need to be familiar with the data?
  • Bonus points (1 - 10)
    • These may be awarded to a project that analyzes an extremely complicated dataset.

FAQ

This section will likely be updated as we progress through the remainder of the semester.