Final Project Overview
You have two options to complete the final project. You can also watch the video recording of the discussion [here].
- [Option 1] I will provide you with a dataset (usually large, with a complex structure) and you will be asked to do the following:
- A brief literature review
- A series of unsupervised learning, classification and regression tasks
- Present and interpret your results
- [Option 2] You can also propose a data analytic problem yourself
- The data can come from any public data repository, but it should be sufficiently large
- You need my approval. To do this, you and your team need to set up a meeting with me no later than Oct 31, 11:59 PM to discuss your analysis goal and plan before you start to work on the project.
- Between Nov 1 and the due date, you and your team need to meet with me at least one more time to report your progress.
- Your team needs to give a 15-minute presentation of your results during the last two weeks of this course.
- Your score will be determined by me, our TA, the graders, and everyone attending the presentation.
For both options, each team can have up to three team members.
Final Project: Microbiome Data Analysis
- Timeline
- Data
- The data (.rar file) can be downloaded [here]. A brief data summary is given [here].
- This is a microbiome dataset, processed from the American Gut database at this Biocore GitHub repository. For background on this dataset, you can read the following paper:
- McDonald, Daniel, et al. “American Gut: an open platform for citizen science microbiome research.” Msystems 3.3 (2018): e00031-18. [link]
- For this project, we provide a processed dataset that contains over 9511 samples and more than 32,954 OTU (Operational Taxonomic Unit) variables at the species level. In addition, there are two demographic variables (Race and Sex), one continuous health outcome (BMI, body mass index), and two categorical health outcomes (BMI category and alcohol consumption frequency). You may not be familiar with some of these terms, so it may be necessary to search for and read about related concepts and papers.
- An important concept you should consider is the compositional nature of microbiome data, meaning that the OTU values for each subject sum to 1. Some existing papers may help you understand the challenges and introduce available tools. Furthermore, the data is very sparse, meaning that there are a lot of zeros. For example, 92.83% of the variables have fewer than 1% nonzero entries. The following papers provide some data analysis examples. However, don’t get swamped in the literature since your time is limited.
- Li, Hongzhe. “Microbiome, metagenomics, and high-dimensional compositional data analysis.” Annual Review of Statistics and Its Application 2 (2015): 73-94. (You should be able to download this paper from the UIUC library)
- Gloor, Gregory B., and Gregor Reid. “Compositional analysis: a valid approach to analyze microbiome high-throughput sequencing data.” Canadian journal of microbiology 62.8 (2016): 692-703. [link]
- This particular version of the data was processed by Yutong Li as part of his research. Hence you should not distribute this data to other parties.
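As a rough illustration of the two data properties described above (sparsity and compositionality), the sketch below computes the fraction of nonzero entries per OTU and applies a centered log-ratio (CLR) transform to each sample. CLR is one common option in the compositional data literature, not a required approach for this project. The function names, the pseudocount value, and the toy matrix are all assumptions made for illustration; they are not part of the provided dataset.

```python
import math


def nonzero_fraction(column):
    """Fraction of entries in one OTU column that are nonzero."""
    return sum(1 for value in column if value != 0) / len(column)


def clr(sample, pseudocount=1e-6):
    """Centered log-ratio transform of one compositional sample.

    The pseudocount is a hypothetical choice: it shifts the zeros so that
    the logarithm is defined. Other zero-handling strategies exist.
    """
    logs = [math.log(value + pseudocount) for value in sample]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


# Toy data (made up): 3 samples (rows) by 4 OTUs (columns); each row sums to 1.
toy = [
    [0.50, 0.50, 0.00, 0.00],
    [0.25, 0.25, 0.50, 0.00],
    [1.00, 0.00, 0.00, 0.00],
]

# Sparsity is measured per OTU, so work column by column.
columns = list(zip(*toy))
fractions = [nonzero_fraction(column) for column in columns]

# The transform is applied per sample, so work row by row.
transformed = [clr(row) for row in toy]

print("nonzero fraction per OTU:", fractions)
print("CLR of first sample:", transformed[0])
```

After the transform, the values of each sample are centered at zero, so methods that assume unconstrained real-valued inputs can be applied more safely. The results can be sensitive to the pseudocount, which is worth documenting in the report.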
- Project Goal
- Understand the data and the background, and review the literature. This is a very important step before your data analysis.
- Data analysis. Apply models we have learned to understand the data and predict the health outcomes provided. You are not allowed to use supervised models outside our course (for example, deep learning is prohibited). The expectation of this project is that you can understand and implement simple strategies to address the compositional structure of the data, and directly apply some existing models. However, if you choose to write your own code to implement a new model based on the literature, then you are not limited to the topics within our course.
- Presenting the results. The underlying biology is very complicated, hence your findings may or may not be true. However, you would present the results to your collaborators, who are experts in microbiome studies. Your collaborators have very limited knowledge of statistics and machine learning (only some understanding of linear regression and clustering), hence you need to find a way to present the results to them intuitively.
- Submission Guideline
- Pick one team member as the team lead to submit the full report and supplementary files to his/her submission portal (compass2g). On the cover page of the report, you should clearly indicate all of your team members (their names and NetID).
- The full report should be in .pdf format. It should include a cover page and no more than 12 pages of contents. Name your file the same way as homework, e.g., Project_yourNetID.pdf.
- Other team members should submit a one-page file (the cover page), which indicates only the members of the team and their team lead. DO NOT let multiple team members submit the same or different final reports. Doing so will lead to a 5-point penalty for all of your team members, and if those reports are different, the lowest score will be used.
Detailed Requirement
- [5 Points, 1/2 page] Project description and summary. This part should summarize your goal, approach, and conclusion.
- [10 Points, 1 page] Literature review. Read 2-5 relevant papers in the field of microbiome studies. Briefly summarize their approach and findings. Highlight any approach/idea that you borrowed from them.
- [25 Points, 2-3 pages] Unsupervised learning. The goal of unsupervised learning is usually to understand the data and, if possible, discover potential clusters.
- You should only use the OTU variables as covariates when performing this task.
- You are likely to encounter problems since the data is very sparse (many zeros). Make sure to document your approach when addressing this problem.
- Questions you could answer: 1) What is the level of sparsity and how does that affect the clustering results? 2) Are there any underlying clusters based on OTU information? 3) You can also define your own research question to answer.
- You need to apply at least three different types of clustering methods while trying to answer these questions. Again, it is very important to present your results properly and intuitively, and to interpret them correctly.
- [35 Points] Supervised Learning. You should model three different outcomes: BMI, BMI (categorical), and alcohol consumption frequency (categorical).
- Incorporate strategies (either from the literature or of your own design) to process the compositional data.
- Address missing data problems.
- For modeling BMI, use at least two different regression models. For example, Lasso and ridge are both treated as regression models.
- For modeling BMI category and alcohol consumption frequency, use at least three different classification models.
- Tune parameters properly.
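As one possible reading of "tune parameters properly", the sketch below selects the number of neighbors for a k-nearest-neighbor classifier by K-fold cross-validation. It is only an illustration: the classifier choice, the function names, the fold assignment, and the toy data are assumptions, and in practice you would likely use an existing implementation of a model covered in the course rather than this hand-written one.

```python
from collections import Counter


def knn_predict(train_x, train_y, point, k):
    """Majority vote among the k nearest training points (squared Euclidean distance)."""
    order = sorted(
        range(len(train_x)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], point)),
    )
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def cv_accuracy(x, y, k, n_folds=5):
    """Pooled held-out accuracy of k-NN, with fold membership given by index modulo n_folds."""
    correct = 0
    for fold in range(n_folds):
        test_idx = [i for i in range(len(x)) if i % n_folds == fold]
        train_idx = [i for i in range(len(x)) if i % n_folds != fold]
        train_x = [x[i] for i in train_idx]
        train_y = [y[i] for i in train_idx]
        correct += sum(
            knn_predict(train_x, train_y, x[i], k) == y[i] for i in test_idx
        )
    return correct / len(x)


# Toy data (made up): two well-separated groups standing in for two outcome categories.
x = [[0.1 * i, 0.1 * i] for i in range(10)] + [[5 + 0.1 * i, 5 + 0.1 * i] for i in range(10)]
y = ["low"] * 10 + ["high"] * 10

# Score every candidate on held-out folds only, then keep the best one.
candidates = [1, 3, 5]
scores = {k: cv_accuracy(x, y, k) for k in candidates}
best_k = max(candidates, key=lambda k: scores[k])

print("cross-validated accuracy per k:", scores)
print("selected k:", best_k)
```

The key point is that each candidate value is scored only on observations that were held out when fitting. Any preprocessing that learns from the data (for example, filtering OTUs by prevalence) should be repeated inside each fold for the same reason.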
- [10 Points, 1-2 pages] Your collaborator’s question. The data is very noisy and, based on our current analysis, there is very little signal. Hence the collaborator is concerned about whether the findings are real. As data scientists, we often use cross-validation or similar approaches to justify our findings; however, such approaches may still be biased if we extensively try different models and tuning parameters to find the best cross-validation error. Furthermore, it is difficult to justify findings from unsupervised learning approaches.
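One way to probe whether a finding is real is a permutation test: shuffle the outcome many times, rerun the same analysis on each shuffled copy, and see how often the shuffled data scores as well as the real data. The sketch below is a minimal version under stated assumptions. The function names and the toy data are hypothetical, and the absolute correlation used as the score is only a stand-in; for this project the score function would need to wrap your entire pipeline, including any model selection and tuning, so that the selection bias described above is reproduced under the null.

```python
import math
import random


def abs_correlation(x, y):
    """Absolute Pearson correlation; a stand-in for a full analysis pipeline."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return abs(sxy) / math.sqrt(sxx * syy)


def permutation_p_value(score_fn, features, outcome, n_permutations=200, seed=0):
    """Share of shuffled outcomes whose score is at least the observed score."""
    rng = random.Random(seed)
    observed_score = score_fn(features, outcome)
    shuffled = list(outcome)
    at_least_as_good = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # breaks any real link between features and outcome
        if score_fn(features, shuffled) >= observed_score:
            at_least_as_good += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return observed_score, (at_least_as_good + 1) / (n_permutations + 1)


# Toy data (made up): the outcome is an exact linear function of the feature.
feature = [float(i) for i in range(20)]
outcome = [2.0 * value + 1.0 for value in feature]

observed, p_value = permutation_p_value(abs_correlation, feature, outcome)
print("observed score:", observed)
print("permutation p-value:", p_value)
```

A small p-value suggests the observed score is unlikely under "no association". The same idea can be adapted to clustering by comparing a cluster-quality measure on the real data against the same measure on data with the structure deliberately destroyed, although the right way to destroy the structure is itself a modeling choice.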
- [15 Points] General requirements
- Is your report easy to read with clear logic?
- Is it written in a manner such that a reader does not need to be very familiar with the data?
- Are plots, tables, etc. informative and correctly displayed?
- Is irrelevant code/output hidden?
- Overall, I would expect no more than 1/4 of the space to be used for displaying code.
Self-proposed Projects and Presentation
- Eduardo Medina-Cortina
- Randomized trials play a key role in the evaluation of social and economic programs and medical treatments. Researchers and policy makers are often interested in measurements of the impact of the treatment that go beyond the average treatment effect. In this project, I investigate the existence of heterogeneous treatment effects in a randomized controlled trial designed to investigate the effect of micro-credits on small business owners in Cairo, Egypt.
- [On League of Legends: Character Choices and Their Impact on Gameplay]
- Theren Williams and Eduardo Cardenas-Torres
- Chelsea M. Peterson
- Emerging approaches to making agricultural management practice recommendations for water quality improvement require highly spatially resolved data or time-intensive process-based models. My objective is to develop minimal complexity generalized additive models to predict crop yields and nutrient loads from average annual rainfall, drainage system design, and seasonal management decisions using 50 years of field-scale water quality data from across North America. I will apply the models to identify profit-maximizing management decisions and estimate the corresponding nutrient losses to waterways for a sample of farms in the dataset.
- [Statistical Learning Methods for Agricultural Nutrient Load Reduction]
- Lishen He
- Sagnik Paul
- Prediction of Ruptured Cable based on the Nodal positions of a Tensegrity Footbridge Structure. For a symmetric tensegrity footbridge structure, the nodal positions of a set of 10 nodes in one half of the structure with respect to x, y and z axes are taken before and after the rupture of 4 different cables through an experiment. The aim is to develop a model to predict the damage and the damage location in the structure based on the positions of the 10 nodes in real time. The ability to detect the damaged cable in the structure shall help in starting the damage mitigation process of the ruptured cable.
- [Prediction of Ruptured Cable based on the Nodal Position of a Tensegrity Footbridge Structure]