Final Project
- Timeline
- Release date: Oct 13
- Due date: 11:59 PM, Fri. Dec 10th
- Data Information
- We will use the Kaggle BRCA Multi-Omics (TCGA) data
- Download the brca_data_w_subtypes.csv dataset
- The data contains 705 observations, 1936 variables (860 copy number variations, 249 mutations, 604 gene expressions, and 223 protein levels), and 5 outcomes (vital.status, PR.Status, ER.Status, HER2.Final.Status, and histological.type)
- We will discard the vital.status variable. PR.Status, ER.Status, and HER2.Final.Status are determined using immunohistochemistry scoring. For these variables, we will only consider two levels: “Positive” and “Negative”. For histological.type, we will only consider “infiltrating lobular carcinoma” and “infiltrating ductal carcinoma”. You can treat all other categories as missing values. Hence, all four outcomes should be binary.
- The goal of this project is to use the 1936 genetic markers to predict these four cancer outcomes.
- A background of this dataset can be found here
- Information about breast cancer subtypes defined using 50 gene expressions (PAM50) can be found in this famous article
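The outcome recoding described under Data Information can be sketched in base R as follows. This is only an illustrative sketch: the helper name recode_outcomes is hypothetical, and it assumes the column names in the CSV match the outcome names listed above exactly.

```r
# Recode the four outcomes to binary factors; every other level becomes NA.
# recode_outcomes is a hypothetical helper name, not part of the assignment.
recode_outcomes <- function(dat) {
  dat$vital.status <- NULL  # discarded, as stated in the project description

  # Immunohistochemistry outcomes: keep only "Positive" and "Negative".
  for (v in c("PR.Status", "ER.Status", "HER2.Final.Status")) {
    val <- as.character(dat[[v]])
    dat[[v]] <- factor(ifelse(val %in% c("Positive", "Negative"), val, NA),
                       levels = c("Negative", "Positive"))
  }

  # Histological type: keep only the lobular and ductal categories.
  keep <- c("infiltrating lobular carcinoma", "infiltrating ductal carcinoma")
  h <- as.character(dat$histological.type)
  dat$histological.type <- factor(ifelse(h %in% keep, h, NA), levels = keep)

  dat
}

# Usage (assumes the CSV sits in the working directory):
# brca <- recode_outcomes(read.csv("brca_data_w_subtypes.csv"))
# lapply(brca[c("PR.Status", "ER.Status", "HER2.Final.Status",
#               "histological.type")], table, useNA = "ifany")
```

Tabulating each recoded outcome with useNA = "ifany" is a quick way to confirm that exactly two levels remain and to see how many observations became missing.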
- Submission Guideline
- Pick one team member as the team lead to submit the full report and supplementary files to Gradescope.
- The full report should be in .pdf format. It should include a cover page and no more than 12 pages of content. Name your file the same way as the homework, e.g., Project_yourNetID.pdf.
- On the cover page of the report, you should clearly indicate all of your team members (their names and NetIDs).
- Other team members should submit a one-page file (same as the cover page), which indicates only the members of the team and their team lead. DO NOT let multiple team members submit the same or different final reports. Doing so will lead to a 5-point penalty for all of your team members, and if those reports are different, the lowest score will be used.
Project Report Requirements
- [5 Points, 1 page] Project description and summary. This part should summarize your goal, approach, and conclusion.
- [10 Points, 1 page] Literature review. Read two more relevant papers (in addition to the two provided in the data information) about breast cancer subtype identification. Briefly summarize their approaches and findings. Highlight any approach or idea that you borrowed from them.
- [10 Points, 1-2 pages] Summary statistics and data processing.
- Provide a summary of your data using univariate analysis. Please note that this is not asking you to print pages of summary statistics. Only provide essential information.
- For example, for continuous predictors, are there any outliers or missing values? Do you need to do any transformations?
- For categorical predictors, do you need to deal with variables that are extremely unbalanced?
- Did you decide to remove any variables or observations from the analysis? If so, for what reason?
- You need to provide tables and/or figures to properly display the information to support your decision and clearly document your processing steps.
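As one possible starting point for the univariate screening above, the base R sketch below tabulates missing values and how dominant the most common value is for each predictor. The function name screen_predictors and the 0.95 cut-off in the usage comment are illustrative assumptions, not course requirements.

```r
# Per-column screening summary for a data frame of predictors.
# screen_predictors is a hypothetical helper name.
screen_predictors <- function(x) {
  data.frame(
    variable   = names(x),
    # number of missing values in each column
    n_missing  = vapply(x, function(v) sum(is.na(v)), numeric(1)),
    # number of distinct observed values (2 suggests a binary variable)
    n_distinct = vapply(x, function(v) length(unique(v[!is.na(v)])), numeric(1)),
    # share of the most common value; close to 1 means extremely unbalanced
    top_share  = vapply(x, function(v) {
      v <- v[!is.na(v)]
      if (length(v) == 0) return(NA_real_)
      max(table(v)) / length(v)
    }, numeric(1)),
    row.names = NULL
  )
}

# Usage (predictors is a hypothetical data frame of the 1936 markers):
# s <- screen_predictors(predictors)
# s[s$n_distinct == 2 & s$top_share > 0.95, ]  # 0.95 is an arbitrary cut-off
```

A compact table of the flagged variables, rather than the full summary, keeps the report within the page limit.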
- [20 Points, 2-3 pages] Modeling PR.Status
- Build a classification model to predict PR.Status. Use classification error as the evaluation criterion.
- You should use at least two different approaches for this task. For details, please see the additional information section.
- You need to provide sufficient information (tables, figures, and descriptions) to demonstrate the model fitting results.
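The classification error used as the criterion for PR.Status is the share of observations whose predicted label differs from the true label. A minimal sketch is given below; the helper name class_error and the 0.5 probability threshold are illustrative assumptions, and the toy vectors are made up for the example.

```r
# Misclassification rate over observations where both truth and prediction
# are available. class_error is a hypothetical helper name.
class_error <- function(truth, pred) {
  ok <- !is.na(truth) & !is.na(pred)
  mean(as.character(truth[ok]) != as.character(pred[ok]))
}

# Toy example: convert predicted probabilities to labels with a 0.5 threshold.
truth <- c("Positive", "Negative", "Positive", "Negative")
prob  <- c(0.9, 0.2, 0.4, 0.6)
pred  <- ifelse(prob > 0.5, "Positive", "Negative")
class_error(truth, pred)  # 2 of the 4 labels are wrong, so the error is 0.5
```

The error should be computed on data not used for fitting (for example, held-out folds), since the training error is usually optimistic.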
- [20 Points, 2-3 pages] Modeling histological.type
- Build a classification model to predict histological.type. Use AUC as the evaluation criterion.
- You should use at least two different approaches for this task, and they should be different from the PR.Status models. For details, please see the additional information section.
- You need to provide sufficient information (tables, figures, and descriptions) to demonstrate the model fitting results.
- [20 Points, 2-3 pages] Variable selection for all outcomes
- The goal of this part is to address a practical question: can we select a small set of biomarkers that can accurately predict all four outcomes?
- You need to select a total of 50 variables, and build models using only these variables to predict all four outcomes.
- The evaluation criterion is the three-fold cross-validated AUC for each outcome, averaged over all four outcomes. You must generate the fold ID (for all 705 observations) using the following code: set.seed(1); sample(1:3, 705, replace = TRUE). Please note that even if you decide to remove any observation for a particular outcome, the fold ID for the remaining observations will not change.
- The cross-validated AUC will be compared across the entire class. 5% of the total score will be assigned based on the result of this competition.
- You can consider reading relevant papers for this task to help guide your variable selection procedure. This means that your final model does not need to be completely data driven. It can be partially knowledge driven. If you do so, please clearly document your procedure, and you should also mention these papers in the literature review.
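The evaluation for the variable selection part can be sketched in base R as follows. The fold IDs are generated with the required code; everything else is an assumption for illustration: the helper names auc, cv_auc, fit_glm, and pred_glm are hypothetical, logistic regression is only a placeholder model, and averaging the three per-fold AUCs is one reading of "cross-validated AUC" (pooling the out-of-fold predictions before computing a single AUC is another).

```r
# Fold IDs exactly as required, generated once for all 705 observations.
set.seed(1)
fold_id <- sample(1:3, 705, replace = TRUE)

# AUC via the rank-sum (Mann-Whitney) identity; y is 0/1, score is numeric.
auc <- function(y, score) {
  n1 <- sum(y == 1)
  n0 <- sum(y == 0)
  r <- rank(score)  # average ranks handle ties
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Three-fold CV AUC for one outcome.
# x: data frame of the selected predictors (705 rows)
# y: 0/1 outcome, NA for observations removed for this outcome
cv_auc <- function(x, y, fold_id, fit_fun, pred_fun) {
  keep <- !is.na(y)  # removing rows leaves the other fold IDs unchanged
  aucs <- sapply(1:3, function(k) {
    train <- keep & fold_id != k
    test  <- keep & fold_id == k
    fit <- fit_fun(x[train, , drop = FALSE], y[train])
    auc(y[test], pred_fun(fit, x[test, , drop = FALSE]))
  })
  mean(aucs)
}

# Placeholder model: plain logistic regression on the selected variables.
fit_glm  <- function(x, y) glm(y ~ ., data = cbind(x, y = y), family = binomial)
pred_glm <- function(fit, x) predict(fit, newdata = x, type = "response")

# Usage (x50 and outcomes are hypothetical: the 50 selected predictors and a
# list of the four 0/1 outcome vectors):
# mean(sapply(outcomes, function(y) cv_auc(x50, y, fold_id, fit_glm, pred_glm)))
```

Indexing the training and test rows with the original 705-length fold_id, rather than regenerating folds after dropping observations, is what keeps the fold assignment of the remaining observations unchanged.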
- [15 Points] General requirements
- Is your report easy to read with clear logic?
- Is it written in a manner such that a reader does not need to be very familiar with the data?
- Are plots, tables, etc. informative and correctly displayed and compact enough? For example, you should not include a super long table or a huge figure that takes a whole page.
- Are irrelevant or trivial code and output hidden? Overall, no more than 1/4 of the space should be used for displaying code.
Self-proposed Project Presentation (Fall 2021)
- Zoom Link (same as Tue afternoon office hour)
- Dec 1, 10:30 - 11:45AM [Day 1 Video Recording]
- Title: Penalized Regression for Dietary Patterns Extraction/Observation weights considerations
- by Christian Maino Vieytes
- Title: Classification of Scientific Research Articles into Topics and Subtopics
- by Manish Kumar and Kumar Shubham
- Title: We Hear You: How do State Media Pay Attention to Online Public Opinion?
- by Lucie Lu and Xiaofan Shen
- Title: Salary prediction from college data
- Dec 2, 7 - 8:30PM [Day 2 Video Recording]
- Title: House Price Prediction in Ames, Iowa with Regression
- by Ruoqi Song, Jianing Shen and Tongyao Zhu
- Title: Cryptocurrency prediction using limit order book
- Title: Predict gun violence probability and classify its cause
- by Tian Ni, Yizhe He and Erh Hsuan Wang
- Dec 6, 3 - 4:30PM [Day 3 Video Recording]
- Title: Popularity prediction of online articles
- by Chanyeong Choi, Aakansha Singh and Yizhen Jia
- Title: Classify hand gesture in sign language to letters in English language
- by Sharvi Tomar, Shubham Mehta and Anushree
- Title: Classification of Tones Produced in Quiet and Noise
- Title: Family influence on education expectation
- by Yiyu Liu and Tianying Cai
- Title: To predict the probability of transaction being fraudulent
- by Sriyella Marreddy, Umesh Karamchandani and Maulishree Gupta