Project Overview
Self-proposed project presentations
Final Project: Food Data Analysis
- Timeline
- Data
- Analysis Goal
- This analysis has two main goals. For each problem, read carefully which variables should be used for the task.
- [Problem 1] Clustering analysis of food nutrition
- For this analysis, we mainly use the nutrient variables to form clusters, then interpret/validate the clustering results with other variables.
- Nutrient variables have names ending in _100g. Two exceptions, as far as I know, are nutrition_score_fr_100g and nutrition_score_uk_100g, which carry the suffix but are scores rather than nutrients.
- You do not have to use all nutrient variables; the choice depends on your research question. For example, if I were only interested in variables related to fat, I could restrict my analysis to that subset.
- To validate your clustering results, you need to define your own research question. For example, if my research question is to understand the heterogeneity of fat intake across different countries, then I should compare the clustering results with the country indicator.
- [Problem 2] Supervised learning to predict nutrition scores using text information
- For this problem, you need to use text information as the predictors. There is a lot of text information, such as additives, ingredients_text, and a variety of labels. You are free to choose whatever text information you like, as long as you include both additives and ingredients_text. However, please note that nutrition_grade_fr should not be used because it is directly related to the outcome. I might have missed other variables like this. If so, they will be announced on CompassWire.
- How to extract predictors from text information is a key step in this analysis. You need to perform a brief literature review on this topic.
- The outcome variable is nutrition_score_fr_100g (this should be almost identical to nutrition_score_uk_100g, so we will just use one). Choose either regression or classification.
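The variable rules for the two problems can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the header list below is hypothetical (the real file has many more columns), the helper names are made up for this example, and the word-count tokenizer is just one simple way to turn text into predictors, not a required method.

```python
import re
from collections import Counter

# Hypothetical header row for illustration; the real data has many more columns.
columns = [
    "product_name", "countries", "additives", "ingredients_text",
    "nutrition_grade_fr", "fat_100g", "saturated_fat_100g", "sugars_100g",
    "nutrition_score_fr_100g", "nutrition_score_uk_100g",
]

# Columns that end in _100g but are scores, not nutrients.
SCORE_COLUMNS = {"nutrition_score_fr_100g", "nutrition_score_uk_100g"}
# Text columns that must be included, and columns that must not be used.
REQUIRED_TEXT = {"additives", "ingredients_text"}
EXCLUDED_PREDICTORS = {"nutrition_grade_fr"}


def nutrient_columns(names):
    """Problem 1: keep columns ending in _100g, dropping the two score columns."""
    return [c for c in names if c.endswith("_100g") and c not in SCORE_COLUMNS]


def check_text_predictors(chosen):
    """Problem 2: verify a chosen set of text columns against the stated rules."""
    chosen = set(chosen)
    missing = REQUIRED_TEXT - chosen
    banned = EXCLUDED_PREDICTORS & chosen
    if missing:
        raise ValueError(f"missing required text columns: {sorted(missing)}")
    if banned:
        raise ValueError(f"columns tied to the outcome: {sorted(banned)}")
    return sorted(chosen)


def bag_of_words(text):
    """Count lowercase ASCII word tokens. Accented characters split tokens,
    so non-English ingredient lists may need a richer tokenizer."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


nutrients = nutrient_columns(columns)
text_columns = check_text_predictors(["additives", "ingredients_text"])
counts = bag_of_words("Sugar, palm oil, cocoa butter, sugar")
print(nutrients)        # ['fat_100g', 'saturated_fat_100g', 'sugars_100g']
print(counts["sugar"])  # 2
```

Word counts like these are only a starting point; the literature review may suggest other representations.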
- Important Notes
- Understand the data and the background, and review the literature. This is a very important step before your data analysis, especially for the text part.
- The data set is very large and may pose computational challenges. You are allowed to analyze a subset of it. However, the training data set should contain no fewer than 10K observations. You should also set aside a validation data set. It should not be used for training or tuning parameters, but only for evaluating your models.
- Missing values can be another challenge, depending on your training data. Address them if needed.
- Present/interpret the results. Since everyone should be able to understand the meaning of these variables, you should present and interpret the results in an intuitive way.
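The subsetting and validation rule above can be sketched as an index split. This is a minimal example, not a prescribed procedure: the function name, the seed, and the row counts are assumptions chosen for illustration. The only constraints taken from the notes are that the training set has at least 10K observations and that validation rows are never used for training or tuning.

```python
import random


def split_train_validation(n_rows, n_validation, min_train=10_000, seed=542):
    """Return (train_indices, validation_indices) for a subset of n_rows rows.

    Raises ValueError if the training part would fall below min_train.
    """
    n_train = n_rows - n_validation
    if n_train < min_train:
        raise ValueError(
            f"training set would have {n_train} rows; need at least {min_train}"
        )
    indices = list(range(n_rows))
    # A fixed seed keeps the split reproducible across runs.
    random.Random(seed).shuffle(indices)
    validation = indices[:n_validation]
    train = indices[n_validation:]
    return train, validation


# Hypothetical sizes: a 15,000-row subset with 3,000 rows held out.
train_idx, valid_idx = split_train_validation(n_rows=15_000, n_validation=3_000)
print(len(train_idx), len(valid_idx))  # 12000 3000
```

Any parameter tuning (for example, cross-validation) would then use only the rows in `train_idx`, leaving `valid_idx` untouched until the final evaluation.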
- Submission Guideline
- Pick one team member as the team lead to submit the full report and supplementary files to his/her submission portal (compass2g). On the cover page of the report, you should clearly list all of your team members (their names and NetIDs).
- The full report should be in .pdf format. It should include a cover page and no more than 12 pages of content. Name your file the same way as the homework, e.g., Project_yourNetID.pdf.
- Other team members should submit a one-page file (the cover page), which indicates only the members of the team and their team lead. DO NOT let multiple team members submit the same/different final reports. Doing so will lead to a 5-point penalty for all of your team members, and if those reports are different, the lowest score will be used.
Project Report Requirement
- [5 Points, 1/2 page] Project description and summary. This part should summarize your goal, approach, and conclusion.
- [10 Points, 1-2 pages] Literature review. Read 1-2 relevant papers related to Open Food Facts and 2-3 papers related to text data analysis. Briefly summarize their approaches and findings. Highlight any approach/idea that you borrowed from them.
- [10 Points, 1-2 pages] Data Processing and Summary Statistics. Provide detailed information about how your data is processed. This includes subsetting the data, missing values, text data, etc. Describe how your training and validation data are defined. After processing the data, (selectively) provide summary statistics about your variables.
- [30 Points, 3-4 pages] Unsupervised learning. The goal of unsupervised learning is usually to understand the data and, if possible, discover potential clusters.
- Read [Problem 1] carefully.
- Define your research question. However, you are not allowed to use the “fat intake across countries” example that I provided.
- You need to consider at least 3 different clustering algorithms and compare their performances.
- Please note that there is not necessarily a “best” approach. Your objective is to demonstrate your findings to the readers. This should involve presenting the results using tables/figures and interpreting them in words.
- [30 Points, 3-4 pages] Supervised Learning. You should model nutrition_score_fr_100g using (only) text information which describes ingredients and processing of the food.
- Read [Problem 2] carefully.
- The nutrition_score_fr_100g is an ordinal variable, so you can consider either regression or classification (or both), as long as you consider at least 3 different supervised algorithms. For classification, you may also dichotomize it or define several categories.
- Tune parameters properly. Display, interpret, and compare the performance of the algorithms.
- [15 Points] General requirements
- Is your report easy to read with clear logic?
- Is it written in a manner such that a reader does not need to be very familiar with the data?
- Are plots, tables, etc. informative and correctly displayed and compact enough? For example, you should not include a super long table that takes a whole page.
- Is irrelevant code/output hidden? Overall, no more than 1/4 of the space should be used for displaying code.
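For the unsupervised part, comparing cluster assignments with a variable that was not used for clustering can be summarized in a compact table. The sketch below cross-tabulates cluster labels against an external category and computes purity (the share of observations that fall in their cluster's majority category). Purity is only one possible external measure and is not prescribed by this assignment; the cluster labels and categories shown are made-up toy values.

```python
from collections import Counter, defaultdict


def cross_tab(clusters, labels):
    """Count how often each external label appears within each cluster."""
    table = defaultdict(Counter)
    for cluster, label in zip(clusters, labels):
        table[cluster][label] += 1
    return table


def purity(clusters, labels):
    """Fraction of observations belonging to the majority label of their cluster."""
    table = cross_tab(clusters, labels)
    majority_total = sum(max(row.values()) for row in table.values())
    return majority_total / len(clusters)


# Toy data: cluster labels from some algorithm and a hypothetical external category.
clusters = [0, 0, 0, 1, 1, 1]
categories = ["dairy", "dairy", "snack", "snack", "snack", "dairy"]

table = cross_tab(clusters, categories)
for cluster in sorted(table):
    print(cluster, dict(table[cluster]))
print(round(purity(clusters, categories), 3))  # 0.667
```

Running the same summary for each of the clustering algorithms gives one short table per method, which is easier to compare and interpret than long raw output.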