####################################################################
### American Gut Dataset for Stat 542 Project ###

Sample Size: 9511

1 Unique subject id variable: sample_id

2 Demographic variables: Race/Sex

4 Response variables:
  -Regression: BMI, Weight
  -Classification: BMI_category, Alcohol_comsumption_frequency
  -The category "not provided" indicates missing values; large BMIs are also treated as missing data.

32954 Species level microbes from fecal samples:
  -Features are compositional
  -Most features are NOT SPECIFIED at the species level ("-unspecified" at the end of the feature name). This requires data cleaning to uniquify the duplicated unspecified species features.
  -Feature Density Table (columns give the share of subjects in which a feature is present):

                       0%       <1%    1%-10%   10%-50%      >50%
    no. of features     2     32113      1770       504       204
    % of features  0.0058   92.8309    5.1166    1.4569    0.5897

   i.e. ~5.1% (1770) of the features are present in 1%-10% of the subjects.
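
The cleaning steps mentioned above can be sketched in plain Python. This is a minimal illustration, not the course's reference code. The helper names (`uniquify_names`, `clean_response`, `prevalence`, `density_table`) and the sample feature names are hypothetical. The sketch assumes that a feature counts as "present" when its abundance is greater than zero, that duplicated names are made unique with a numeric suffix, and that the bin edges fall as written in the comments. The cutoff for "large" BMIs is not stated here, so it is left out.

```python
from collections import Counter

MISSING_LABEL = "not provided"  # missing-value marker named in this README


def uniquify_names(names):
    """Append a running index (_1, _2, ...) to every name that occurs more than once.

    Assumption: the suffix scheme is illustrative; any scheme that yields
    unique column names would serve the same purpose.
    """
    totals = Counter(names)
    seen = Counter()
    unique = []
    for name in names:
        if totals[name] > 1:
            seen[name] += 1
            unique.append(f"{name}_{seen[name]}")
        else:
            unique.append(name)
    return unique


def clean_response(value):
    """Map the "not provided" category to None; return other values unchanged."""
    return None if value == MISSING_LABEL else value


def prevalence(abundances):
    """Fraction of subjects in which a feature is present (abundance > 0, an assumption)."""
    return sum(1 for a in abundances if a > 0) / len(abundances)


def density_table(prevalences):
    """Count features per prevalence bin, using the bin labels of the Feature Density Table.

    Assumed bin edges: 0 | (0, 0.01) | [0.01, 0.10) | [0.10, 0.50] | (0.50, 1].
    """
    bins = {"0%": 0, "<1%": 0, "1%-10%": 0, "10%-50%": 0, ">50%": 0}
    for p in prevalences:
        if p == 0:
            bins["0%"] += 1
        elif p < 0.01:
            bins["<1%"] += 1
        elif p < 0.10:
            bins["1%-10%"] += 1
        elif p <= 0.50:
            bins["10%-50%"] += 1
        else:
            bins[">50%"] += 1
    return bins


if __name__ == "__main__":
    # Hypothetical feature names; the real ones come from the data file.
    demo_names = ["GenusA-unspecified", "GenusB-known", "GenusA-unspecified"]
    print(uniquify_names(demo_names))
    print(clean_response("not provided"))
    print(density_table([prevalence([0, 0.2, 0, 0.8]), 0.0, 0.005]))
```

Because the features are compositional, `prevalence` only looks at whether a value is non-zero and does not compare abundances across subjects.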