🍽️ Can Nutrition Predict Recipe Ratings?
Author: Lucero Toral
Introduction
In this report, I will be analyzing data of recipes and their respective ratings from food.com. The scraped data contains relevant information about the recipes such as preparation time, relevant tags, nutritional information (total calories, total fat (PDV), sugar (PDV), sodium (PDV), protein (PDV), saturated fat (PDV)), recipe steps, and a user-submitted description.
In addition, the scraped data also includes user rating information for each recipe and user-submitted reviews.
The size of each dataset is as follows:
- Recipes: 83,782 rows Ă— 12 columns
- Interactions: 731,927 rows Ă— 5 columns
Having gained an interest in cooking, I am most interested in exploring the nutritional relationship between recipes, ingredients, and user ratings.
In this analysis, I will attempt to answer the question, “Can we predict how favorable a recipe is by its nutritional information?” This is done by predicting user ratings using the nutritional and tag features of the datasets.
Data Cleaning and Exploratory Data Analysis
Data Cleaning
The datasets are cleaned by merging them by recipe ID and by calculating the average rating for each recipe.
Name | ID | Time (min) | Contrib ID | Tags | Nutrition | # Steps | Steps (First 100 chars) | Description (First 100 chars) | Ingredients (First 100 chars) | # Ingred | User ID | Recipe ID | Date | Rating | Review (First 100 chars) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 Brownies | 333281 | 40 | 985201 | Desserts, Snacks | [138, 10, 50…] | 10 | Heat oven to 350F, line baking dish… | The most chocolatey, moist, fudgy… | Chocolate, Butter, Eggs, Sugar… | 9 | 386585 | 333281 | 2008-11-19 | 4 | These were pretty good, but took f… |
Canada Cookies | 453467 | 45 | 1848091 | Cookies, Canada | [595, 46, 211…] | 12 | Preheat oven, sift flours, blend… | Best school cafeteria cookies… | Sugar, Butter, Eggs, Vanilla… | 11 | 424680 | 453467 | 2012-01-26 | 5 | Originally I was gonna cut the re… |
Broccoli Casserole | 306168 | 40 | 50969 | Casserole, Side | [194, 20, 6…] | 6 | Preheat oven to 350, mix all ingr… | Broccoli casserole inspired by… | Broccoli, Cheese, Onions, Milk… | 9 | 29782 | 306168 | 2008-12-31 | 5 | This was one of the best broccoli… |
Pound Cake | 286009 | 120 | 461724 | Cakes, Dessert | [878, 63, 326…] | 7 | Grease pan, cream butter, add flo… | Super rich Southern pound cake… | Butter, Sugar, Eggs, Flour… | 7 | 678234 | 286009 | 2009-05-13 | 5 | Why a millionaire pound cake? Be… |
Meatloaf | 475785 | 90 | 2202916 | Dinner, Meat | [267, 30, 12…] | 17 | Pan fry bacon, mince onion, mix … | Mediterranean-inspired meatloaf… | Meat, Bacon, Cheese, Onions… | 13 | 998231 | 475785 | 2012-03-06 | 5 | Ready, set, cook! Special edition… |
Univariate Analysis
In the first figure, we see the distribution of the number of ingredients in the recipes. This represents the complexity of each recipe, which may influence its perceived flavor profile, as well as the overall nutritional values of each recipe.
The second figure shows the distribution of average user rating for the recipes. Note that distribution is skewed far to the higher rating, while having a notable amount of zero ratings. This distibution may suggest that the lowest rated recipes may have distinct features within the data.
Bivariate Analysis
For the third figure, I identify recipes that contain the tag of “healthy” and plot their average rating distributions using violin plots. As we can see, the proportion of 5-star reviews within the data without a “healthy” tag is slightly higher than the distribtion with the tag. This may suggest a high proportion of sweets, desserts, fried, or other recognonized unhealthy foods contributing to high ratings.
Interesting Aggregates
is_healthy | count | mean | std | min | 25% | 50% | 75% | max |
---|---|---|---|---|---|---|---|---|
False | 70051 | 4.3607 | 1.09391 | 0 | 4 | 5 | 5 | 5 |
True | 13730 | 4.32785 | 1.08369 | 0 | 4 | 5 | 5 | 5 |
In the table above, I grouped the rating statistics by whether the recipe is tagged “healthy”. The tagged “healthy” recipes have a slighty lower mean rating with a smaller standard deviation to the recipes without the tag.
rating_high_low | calories | protein | sugar |
---|---|---|---|
low | 444.786 | 34.5054 | 71.6494 |
high | 423.548 | 32.5444 | 67.3829 |
Here, I grouped the data by “high” and “low” average ratings and show the average nutritional values of calories, protein, and sugar. The “high” ratings correspond to values of >=4, while the “low” ratings correpond to <4. The lower rated recipes demonstrate slightly higher average calories and sugar level than the highly rated recipes
Imputation
Feature | Missing Values |
---|---|
name |
1 |
id |
0 |
minutes |
0 |
contributor_id |
0 |
submitted |
0 |
tags |
0 |
nutrition |
0 |
n_steps |
0 |
steps |
0 |
description |
114 |
ingredients |
0 |
n_ingredients |
0 |
user_id |
1 |
recipe_id |
1 |
date |
1 |
rating |
1 |
review |
58 |
This table shows the number of missing values for each column in our combined dataset. We are primarily interest in minutes, tags, nutrition, and rating features.
Since there is only one missing data point in the ratings column, imputation isn’t necessary. However, for simplicity we will mean impute the single missing data.
Framing a Prediction Problem
We will train a model to predict recipe rating using the information of nutrition, preparation time, submitted tags, and # of steps from the datasets.
This a regression problem due to the recipes’ average ratings being continous between 0 and 5. (i.e. scores containing off integer values 4.33, 2.57, etc…)
The evaluation metric used is MAE (Mean Absolute Error), this was chosen over other metrics due to its robustness to outliers and easy interpretation.
Baseline Model
For a baseline model, we train a Linear Regression model that takes in quantitative and nominal features to predict the average user rating.
The selected quantitative features are:
Feature | Description |
---|---|
minutes |
Time (in minutes) to prepare the recipe. |
calories |
Total calories in the recipe. |
total_fat |
Total fat content in the recipe. |
sugar |
Total sugar content in the recipe. |
sodium |
Total sodium content in the recipe. |
protein |
Total protein content in the recipe. |
saturated_fat |
Total saturated fat content. |
carbohydrates |
Total carbohydrate content in the recipe. |
These features represent the preparation complexity and nutrional information of the recipes.
The selected catagorical feature:
Feature | Description |
---|---|
tags_str |
List of tags (like “healthy”, “lunch”, etc.) |
No ordinal features were selected.
The quantitative features were preprocessed with scikit’s StandardScaler, while the catagorical features are encoded with OneHotEncoder, to represent the tags as binary features.
The model performed with a MAE score of 0.760, meaning the prediction is on average off by 0.760. Since the potential score range of 0 - 5, the model is not terribly off but considering Figure 2 that majority ratings are between 4 and 5, the model may not be identifying the lower scores.
Final Model
For the final model, we engineered the following features:
Feature Name | How It Was Created | Why It’s Useful |
---|---|---|
log_minutes |
log1p(minutes) (log-transformed) |
Log transformation reduces skew |
sodium_calorie_ratio |
sodium / (calories + 1e-5) |
How much sodium there is relative to total calories, which may influence taste (salty food can have higher or lower ratings). |
total_fat_calorie_ratio |
total_fat / (calories + 1e-5) |
Fatty foods like desserts may be rated differently than leaner recipes. |
sugar_calorie_ratio |
sugar / (calories + 1e-5) |
Tracks how much of the recipe’s calories come from sugar. |
protein_calorie_ratio |
protein / (calories + 1e-5) |
Protein may be important to health-conscious users. |
saturated_fat_calorie_ratio |
saturated_fat / (calories + 1e-5) |
Helps identify recipes with a high proportion of “bad fats” |
carbohydrates_calorie_ratio |
carbohydrates / (calories + 1e-5) |
Indicates how “carb-dense” the recipe is. (e.g., high-carb desserts vs. low-carb meals). |
The preparation time feature was log scaled to reduce the influence of significant outliers.
The nutritional features were all scaled with the recipe’s total calories. This was chosen because total calories is correlated to nutritional values for each recipe. Therefore, these nutrition ratios provide a better metric of the healthiness each recipe. It will also help highlight the percieved tastes of the recipes, as “saltier” foods would have a higher sodium ratio, “sweeter” foods have a higher sugar to calorie ratio.
We choose to implement the LASSO regression, due to the many catagorical tag features from the OnehotEncoding. LASSO should learn to disregard unimportant features from irrelevant tags.
We performed a gridsearch of the alpha hyperparameter, with the possible values of [0.01, 0.1, 1]. In addition, we perform cross-validation with 3-folds to validate the optimal alpha value.
The hyperparameter that performed best was alpha = 0.01, which demonstrated an MAE value of 0.756.
The final model with the additional feature engineering is a small improvement over the base line. (MAE 0.756 vs. MAE 0.760) Further improvement to the existing model can be done by gridsearching over more hyperparameters and increasing the max iterations from the default 1000 intereations.
Additionally, we have assumed a linear relationship to predict the average ratings. Training a non-linear model may produce better results.