
🍽️ Can Nutrition Predict Recipe Ratings?

Author: Lucero Toral


Introduction

In this report, I analyze recipes and their ratings from food.com. The scraped data contains information about each recipe, such as preparation time, tags, nutritional information (total calories, total fat (PDV), sugar (PDV), sodium (PDV), protein (PDV), saturated fat (PDV), carbohydrates (PDV)), recipe steps, and a user-submitted description.

In addition, the scraped data also includes user rating information for each recipe and user-submitted reviews.

The size of each dataset is as follows:

Having gained an interest in cooking, I am most interested in exploring the nutritional relationship between recipes, ingredients, and user ratings.

In this analysis, I will attempt to answer the question, “Can we predict how favorable a recipe is by its nutritional information?” This is done by predicting user ratings using the nutritional and tag features of the datasets.

Data Cleaning and Exploratory Data Analysis

Data Cleaning

The datasets are cleaned by merging them by recipe ID and by calculating the average rating for each recipe.

| Name | ID | Time (min) | Contrib ID | Tags | Nutrition | # Steps | Steps (First 100 chars) | Description (First 100 chars) | Ingredients (First 100 chars) | # Ingred | User ID | Recipe ID | Date | Rating | Review (First 100 chars) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 Brownies | 333281 | 40 | 985201 | Desserts, Snacks | [138, 10, 50…] | 10 | Heat oven to 350F, line baking dish… | The most chocolatey, moist, fudgy… | Chocolate, Butter, Eggs, Sugar… | 9 | 386585 | 333281 | 2008-11-19 | 4 | These were pretty good, but took f… |
| Canada Cookies | 453467 | 45 | 1848091 | Cookies, Canada | [595, 46, 211…] | 12 | Preheat oven, sift flours, blend… | Best school cafeteria cookies… | Sugar, Butter, Eggs, Vanilla… | 11 | 424680 | 453467 | 2012-01-26 | 5 | Originally I was gonna cut the re… |
| Broccoli Casserole | 306168 | 40 | 50969 | Casserole, Side | [194, 20, 6…] | 6 | Preheat oven to 350, mix all ingr… | Broccoli casserole inspired by… | Broccoli, Cheese, Onions, Milk… | 9 | 29782 | 306168 | 2008-12-31 | 5 | This was one of the best broccoli… |
| Pound Cake | 286009 | 120 | 461724 | Cakes, Dessert | [878, 63, 326…] | 7 | Grease pan, cream butter, add flo… | Super rich Southern pound cake… | Butter, Sugar, Eggs, Flour… | 7 | 678234 | 286009 | 2009-05-13 | 5 | Why a millionaire pound cake? Be… |
| Meatloaf | 475785 | 90 | 2202916 | Dinner, Meat | [267, 30, 12…] | 17 | Pan fry bacon, mince onion, mix … | Mediterranean-inspired meatloaf… | Meat, Bacon, Cheese, Onions… | 13 | 998231 | 475785 | 2012-03-06 | 5 | Ready, set, cook! Special edition… |
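The merge and per-recipe averaging described in this section could look roughly like the sketch below. The toy frames and the names `recipes`, `interactions`, and `avg_rating` are assumptions for illustration, not the actual scraped files or the exact code used in this analysis.

```python
import pandas as pd

# Toy stand-ins for the two scraped tables (hypothetical values).
recipes = pd.DataFrame({
    "id": [1, 2, 3],
    "name": ["brownies", "cookies", "meatloaf"],
})
interactions = pd.DataFrame({
    "recipe_id": [1, 1, 2],
    "rating": [4, 5, 5],
})

# A left merge keeps recipes that have no reviews; their rating becomes NaN.
merged = recipes.merge(
    interactions, how="left", left_on="id", right_on="recipe_id"
)

# Average rating per recipe, repeated on every review row of that recipe.
merged["avg_rating"] = merged.groupby("id")["rating"].transform("mean")

print(merged[["id", "name", "rating", "avg_rating"]])
```

A left merge is one plausible choice here because it preserves recipes without any review, which then show up as missing values in the review columns.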

Univariate Analysis

In the first figure, we see the distribution of the number of ingredients in the recipes. This represents the complexity of each recipe, which may influence its perceived flavor profile, as well as the overall nutritional values of each recipe.

The second figure shows the distribution of average user ratings for the recipes. Note that the distribution is heavily skewed toward high ratings, while also containing a notable number of zero ratings. This distribution suggests that the lowest-rated recipes may have distinct features within the data.

Bivariate Analysis

For the third figure, I identify recipes that carry the “healthy” tag and plot the average rating distributions with and without it using violin plots. The proportion of 5-star ratings among recipes without the “healthy” tag is slightly higher than among recipes with the tag. This may suggest that sweets, desserts, fried foods, or other foods recognized as unhealthy contribute a high proportion of the high ratings.

Interesting Aggregates

is_healthy count mean std min 25% 50% 75% max
False 70051 4.3607 1.09391 0 4 5 5 5
True 13730 4.32785 1.08369 0 4 5 5 5

In the table above, I grouped the rating statistics by whether the recipe is tagged “healthy”. The recipes tagged “healthy” have a slightly lower mean rating (4.328 vs. 4.361) and a slightly smaller standard deviation (1.084 vs. 1.094) than the recipes without the tag.
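A summary table of this shape can be produced with a boolean flag and `groupby(...).describe()`. The sketch below is a minimal illustration on toy data; it assumes the tags are already parsed into Python lists and that the per-recipe rating lives in a column named `avg_rating`, both of which are assumptions.

```python
import pandas as pd

# Toy data; tags are assumed to be parsed into lists already.
df = pd.DataFrame({
    "tags": [["healthy", "lunch"], ["desserts"], ["healthy"], ["dinner", "meat"]],
    "avg_rating": [4.0, 5.0, 4.5, 3.0],
})

# Flag recipes whose tag list contains the literal tag "healthy".
df["is_healthy"] = df["tags"].apply(lambda tags: "healthy" in tags)

# count, mean, std, min, quartiles, and max of the rating for each group.
summary = df.groupby("is_healthy")["avg_rating"].describe()
print(summary)
```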

rating_high_low calories protein sugar
low 444.786 34.5054 71.6494
high 423.548 32.5444 67.3829

Here, I grouped the data by “high” and “low” average ratings and show the average calories, protein, and sugar for each group. “High” ratings are values >= 4, while “low” ratings are values < 4. The lower-rated recipes show slightly higher average calories, protein, and sugar than the highly rated recipes.
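The high/low split can be expressed with a single threshold at 4 followed by a grouped mean. This is a sketch on made-up numbers; the column name `avg_rating` is an assumption.

```python
import numpy as np
import pandas as pd

# Toy data with hypothetical values.
df = pd.DataFrame({
    "avg_rating": [5.0, 4.0, 3.9, 1.0],
    "calories": [400.0, 450.0, 500.0, 300.0],
    "protein": [30.0, 20.0, 40.0, 10.0],
    "sugar": [60.0, 70.0, 80.0, 50.0],
})

# "high" for ratings >= 4, "low" for everything below 4.
df["rating_high_low"] = np.where(df["avg_rating"] >= 4, "high", "low")

# Mean nutritional values within each rating group.
means = df.groupby("rating_high_low")[["calories", "protein", "sugar"]].mean()
print(means)
```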

Imputation

Feature Missing Values
name 1
id 0
minutes 0
contributor_id 0
submitted 0
tags 0
nutrition 0
n_steps 0
steps 0
description 114
ingredients 0
n_ingredients 0
user_id 1
recipe_id 1
date 1
rating 1
review 58

This table shows the number of missing values for each column in our combined dataset. We are primarily interested in the minutes, tags, nutrition, and rating features.

Since there is only one missing value in the rating column, imputation is not strictly necessary. However, for simplicity we mean-impute the single missing value.
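Counting missing values per column and mean-imputing a rating can be sketched as follows. The data is a toy example with one missing rating, not the real dataset.

```python
import numpy as np
import pandas as pd

# Toy frame with a single missing rating.
df = pd.DataFrame({
    "minutes": [40, 45, 90, 120],
    "rating": [5.0, 4.0, np.nan, 3.0],
})

# Number of missing values in each column.
missing = df.isna().sum()
print(missing)

# Mean imputation: replace the missing rating with the mean of the observed ones.
df["rating"] = df["rating"].fillna(df["rating"].mean())
```

With only one missing value, the imputed entry has a negligible effect on the overall rating distribution.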

Framing a Prediction Problem

We will train a model to predict recipe rating using the nutrition, preparation time, submitted tags, and number of steps from the datasets.

This is a regression problem because each recipe’s average rating is continuous between 0 and 5 (i.e., it can take non-integer values such as 4.33 or 2.57).

The evaluation metric is MAE (Mean Absolute Error). It was chosen over other metrics for its robustness to outliers and its easy interpretation.
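To make the robustness argument concrete, the sketch below computes MAE by hand and compares it with RMSE on a toy set of ratings that includes one zero-rated recipe. RMSE is used here only as an illustrative point of comparison; the numbers are made up.

```python
import numpy as np

def mean_absolute_error(y_true, y_pred):
    """Average of the absolute differences between truth and prediction."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

def root_mean_squared_error(y_true, y_pred):
    """Square root of the average squared difference."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Three typical ratings and one zero rating, with a constant prediction of 4.5.
y_true = [5.0, 5.0, 4.0, 0.0]
y_pred = [4.5, 4.5, 4.5, 4.5]

mae = mean_absolute_error(y_true, y_pred)
rmse = root_mean_squared_error(y_true, y_pred)
print(mae, rmse)
```

Because RMSE squares each error, the single zero-rated recipe dominates it, while MAE weights every error linearly and stays in the same units as the rating.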

Baseline Model

For a baseline model, we train a Linear Regression model that takes in quantitative and nominal features to predict the average user rating.

The selected quantitative features are:

Feature Description
minutes Time (in minutes) to prepare the recipe.
calories Total calories in the recipe.
total_fat Total fat content in the recipe.
sugar Total sugar content in the recipe.
sodium Total sodium content in the recipe.
protein Total protein content in the recipe.
saturated_fat Total saturated fat content.
carbohydrates Total carbohydrate content in the recipe.

These features represent the preparation complexity and nutritional information of the recipes.

The selected categorical feature is:

Feature Description
tags_str List of tags (like “healthy”, “lunch”, etc.)

No ordinal features were selected.

The quantitative features were standardized with scikit-learn’s StandardScaler, while the categorical feature was encoded with OneHotEncoder to represent the tags as binary features.

The model achieved an MAE of 0.760, meaning its predictions are off by 0.760 rating points on average. Given the possible rating range of 0 to 5, this error is not large, but because most ratings fall between 4 and 5 (see the second figure), the model may be failing to identify the lower-rated recipes.
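A baseline of this kind can be assembled with a `ColumnTransformer` inside a scikit-learn `Pipeline`. The sketch below uses a small made-up dataset with only two of the quantitative features; the train/test split, the `handle_unknown="ignore"` setting, and the column name `avg_rating` are assumptions rather than details taken from the analysis.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data (hypothetical values).
df = pd.DataFrame({
    "minutes": [40, 45, 40, 120, 90, 15, 30, 60],
    "calories": [138.0, 595.0, 194.0, 878.0, 267.0, 120.0, 310.0, 450.0],
    "tags_str": ["desserts", "cookies", "healthy", "desserts",
                 "dinner", "healthy", "dinner", "cookies"],
    "avg_rating": [4.0, 5.0, 5.0, 5.0, 5.0, 4.5, 3.0, 4.0],
})

numeric = ["minutes", "calories"]
categorical = ["tags_str"]

# Standardize numeric columns and one-hot encode the tag column.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric),
    # Ignore tag values at prediction time that were not seen in training.
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
])

baseline = Pipeline([
    ("preprocess", preprocess),
    ("model", LinearRegression()),
])

X_train, X_test, y_train, y_test = train_test_split(
    df[numeric + categorical], df["avg_rating"], test_size=0.25, random_state=0
)

baseline.fit(X_train, y_train)
mae = mean_absolute_error(y_test, baseline.predict(X_test))
print(f"MAE: {mae:.3f}")
```

Keeping the preprocessing inside the pipeline means the scaler and encoder are fit only on the training split, which avoids leaking test information into the model.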

Final Model

For the final model, we engineered the following features:

| Feature Name | How It Was Created | Why It’s Useful |
| --- | --- | --- |
| log_minutes | log1p(minutes) (log-transformed) | Log transformation reduces skew |
| sodium_calorie_ratio | sodium / (calories + 1e-5) | How much sodium there is relative to total calories, which may influence taste (salty food can have higher or lower ratings). |
| total_fat_calorie_ratio | total_fat / (calories + 1e-5) | Fatty foods like desserts may be rated differently than leaner recipes. |
| sugar_calorie_ratio | sugar / (calories + 1e-5) | Tracks how much of the recipe’s calories come from sugar. |
| protein_calorie_ratio | protein / (calories + 1e-5) | Protein may be important to health-conscious users. |
| saturated_fat_calorie_ratio | saturated_fat / (calories + 1e-5) | Helps identify recipes with a high proportion of “bad fats” |
| carbohydrates_calorie_ratio | carbohydrates / (calories + 1e-5) | Indicates how “carb-dense” the recipe is (e.g., high-carb desserts vs. low-carb meals). |

The preparation time feature was log-transformed to reduce the influence of extreme outliers.

The nutritional features were all divided by the recipe’s total calories. This was chosen because total calories is correlated with the other nutritional values of each recipe, so the ratios provide a better measure of the healthiness of each recipe than the raw values. The ratios also help highlight the perceived taste of a recipe: “saltier” foods have a higher sodium-to-calorie ratio, and “sweeter” foods have a higher sugar-to-calorie ratio.
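The engineered features in the table above follow directly from their formulas. The helper below is a sketch: the function name `add_engineered_features` and the toy values are hypothetical, while the formulas (`log1p` and the `1e-5` offset) come from the table.

```python
import numpy as np
import pandas as pd

NUTRIENTS = ["sodium", "total_fat", "sugar", "protein",
             "saturated_fat", "carbohydrates"]

def add_engineered_features(df, eps=1e-5):
    """Return a copy of df with log_minutes and the per-calorie ratios added."""
    out = df.copy()
    # log1p(x) = log(1 + x), so a preparation time of 0 maps to 0.
    out["log_minutes"] = np.log1p(out["minutes"])
    for col in NUTRIENTS:
        # The small eps avoids division by zero for zero-calorie recipes.
        out[f"{col}_calorie_ratio"] = out[col] / (out["calories"] + eps)
    return out

# Toy rows, including a zero-calorie recipe (hypothetical values).
toy = pd.DataFrame({
    "minutes": [0, 60],
    "calories": [0.0, 200.0],
    "sodium": [0.0, 10.0],
    "total_fat": [0.0, 20.0],
    "sugar": [0.0, 50.0],
    "protein": [0.0, 8.0],
    "saturated_fat": [0.0, 12.0],
    "carbohydrates": [0.0, 15.0],
})

engineered = add_engineered_features(toy)
print(engineered.filter(like="_ratio"))
```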

We chose LASSO regression because one-hot encoding the tags produces many categorical features. The L1 penalty in LASSO can shrink coefficients to exactly zero, so the model should learn to disregard unimportant features from irrelevant tags.

We performed a grid search over the alpha hyperparameter with candidate values [0.01, 0.1, 1], using 3-fold cross-validation to select the best value.

The hyperparameter that performed best was alpha = 0.01, which demonstrated an MAE value of 0.756.
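A grid search of this form can be written with scikit-learn’s `GridSearchCV`. The sketch below runs on synthetic data rather than the recipe features, and scoring by negative MAE is an assumption chosen to match the evaluation metric of this report.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

# Synthetic regression data: only the first feature carries signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))
y = 4.0 + 0.5 * X[:, 0] + rng.normal(scale=0.1, size=60)

search = GridSearchCV(
    Lasso(max_iter=1000),                 # 1000 is the scikit-learn default
    param_grid={"alpha": [0.01, 0.1, 1]},
    scoring="neg_mean_absolute_error",    # higher is better, so MAE is negated
    cv=3,
)
search.fit(X, y)

best_alpha = search.best_params_["alpha"]
best_mae = -search.best_score_            # flip the sign back to a plain MAE
print(best_alpha, best_mae)
```

In a full pipeline, the `Lasso` estimator would replace the final regression step and the grid key would be prefixed with the step name (for example `model__alpha` if the step is named `model`).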

The final model with the additional feature engineering is a small improvement over the baseline (MAE 0.756 vs. 0.760). The model could be improved further by grid searching over more hyperparameters and increasing the maximum number of iterations from the default of 1000.

Additionally, we have assumed a linear relationship to predict the average ratings. Training a non-linear model may produce better results.