We all like Starbucks food, but what about its nutritional value?
Author: Aditya Ranade
Published: May 1, 2025
Categories: analysis, python
Starbucks is one of the most valuable coffee chains in the world, and a lot of people like to consume the food available there. But how good is it in terms of nutritional value?
I found this dataset on Kaggle, which gives nutritional information about their food products. In my previous post, I built a multiple linear regression model to predict the calories in a beverage based on its nutritional contents. Now we will try to do the same for the food products.
First, we look at the exploratory data analysis, and later we try a simple regression model. Let us access and process the data using Python.
```python
# Load libraries
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
from plotnine import *  # for plots
import numpy as np  # linear algebra
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.formula.api as smf
import random
from scipy.stats import pearsonr

# Get Starbucks data from the GitHub repo
df0 = pd.read_csv("https://raw.githubusercontent.com/adityaranade/starbucks/refs/heads/main/data/starbucks-menu-nutrition-food.csv",
                  encoding='unicode_escape')
df0.head()
```
```
         Unnamed: 0  Calories  Fat (g)  Carb. (g)  Fiber (g)  Protein (g)
0      Chonga Bagel       300      5.0         50          3           12
1      8-Grain Roll       380      6.0         70          7           10
2  Almond Croissant       410     22.0         45          3           10
3     Apple Fritter       460     23.0         56          2            7
4  Banana Nut Bread       420     22.0         52          2            6
```
```python
# Modify the column names
df0.columns = ['name', 'calories', 'fat', 'carbs', 'fiber', 'protein']
df0.head()

# Convert the data type to float for all the columns except name
for i in df0.columns[1:]:
    df0[i] = df0[i].astype("float")
# df0.info()
```
```python
df = df0

# Use the melt function to get long-format data for the histograms
df2 = pd.melt(df, id_vars=['name'])
# df2.head()
```
Now that we have the data ready, let us look at the histogram of each variable, namely calories, fat, carbs, fiber, and protein.
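The melted data frame df2 makes faceted histograms straightforward. A minimal sketch with plotnine (the bin count and theme here are my choices, not from the original post):

```python
# Faceted histograms: one panel per nutritional variable,
# using the long-format data frame df2 created above
p = (
    ggplot(df2, aes(x='value'))
    + geom_histogram(bins=20)
    + facet_wrap('~variable', scales='free')
    + theme_bw()
)
print(p)  # draws the plot
```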
The histograms of the variables do not show any obvious problems; all the plots look reasonable. Next, we look at the correlation plot.
```python
# Check the correlation between the variables
# plt.figure(figsize=(20,7))
sns.heatmap(df.iloc[:, 1:].corr(), annot=True)
plt.show()
```
The correlation plot indicates a positive association between all the variables, which is expected. Now we will look at the pairs plot, which shows the pairwise scatterplots along with the histogram of each variable.
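The exact pairs-plot call is not shown here; one simple way to produce it, assuming the seaborn import from above, is:

```python
# Pairwise scatterplots with histograms on the diagonal,
# skipping the non-numeric name column
sns.pairplot(df.iloc[:, 1:])
plt.show()
```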
The scatterplots of calories against each variable can be seen in the first row of the pairs plot. There seems to be a linear association between calories and fat, carbs, and protein. However, calories do not seem to have a linear association with fiber.
```python
# Split data into training and testing sets
indices = range(len(df))  # create a list of indices

# Get 75% random indices for the training data
random.seed(23)  # for a reproducible example
random_indices = random.sample(indices, round(0.75 * len(df)))

# Training dataset
data_train = df.iloc[random_indices]

# Testing dataset
data_test = df.iloc[df.index.difference(random_indices)]

# Build a multiple linear regression model to predict calories
# from the other variables using the training data
result = smf.ols("calories ~ fat + carbs + fiber + protein",
                 data=data_train).fit()

# Check the summary
result.summary()
```
```
                            OLS Regression Results
==============================================================================
Dep. Variable:               calories   R-squared:                       0.995
Model:                            OLS   Adj. R-squared:                  0.995
Method:                 Least Squares   F-statistic:                     4351.
Date:                Sun, 11 May 2025   Prob (F-statistic):           1.06e-92
Time:                        08:26:44   Log-Likelihood:                -304.54
No. Observations:                  85   AIC:                             619.1
Df Residuals:                      80   BIC:                             631.3
Df Model:                           4
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -2.8394      3.006     -0.945      0.348      -8.821       3.142
fat            8.8760      0.129     68.933      0.000       8.620       9.132
carbs          4.0291      0.066     61.478      0.000       3.899       4.160
fiber         -1.1151      0.402     -2.772      0.007      -1.916      -0.315
protein        4.3474      0.156     27.915      0.000       4.037       4.657
==============================================================================
Omnibus:                        8.990   Durbin-Watson:                   2.042
Prob(Omnibus):                  0.011   Jarque-Bera (JB):               12.887
Skew:                           0.426   Prob(JB):                      0.00159
Kurtosis:                       4.707   Cond. No.                         151.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
```
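The fitted coefficients are close to the familiar energy densities of roughly 9 kcal per gram of fat and 4 kcal per gram of carbs or protein, which is reassuring. As a quick sanity check, we can plug the Chonga Bagel row from the head of the data into the fitted equation (a back-of-the-envelope calculation, not part of the original output):

```python
# Predicted calories for the Chonga Bagel
# (fat 5 g, carbs 50 g, fiber 3 g, protein 12 g)
cal = -2.8394 + 8.8760 * 5 + 4.0291 * 50 - 1.1151 * 3 + 4.3474 * 12
print(round(cal, 1))  # ~291.8, reasonably close to the listed 300
```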
Now let us make predictions on the testing data and look at the observed vs. predicted plot.
```python
# Make predictions using the testing data
predictions = result.predict(data_test)

# Observed vs. predicted plot
plt.figure(figsize=(20, 7))
sns.regplot(y=data_test["calories"], x=predictions,
            line_kws={"color": "red"})
plt.ylabel("Observed calories")
plt.xlabel("Predicted calories")
plt.show()
# decent plot
```
The observed vs. predicted plot looks good. However, the number of data points is low, so we should take this with a grain of salt. Let us check some evaluation metrics like the root mean squared error (RMSE) and mean absolute error (MAE).
```python
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

print("Mean Absolute Error:",
      round(mean_absolute_error(data_test["calories"], predictions), 2))
print("Root Mean Squared Error:",
      round(mean_squared_error(data_test["calories"], predictions) ** 0.5, 2))
```
```
Mean Absolute Error: 7.54
Root Mean Squared Error: 10.7
```
A mean absolute error (MAE) of 7.54 and a root mean squared error (RMSE) of 10.7 are decent and indicate the model is performing fairly well.
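Since r2_score is already imported, one could also report the R-squared on the testing data (a small addition, not shown in the original output):

```python
print("R-squared on test data:",
      round(r2_score(data_test["calories"], predictions), 3))
```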