Predicting calories in cereals based on the nutritional contents
Categories: AI, analysis, python
Author: Aditya Ranade
Published: February 13, 2025
Cereals are commonly consumed for breakfast, but how good are they in terms of nutritional value? Can we predict the calories based on the nutritional contents?
I found a dataset on Kaggle which gives nutritional information about a range of breakfast cereals. First we go through some exploratory data analysis, and later we try some simple regression models. Let us first access and process the data in Python.
```python
# Load libraries
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
from plotnine import *
import numpy as np  # linear algebra
# import statsmodels.api as sm
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt
import seaborn as sns
import random
from scipy.stats import pearsonr

# Get data from the GitHub repo
path = "https://raw.githubusercontent.com//adityaranade//portfolio//refs//heads//main//cereals//cereal.csv"
df0 = pd.read_csv(path, encoding='unicode_escape')
df0.head()
```
|   | name                      | mfr | type | calories | protein | fat | sodium | fiber | carbo | sugars | potass | vitamins | shelf | weight | cups | rating    |
|---|---------------------------|-----|------|----------|---------|-----|--------|-------|-------|--------|--------|----------|-------|--------|------|-----------|
| 0 | 100% Bran                 | N   | C    | 70       | 4       | 1   | 130    | 10.0  | 5.0   | 6      | 280    | 25       | 3     | 1.0    | 0.33 | 68.402973 |
| 1 | 100% Natural Bran         | Q   | C    | 120      | 3       | 5   | 15     | 2.0   | 8.0   | 8      | 135    | 0        | 3     | 1.0    | 1.00 | 33.983679 |
| 2 | All-Bran                  | K   | C    | 70       | 4       | 1   | 260    | 9.0   | 7.0   | 5      | 320    | 25       | 3     | 1.0    | 0.33 | 59.425505 |
| 3 | All-Bran with Extra Fiber | K   | C    | 50       | 4       | 0   | 140    | 14.0  | 8.0   | 0      | 330    | 25       | 3     | 1.0    | 0.50 | 93.704912 |
| 4 | Almond Delight            | R   | C    | 110      | 2       | 2   | 200    | 1.0   | 14.0  | 8      | -1     | 25       | 3     | 1.0    | 0.75 | 34.384843 |
```python
# Modify the column names
df0.columns = ['name', 'manufacturer', 'type', 'calories', 'protein', 'fat',
               'sodium', 'fiber', 'carbohydrates', 'sugar', 'potassium',
               'vitamins', 'shelf', 'weight', 'cups', 'rating']
df0.head()
```
|   | name                      | manufacturer | type | calories | protein | fat | sodium | fiber | carbohydrates | sugar | potassium | vitamins | shelf | weight | cups | rating    |
|---|---------------------------|--------------|------|----------|---------|-----|--------|-------|---------------|-------|-----------|----------|-------|--------|------|-----------|
| 0 | 100% Bran                 | N            | C    | 70       | 4       | 1   | 130    | 10.0  | 5.0           | 6     | 280       | 25       | 3     | 1.0    | 0.33 | 68.402973 |
| 1 | 100% Natural Bran         | Q            | C    | 120      | 3       | 5   | 15     | 2.0   | 8.0           | 8     | 135       | 0        | 3     | 1.0    | 1.00 | 33.983679 |
| 2 | All-Bran                  | K            | C    | 70       | 4       | 1   | 260    | 9.0   | 7.0           | 5     | 320       | 25       | 3     | 1.0    | 0.33 | 59.425505 |
| 3 | All-Bran with Extra Fiber | K            | C    | 50       | 4       | 0   | 140    | 14.0  | 8.0           | 0     | 330       | 25       | 3     | 1.0    | 0.50 | 93.704912 |
| 4 | Almond Delight            | R            | C    | 110      | 2       | 2   | 200    | 1.0   | 14.0          | 8     | -1        | 25       | 3     | 1.0    | 0.75 | 34.384843 |
```python
# Select the variables for the histograms
df = df0[["calories", "protein", "fat", "sodium", "fiber",
          "carbohydrates", "sugar", "potassium", "name"]]
df.head()

# Use the melt function to reshape the data for faceted histograms
df2 = pd.melt(df, id_vars=['name'])
# df2.head()
```
Now that we have the data ready, let us look at the histogram of each variable, namely the nutritional contents: calories, protein, fat, sodium, fiber, carbohydrates, sugar and potassium.
```python
p = (
    ggplot(df2, aes("value"))
    + geom_histogram(bins=10)
    + facet_grid(". ~ variable", scales='free_x')
    + theme(figure_size=(12, 3))
)
# If we want the density on the y axis
# p = (
#     ggplot(df2, aes("value", after_stat("density")))
#     + geom_histogram(bins=10)
#     + facet_grid(". ~ variable", scales='free_x')
#     + theme(figure_size=(12, 3))
# )
p.show()
```
The histograms of the variables do not show any obvious problems; all the plots look reasonable. Next we look at the correlation plot.
```python
# Check the correlation between the variables
plt.figure(figsize=(20, 10))
sns.heatmap(df.iloc[:, :-1].corr(), annot=True, cmap="viridis")
plt.show()
```
The calories variable has a clear positive correlation with all the variables except fiber and potassium. This seems logical and will be useful when we build a regression model. Next we take a look at the pairs plot, which gives us an idea of the relationship between each pair of variables. Most important from the point of view of prediction is the first row, where calories is on the y axis and each of the other variables is on the x axis.
The scatterplots of each variable against calories can be seen in the first row of the pairs plot. There appears to be a linear association between calories and fat, carbohydrates and protein. However, there does not seem to be a linear association with fiber.
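The code used to produce the pairs plot is not shown in the post; a minimal sketch using seaborn, assuming the `df` dataframe defined above, would be:

```python
# Pairs plot of calories against the other nutritional variables
# (a sketch; the original plotting code is not shown in the post).
# The first row has calories on the y axis, matching the discussion above.
sns.pairplot(df.drop(columns=["name"]))
plt.show()
```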
```python
# Split data into training and testing sets
indices = range(len(df))  # create a list of indices

# Get 75% random indices
random.seed(23)  # for a reproducible example
random_indices = random.sample(indices, round(0.75 * len(df)))

# Training dataset
data_train = df.iloc[random_indices, :-1]
# Testing dataset
data_test = df.iloc[df.index.difference(random_indices), :-1]

# Build a multiple linear regression model to predict calories
# from the other variables using the training data
result = smf.ols("calories ~ protein + fat + sodium + fiber + carbohydrates + sugar + potassium",
                 data=data_train).fit()
# Check the summary
result.summary()
```
OLS Regression Results

|                   |                  |                     |          |
|-------------------|------------------|---------------------|----------|
| Dep. Variable:    | calories         | R-squared:          | 0.783    |
| Model:            | OLS              | Adj. R-squared:     | 0.753    |
| Method:           | Least Squares    | F-statistic:        | 25.78    |
| Date:             | Sun, 03 Aug 2025 | Prob (F-statistic): | 1.59e-14 |
| Time:             | 20:33:23         | Log-Likelihood:     | -205.34  |
| No. Observations: | 58               | AIC:                | 426.7    |
| Df Residuals:     | 50               | BIC:                | 443.2    |
| Df Model:         | 7                |                     |          |
| Covariance Type:  | nonrobust        |                     |          |

|               | coef    | std err | t      | P>\|t\| | [0.025 | 0.975] |
|---------------|---------|---------|--------|---------|--------|--------|
| Intercept     | 27.1211 | 8.231   | 3.295  | 0.002   | 10.588 | 43.654 |
| protein       | 5.8767  | 1.433   | 4.102  | 0.000   | 2.999  | 8.754  |
| fat           | 8.7752  | 1.482   | 5.920  | 0.000   | 5.798  | 11.753 |
| sodium        | 0.0062  | 0.016   | 0.381  | 0.705   | -0.026 | 0.039  |
| fiber         | 0.1062  | 1.480   | 0.072  | 0.943   | -2.866 | 3.078  |
| carbohydrates | 2.5884  | 0.380   | 6.811  | 0.000   | 1.825  | 3.352  |
| sugar         | 3.1356  | 0.365   | 8.587  | 0.000   | 2.402  | 3.869  |
| potassium     | -0.0489 | 0.052   | -0.937 | 0.353   | -0.154 | 0.056  |

|                |        |                   |          |
|----------------|--------|-------------------|----------|
| Omnibus:       | 29.387 | Durbin-Watson:    | 2.071    |
| Prob(Omnibus): | 0.000  | Jarque-Bera (JB): | 80.990   |
| Skew:          | 1.410  | Prob(JB):         | 2.59e-18 |
| Kurtosis:      | 8.055  | Cond. No.         | 1.39e+03 |

Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified. [2] The condition number is large, 1.39e+03. This might indicate that there are strong multicollinearity or other numerical problems.
The p-values for sodium, fiber and potassium are considerably high, so these variables do not help predict calories in this model. This might be due to multicollinearity (the predictor variables are highly correlated with one another). If we look at the correlation plot, fiber and potassium have a correlation of about 0.9, which is high. One way to tackle multicollinearity is principal component analysis (PCA), which we will get to in a while, but let us first make predictions and look at the evaluation metrics.
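Although the post does not do so, one quick way to quantify the multicollinearity would be to compute variance inflation factors (VIFs) for the predictors; a minimal sketch using statsmodels, assuming the `data_train` dataframe created above:

```python
# Variance inflation factors for the predictors (illustrative sketch,
# not part of the original analysis). A VIF well above ~5-10 suggests
# a predictor is strongly explained by the others.
from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.api as sm

X = sm.add_constant(data_train.drop(columns=["calories"]))
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vif.round(2))
```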
Now let us make predictions on the testing data and plot the observed vs. predicted values.
```python
# Make predictions using the testing data
predictions = result.predict(data_test)

# Observed vs. predicted plot
plt.figure(figsize=(20, 7))
plt.scatter(predictions, data_test["calories"])
plt.ylabel("Observed calories")
plt.xlabel("Predicted calories")

# Reference line y = x
x_line = np.linspace(min(data_test["calories"]), max(data_test["calories"]), 100)
y_line = x_line
plt.plot(x_line, y_line, color='red')
plt.show()
```
The observed vs. predicted plot looks good. However, the number of data points is low, so we should take this with a grain of salt. Let us check some evaluation metrics like the root mean squared error (RMSE) and mean absolute error (MAE).
```python
from sklearn.metrics import mean_absolute_error, mean_squared_error
print("Mean Absolute Error:",
      round(mean_absolute_error(data_test["calories"], predictions), 2))
print("Root Mean Squared Error:",
      round((mean_squared_error(data_test["calories"], predictions)) ** 0.5, 2))
```
Mean Absolute Error: 5.34
Root Mean Squared Error: 6.89
A mean absolute error (MAE) of 5.34 and a root mean squared error (RMSE) of 6.89 are decent and indicate the model performs fairly well.
Now we will fit a regression model based on principal component analysis, since it helps with multicollinearity.
```python
# Principal component analysis
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Separate the X and y variables for the training data
y_train = data_train.iloc[:, :1]
X0_train = data_train.iloc[:, 1:]

# Standardize the predictors
sc = StandardScaler()
X_train_scaled = sc.fit_transform(X0_train)

# Now calculate the principal components
pca = PCA()
principalComponents = pca.fit_transform(X_train_scaled)

# Training data in principal component space
X_train_pca = pd.DataFrame(data=principalComponents,
                           columns=['PC{}'.format(i + 1)
                                    for i in range(principalComponents.shape[1])])

explained_variance = pca.explained_variance_ratio_
explained_variance
```
The first six principal components explain around 99% of the variance, so we will use the first six principal components to build a regression model.
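This cutoff can be checked from the explained variance ratios; a small sketch (not in the original post), assuming the fitted `pca` object from above:

```python
# Cumulative proportion of variance explained by the principal components
cum_var = np.cumsum(pca.explained_variance_ratio_)
for i, v in enumerate(cum_var, start=1):
    print(f"First {i} components: {v:.3f}")
```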
```python
X_train_pca = pd.DataFrame(data=principalComponents,
                           columns=['PC{}'.format(i + 1)
                                    for i in range(principalComponents.shape[1])])

# Combine the X and y for the training data
data_train_pca = X_train_pca
data_train_pca.set_index(X0_train.index, inplace=True)
data_train_pca['calories'] = y_train
data_train_pca.head()
```
|    | PC1       | PC2       | PC3       | PC4       | PC5       | PC6       | PC7       | calories |
|----|-----------|-----------|-----------|-----------|-----------|-----------|-----------|----------|
| 37 | -1.860753 | 0.565043  | 0.525620  | -0.850117 | -0.237526 | 0.005600  | 0.155876  | 110      |
| 10 | -1.341762 | 1.558699  | 0.886958  | 0.766110  | -0.195248 | -0.671615 | -0.098032 | 120      |
| 2  | 4.512654  | -0.609249 | 2.068686  | -0.856303 | -0.653496 | -0.585533 | 0.433959  | 70       |
| 39 | -0.253116 | -0.667314 | 0.281191  | 0.402756  | 0.867213  | 0.766825  | -0.165899 | 140      |
| 54 | -1.338187 | -0.005653 | -2.407667 | -1.446089 | -0.053030 | -1.193701 | 0.316424  | 50       |
```python
# Correlation plot for the principal components
plt.figure(figsize=(20, 10))
sns.heatmap(data_train_pca.corr().round(4), annot=True, cmap="viridis")
plt.show()
```
We can observe that only the calories variable is correlated with the principal components, while the correlation between the principal components themselves is practically 0. So we will use the principal components to build a regression model.
```python
# Now run the OLS regression on the first six principal components
result_pca = smf.ols("calories ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC6",
                     data=data_train_pca).fit()
# Check the summary
result_pca.summary()
```
OLS Regression Results

|                   |                  |                     |          |
|-------------------|------------------|---------------------|----------|
| Dep. Variable:    | calories         | R-squared:          | 0.774    |
| Model:            | OLS              | Adj. R-squared:     | 0.747    |
| Method:           | Least Squares    | F-statistic:        | 29.03    |
| Date:             | Sun, 03 Aug 2025 | Prob (F-statistic): | 7.97e-15 |
| Time:             | 20:33:24         | Log-Likelihood:     | -206.59  |
| No. Observations: | 58               | AIC:                | 427.2    |
| Df Residuals:     | 51               | BIC:                | 441.6    |
| Df Model:         | 6                |                     |          |
| Covariance Type:  | nonrobust        |                     |          |

|           | coef     | std err | t      | P>\|t\| | [0.025  | 0.975]  |
|-----------|----------|---------|--------|---------|---------|---------|
| Intercept | 107.7586 | 1.194   | 90.277 | 0.000   | 105.362 | 110.155 |
| PC1       | -1.8336  | 0.741   | -2.475 | 0.017   | -3.321  | -0.346  |
| PC2       | 3.8950   | 0.919   | 4.240  | 0.000   | 2.051   | 5.739   |
| PC3       | 4.3355   | 1.176   | 3.686  | 0.001   | 1.974   | 6.697   |
| PC4       | 10.9172  | 1.261   | 8.654  | 0.000   | 8.385   | 13.450  |
| PC5       | 9.0880   | 1.897   | 4.791  | 0.000   | 5.279   | 12.896  |
| PC6       | 12.7554  | 2.052   | 6.217  | 0.000   | 8.636   | 16.875  |

|                |        |                   |          |
|----------------|--------|-------------------|----------|
| Omnibus:       | 21.542 | Durbin-Watson:    | 2.118    |
| Prob(Omnibus): | 0.000  | Jarque-Bera (JB): | 44.731   |
| Skew:          | 1.104  | Prob(JB):         | 1.94e-10 |
| Kurtosis:      | 6.692  | Cond. No.         | 2.77     |

Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
\(R^{2}\) is 77.4%, which is decent, and all the predictor variables have low p-values. We make predictions using the test data and then plot the out-of-sample observed vs. predicted values. First we calculate the principal components of the testing data and then make the predictions.
```python
# X for the testing data
X0_test = data_test.iloc[:, 1:]

# Scale the test data using the scaler fitted on the training data
X_test_scaled = sc.transform(X0_test)

# Calculate the principal components for the testing data
X_test = pca.transform(X_test_scaled)
X_test_pca = pd.DataFrame(data=X_test,
                          columns=['PC{}'.format(i + 1)
                                   for i in range(X_test.shape[1])])

# Calculate the predictions
predictions_pca = result_pca.predict(X_test_pca)
```
Now we plot the out-of-sample predictions from the regression model on the raw data as well as the predictions from the model using the first six principal components on the same plot, in different colors.
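The plotting code for this comparison is not shown in the post; a minimal sketch, assuming the `predictions` (raw-data model) and `predictions_pca` (PCA model) objects computed above:

```python
# Observed vs. predicted for both models on the same axes
# (illustrative sketch; the original plotting code is not shown).
plt.figure(figsize=(20, 7))
plt.scatter(predictions, data_test["calories"], color="blue", label="Raw-data model")
plt.scatter(predictions_pca, data_test["calories"], color="green", label="PCA model")

# Reference line y = x
x_line = np.linspace(min(data_test["calories"]), max(data_test["calories"]), 100)
plt.plot(x_line, x_line, color="red")

plt.xlabel("Predicted calories")
plt.ylabel("Observed calories")
plt.legend()
plt.show()
```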
The out-of-sample observed vs. predicted plot looks decent, with all the points close to the red line. We now look at the evaluation metrics for the model built using the principal components.
```python
from sklearn.metrics import mean_absolute_error, mean_squared_error
print("Mean Absolute Error:",
      round(mean_absolute_error(data_test["calories"], predictions_pca), 2))
print("Root Mean Squared Error:",
      round((mean_squared_error(data_test["calories"], predictions_pca)) ** 0.5, 2))
```
Mean Absolute Error: 4.66
Root Mean Squared Error: 6.11
For the regression model using the first six principal components, the mean absolute error (MAE) is 4.66 and the root mean squared error (RMSE) is 6.11, which is an improvement over the regression model using the raw data.