Rating of cereals based on nutritional information
Can we predict the rating of cereals based on nutritional information?
Categories: analysis, python
Author: Aditya Ranade
Published: February 23, 2025
Cereals are commonly consumed for breakfast and there are plenty of options to choose from. I found a dataset on Kaggle which gives nutritional information about cereals as well as their ratings. It is not clear where the ratings come from, but I believe they are average customer ratings. Can we predict the ratings based on the nutritional information of cereals? First we look at some exploratory data analysis and later try some simple regression models. Let us first access and process the data in Python.
```python
# Load libraries
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
from plotnine import *
import numpy as np  # linear algebra
# import statsmodels.api as sm
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt
import seaborn as sns
import random
from scipy.stats import pearsonr

# Get data from github repo
path = "https://raw.githubusercontent.com//adityaranade//portfolio//refs//heads//main//cereals//cereal.csv"
df0 = pd.read_csv(path, encoding='unicode_escape')
df0.head()
```
|   | name | mfr | type | calories | protein | fat | sodium | fiber | carbo | sugars | potass | vitamins | shelf | weight | cups | rating |
|---|------|-----|------|----------|---------|-----|--------|-------|-------|--------|--------|----------|-------|--------|------|--------|
| 0 | 100% Bran | N | C | 70 | 4 | 1 | 130 | 10.0 | 5.0 | 6 | 280 | 25 | 3 | 1.0 | 0.33 | 68.402973 |
| 1 | 100% Natural Bran | Q | C | 120 | 3 | 5 | 15 | 2.0 | 8.0 | 8 | 135 | 0 | 3 | 1.0 | 1.00 | 33.983679 |
| 2 | All-Bran | K | C | 70 | 4 | 1 | 260 | 9.0 | 7.0 | 5 | 320 | 25 | 3 | 1.0 | 0.33 | 59.425505 |
| 3 | All-Bran with Extra Fiber | K | C | 50 | 4 | 0 | 140 | 14.0 | 8.0 | 0 | 330 | 25 | 3 | 1.0 | 0.50 | 93.704912 |
| 4 | Almond Delight | R | C | 110 | 2 | 2 | 200 | 1.0 | 14.0 | 8 | -1 | 25 | 3 | 1.0 | 0.75 | 34.384843 |
```python
# modify the column names
df0.columns = ['name', 'manufacturer', 'type', 'calories', 'protein', 'fat',
               'sodium', 'fiber', 'carbohydrates', 'sugar', 'potassium',
               'vitamins', 'shelf', 'weight', 'cups', 'rating']
df0.head()
```
|   | name | manufacturer | type | calories | protein | fat | sodium | fiber | carbohydrates | sugar | potassium | vitamins | shelf | weight | cups | rating |
|---|------|--------------|------|----------|---------|-----|--------|-------|---------------|-------|-----------|----------|-------|--------|------|--------|
| 0 | 100% Bran | N | C | 70 | 4 | 1 | 130 | 10.0 | 5.0 | 6 | 280 | 25 | 3 | 1.0 | 0.33 | 68.402973 |
| 1 | 100% Natural Bran | Q | C | 120 | 3 | 5 | 15 | 2.0 | 8.0 | 8 | 135 | 0 | 3 | 1.0 | 1.00 | 33.983679 |
| 2 | All-Bran | K | C | 70 | 4 | 1 | 260 | 9.0 | 7.0 | 5 | 320 | 25 | 3 | 1.0 | 0.33 | 59.425505 |
| 3 | All-Bran with Extra Fiber | K | C | 50 | 4 | 0 | 140 | 14.0 | 8.0 | 0 | 330 | 25 | 3 | 1.0 | 0.50 | 93.704912 |
| 4 | Almond Delight | R | C | 110 | 2 | 2 | 200 | 1.0 | 14.0 | 8 | -1 | 25 | 3 | 1.0 | 0.75 | 34.384843 |
```python
# select data for the histogram
df = df0[["calories", "protein", "fat", "sodium", "fiber",
          "carbohydrates", "sugar", "potassium", "rating", "name"]]
df.head()
```
|   | calories | protein | fat | sodium | fiber | carbohydrates | sugar | potassium | rating | name |
|---|----------|---------|-----|--------|-------|---------------|-------|-----------|--------|------|
| 0 | 70 | 4 | 1 | 130 | 10.0 | 5.0 | 6 | 280 | 68.402973 | 100% Bran |
| 1 | 120 | 3 | 5 | 15 | 2.0 | 8.0 | 8 | 135 | 33.983679 | 100% Natural Bran |
| 2 | 70 | 4 | 1 | 260 | 9.0 | 7.0 | 5 | 320 | 59.425505 | All-Bran |
| 3 | 50 | 4 | 0 | 140 | 14.0 | 8.0 | 0 | 330 | 93.704912 | All-Bran with Extra Fiber |
| 4 | 110 | 2 | 2 | 200 | 1.0 | 14.0 | 8 | -1 | 34.384843 | Almond Delight |
Now that we have the data ready, let us look at the histogram of each variable, namely the nutritional contents: calories, protein, fat, sodium, fiber, carbohydrates, sugar and potassium.
```python
# Use melt function for the histograms of variables
df2 = pd.melt(df, id_vars=['name'])
# df2.head()
p = (
    ggplot(df2, aes("value"))
    + geom_histogram(bins=10)
    + facet_grid(". ~ variable", scales='free_x')
    + theme(figure_size=(12, 3))
)
# If we want the density on y axis
# p = (
#     ggplot(df2, aes("value", after_stat("density")))
#     + geom_histogram(bins=10)
#     + facet_grid(". ~ variable", scales='free_x')
#     + theme(figure_size=(12, 3))
# )
p.show()
```
The histograms of the variables do not show any obvious problems. Next we look at the correlation plot, which shows the correlation between each pair of variables in visual form.
```python
# Check the correlation between the variables
plt.figure(figsize=(20, 10))
sns.heatmap(df.iloc[:, :-1].corr(), annot=True, cmap="viridis")
plt.show()
```
The rating variable has a positive correlation with all the variables except sugar, sodium, fat and calories. This seems logical and will be useful when we build a regression model. Next we take a look at the pairs plot, which gives us an idea of the relationship between each pair of variables. Most important from the prediction point of view is the last row, where rating is on the y-axis and each of the other variables is on the x-axis.
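The code that produced the pairs plot is not included above; a minimal sketch using seaborn's `pairplot` on the same `df` (my assumption; the original may well have used a different plotting function) would be:

```python
# Pairs plot of the numeric variables (a sketch; the original plotting code is not shown)
# seaborn and matplotlib.pyplot were already imported above as sns and plt
sns.pairplot(df.drop(columns=["name"]))  # drop the non-numeric 'name' column
plt.show()
```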
The scatterplots of each variable against calories can be seen in the very first row of the pairs plot. There seems to be a linear association between calories and fat, carbohydrates and protein; however, calories do not appear to have a linear association with fiber.
```python
# Split data into train and test set
indices = range(len(df))  # Create a list of indices
# Get 75% random indices
random.seed(55)  # for a reproducible example
random_indices = random.sample(indices, round(0.75 * len(df)))
# Training dataset
data_train = df.iloc[random_indices, :-1]
# Testing dataset
data_test = df.iloc[df.index.difference(random_indices), :-1]
# Build a multiple linear regression model to predict rating
# from the nutritional variables using the training data
result = smf.ols("rating ~ calories + protein + fat + sodium + fiber + carbohydrates + sugar + potassium",
                 data=data_train).fit()
# check the summary
result.summary()
```
OLS Regression Results

|  |  |  |  |
|---|---|---|---|
| Dep. Variable: | rating | R-squared: | 0.997 |
| Model: | OLS | Adj. R-squared: | 0.997 |
| Method: | Least Squares | F-statistic: | 2429. |
| Date: | Mon, 02 Jun 2025 | Prob (F-statistic): | 6.35e-61 |
| Time: | 18:41:04 | Log-Likelihood: | -65.696 |
| No. Observations: | 58 | AIC: | 149.4 |
| Df Residuals: | 49 | BIC: | 167.9 |
| Df Model: | 8 |  |  |
| Covariance Type: | nonrobust |  |  |

|  | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 54.4433 | 0.868 | 62.753 | 0.000 | 52.700 | 56.187 |
| calories | -0.2160 | 0.014 | -15.385 | 0.000 | -0.244 | -0.188 |
| protein | 3.3032 | 0.152 | 21.680 | 0.000 | 2.997 | 3.609 |
| fat | -1.8546 | 0.196 | -9.451 | 0.000 | -2.249 | -1.460 |
| sodium | -0.0573 | 0.001 | -38.480 | 0.000 | -0.060 | -0.054 |
| fiber | 3.3924 | 0.140 | 24.228 | 0.000 | 3.111 | 3.674 |
| carbohydrates | 1.0424 | 0.048 | 21.634 | 0.000 | 0.946 | 1.139 |
| sugar | -0.7696 | 0.054 | -14.272 | 0.000 | -0.878 | -0.661 |
| potassium | -0.0324 | 0.005 | -6.648 | 0.000 | -0.042 | -0.023 |

|  |  |  |  |
|---|---|---|---|
| Omnibus: | 78.045 | Durbin-Watson: | 2.069 |
| Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 801.691 |
| Skew: | -3.864 | Prob(JB): | 8.22e-175 |
| Kurtosis: | 19.492 | Cond. No. | 1.82e+03 |

Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified. [2] The condition number is large, 1.82e+03. This might indicate that there are strong multicollinearity or other numerical problems.
The p-value for each of the variables is low, so all the variables significantly affect the response variable, rating. However, the model output indicates there might be a multicollinearity issue. Multicollinearity means the predictor variables have high correlation among themselves; if we look at the correlation plot, fiber and potassium have a correlation of 0.9, which is high. One way to tackle multicollinearity is to consider principal component analysis (PCA). We will look at it in a while, but let us first try to make predictions and look at the evaluation metrics.
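Before that, as a quick numerical check of the multicollinearity flagged by the condition number, one could compute variance inflation factors (VIFs) for the predictors. This is a sketch and not part of the original analysis:

```python
# Variance inflation factors for the predictors (a sketch, not part of the original analysis)
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# predictors only (drop the response 'rating'); data_train already excludes 'name'
X_vif = sm.add_constant(data_train.drop(columns=["rating"]))
vif = pd.Series(
    [variance_inflation_factor(X_vif.values, i) for i in range(X_vif.shape[1])],
    index=X_vif.columns,
)
print(vif.round(2))  # VIF values well above 5-10 hint at problematic collinearity
```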
Now let us make predictions on the testing data and look at the observed vs. predicted plot.
```python
# Make predictions using testing data
predictions = result.predict(data_test)
# Observed vs. Predicted plot
plt.figure(figsize=(20, 7))
plt.scatter(predictions, data_test["rating"])
plt.ylabel("Observed rating")
plt.xlabel("Predicted rating")
# Create the 45-degree reference line (observed = predicted)
x_line = np.linspace(min(data_test["rating"]) - 2, max(data_test["rating"]) + 2, 100)
y_line = x_line
plt.plot(x_line, y_line, color='red')
plt.show()
```
The observed vs. predicted plot looks good. However, the number of data points is low, so we should take this with a grain of salt. Let us check some evaluation metrics, namely the Root Mean Squared Error (RMSE) and the Mean Absolute Error (MAE).
```python
from sklearn.metrics import mean_absolute_error, mean_squared_error
print("Mean Absolute Error:",
      round(mean_absolute_error(data_test["rating"], predictions), 2))
print("Root Mean Squared Error:",
      round((mean_squared_error(data_test["rating"], predictions)) ** 0.5, 2))
```
Mean Absolute Error: 1.05
Root Mean Squared Error: 1.72
A Mean Absolute Error (MAE) of 1.05 and a Root Mean Squared Error (RMSE) of 1.72 are decent and indicate the model is performing fairly well.
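For reference, with \(y_i\) the observed rating, \(\hat{y}_i\) the predicted rating and \(n\) the number of test observations, the two metrics are

\[
\text{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|,
\qquad
\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}.
\]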
Now we will run a regression model based on principal component analysis, since it helps with multicollinearity.
```python
# Principal component analysis
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# separate the X and y variables for the training data first
y_train = data_train.iloc[:, -1:]
X0_train = data_train.iloc[:, :-1]

# Standardize the predictor data first
sc = StandardScaler()
X_train_scaled = sc.fit_transform(X0_train)  # training data

# Now calculate the principal components
pca = PCA()
principalComponents = pca.fit_transform(X_train_scaled)

# Training data as a data frame of principal components
X_train_pca = pd.DataFrame(data=principalComponents,
                           columns=['PC{}'.format(i + 1) for i in range(principalComponents.shape[1])])
explained_variance = pca.explained_variance_ratio_
explained_variance
```
The first six principal components explain around 98% of the variance in the predictors, so we will use the first six principal components to build a regression model.
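The explained-variance output itself is not reproduced here; a small sketch to inspect the cumulative proportion behind the roughly 98% figure (using the `pca` object fitted above) would be:

```python
# Cumulative proportion of variance explained by the principal components (a sketch)
cum_var = np.cumsum(pca.explained_variance_ratio_)
for i, v in enumerate(cum_var, start=1):
    print(f"PC1-PC{i}: {v:.3f}")
```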
```python
X_train_pca = pd.DataFrame(data=principalComponents,
                           columns=['PC{}'.format(i + 1) for i in range(principalComponents.shape[1])])
# combine the X and y for the training data
data_train_pca = X_train_pca
data_train_pca.set_index(X0_train.index, inplace=True)
data_train_pca['rating'] = y_train
data_train_pca.head()
```
|    | PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | PC7 | PC8 | rating |
|----|-----|-----|-----|-----|-----|-----|-----|-----|--------|
| 11 | 0.865400 | 0.737424 | 3.082256 | -0.797678 | -1.695220 | -0.565655 | -0.436512 | -0.175165 | 50.764999 |
| 25 | -1.656332 | -0.688910 | -0.969216 | 1.205505 | 0.111314 | -0.454199 | 0.373535 | -0.159738 | 31.435973 |
| 19 | 0.979620 | 2.049689 | -0.266540 | -0.475020 | -0.352195 | 1.032443 | -0.089196 | -0.058422 | 40.448772 |
| 38 | -0.939218 | -0.515171 | 0.188656 | -0.014788 | 0.070899 | 0.306183 | 0.027561 | 0.007296 | 36.523683 |
| 10 | -2.157184 | 1.069870 | -0.921099 | 0.649449 | -0.525503 | 0.549343 | 0.091509 | -0.057116 | 18.042851 |
```python
# Correlation plot for principal components
plt.figure(figsize=(20, 10))
sns.heatmap(data_train_pca.corr().round(4), annot=True, cmap="viridis")
plt.show()
```
We can observe that only the rating variable has any correlation with the principal components; the correlation between the principal components themselves is 0. The correlation of rating with principal components 6 and 7 is considerably low, and hence we will not use them in the model. So we will use principal components 1, 2, 3, 4, 5 and 8 to build the regression model.
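To read these numbers off directly rather than from the heatmap, one could print the correlation of rating with each component; this is a sketch and not part of the original code:

```python
# Correlation of rating with each principal component (a sketch)
pc_corr = data_train_pca.corr()["rating"].drop("rating")
print(pc_corr.round(3))
```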
```python
# Run the OLS regression model on principal components 1-5 and 8
result_pca = smf.ols("rating ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC8",
                     data=data_train_pca).fit()
# check the summary
result_pca.summary()
```
OLS Regression Results

|  |  |  |  |
|---|---|---|---|
| Dep. Variable: | rating | R-squared: | 0.997 |
| Model: | OLS | Adj. R-squared: | 0.997 |
| Method: | Least Squares | F-statistic: | 3009. |
| Date: | Mon, 02 Jun 2025 | Prob (F-statistic): | 3.38e-63 |
| Time: | 18:41:06 | Log-Likelihood: | -68.981 |
| No. Observations: | 58 | AIC: | 152.0 |
| Df Residuals: | 51 | BIC: | 166.4 |
| Df Model: | 6 |  |  |
| Covariance Type: | nonrobust |  |  |

|  | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 43.6488 | 0.111 | 392.171 | 0.000 | 43.425 | 43.872 |
| PC1 | 7.2019 | 0.066 | 109.932 | 0.000 | 7.070 | 7.333 |
| PC2 | -5.1916 | 0.079 | -65.488 | 0.000 | -5.351 | -5.032 |
| PC3 | 2.1011 | 0.098 | 21.481 | 0.000 | 1.905 | 2.297 |
| PC4 | -2.6963 | 0.122 | -22.160 | 0.000 | -2.941 | -2.452 |
| PC5 | 3.3596 | 0.152 | 22.091 | 0.000 | 3.054 | 3.665 |
| PC8 | -7.8789 | 0.511 | -15.429 | 0.000 | -8.904 | -6.854 |

|  |  |  |  |
|---|---|---|---|
| Omnibus: | 59.001 | Durbin-Watson: | 2.019 |
| Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 340.488 |
| Skew: | -2.843 | Prob(JB): | 1.16e-74 |
| Kurtosis: | 13.419 | Cond. No. | 7.79 |

Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
The \(R^{2}\) is 99.7%, which is decent, and all the predictor variables have a low p-value. We make predictions using the test data and then plot the out-of-sample observed vs. predicted values. First we calculate the principal components of the testing data and then make the predictions.
```python
# X for testing data
X0_test = data_test.iloc[:, :-1]
# scaled test data
X_test_scaled = sc.transform(X0_test)
# calculate the principal components for the testing data
X_test = pca.transform(X_test_scaled)
X_test_pca = pd.DataFrame(data=X_test,
                          columns=['PC{}'.format(i + 1) for i in range(X_test.shape[1])])
# calculate the predictions
predictions_pca = result_pca.predict(X_test_pca)
```
Now we plot the out-of-sample predictions obtained from the regression model using the raw data as well as the predictions obtained from the model using the six principal components on the same plot, with different colors.
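The plotting code for this comparison is not included above; a minimal matplotlib sketch (the colors and labels are my own choices) could look like this:

```python
# Observed vs. predicted for both models on one plot (a sketch; original plotting code not shown)
plt.figure(figsize=(10, 7))
plt.scatter(predictions, data_test["rating"], color="blue", label="raw-data model")
plt.scatter(predictions_pca, data_test["rating"], color="green", label="PCA model")
# 45-degree reference line (observed = predicted)
x_line = np.linspace(min(data_test["rating"]) - 2, max(data_test["rating"]) + 2, 100)
plt.plot(x_line, x_line, color="red")
plt.xlabel("Predicted rating")
plt.ylabel("Observed rating")
plt.legend()
plt.show()
```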
The out-of-sample observed vs. predicted plot looks decent, with all the points falling close to the red line. We now look at the evaluation metrics for the model built using the principal components.
```python
from sklearn.metrics import mean_absolute_error, mean_squared_error
print("Mean Absolute Error:",
      round(mean_absolute_error(data_test["rating"], predictions_pca), 2))
print("Root Mean Squared Error:",
      round((mean_squared_error(data_test["rating"], predictions_pca)) ** 0.5, 2))
```
Mean Absolute Error: 1.23
Root Mean Squared Error: 1.89
For the regression model using the six principal components, the Mean Absolute Error (MAE) is 1.23 and the Root Mean Squared Error (RMSE) is 1.89, both slightly higher than for the regression model using the raw data.