I found a dataset on the UCI Machine Learning Repository related to the red and white variants of the Portuguese “Vinho Verde” wine, with physicochemical variables as inputs and a sensory variable (output) available. Due to privacy, the wine brand, type of grape and selling price are not provided. In this post, we will try to predict the quality based on the physicochemical variables.
library(reshape2)
library(ggplot2)
library(ggh4x)
library(ggcorrplot)
library(GGally) # for pairs plot using ggplot framework

# Load the data
path <- "https://raw.githubusercontent.com/adityaranade/portfolio/refs/heads/main/wine/winequality-red.csv"
data0 <- read.csv(path, header = TRUE)

# Data processing
# Check the type of data
data0 |> str()

# Work with a copy named `data`, which the rest of the post refers to
data <- data0
# Pairs plot between the explanatory variables to
# check correlation between each pair of the variables
ggpairs(data[, -ncol(data)])
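Since ggcorrplot is loaded above, a correlation heatmap is another quick way to inspect the pairwise relationships; this snippet is a sketch and is not part of the original analysis.

# Correlation heatmap of the explanatory variables (sketch)
corr_mat <- cor(data[, -ncol(data)])
ggcorrplot(corr_mat, lab = TRUE, lab_size = 2)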
The correlation plot indicates moderate multicollinearity. The histogram of the response variable quality is shown below. Most of the entries take the value 5, 6 or 7; very few have the value 3, 4 or 8.
# Check the histogram of the response variable
ggplot(data, aes(quality)) +
  geom_histogram() +
  theme_bw()
This can be treated as count data, so we could use Poisson regression. However, Poisson regression assumes the mean and variance of the response variable quality are equal.
# Check if the mean and variance of the response variable are the same
# Mean
data$quality |> mean()
[1] 5.636023
# Variance
data$quality |> var()
[1] 0.6521684
The mean is considerably greater than the variance, so a Poisson regression model will not work well. Alternatives are a quasi-Poisson model or a negative binomial model. We will start with a basic multiple linear regression model. First we process the data and standardize all the predictor variables.
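We do not pursue these models in this post, but for reference a quasi-Poisson fit, which estimates a dispersion parameter rather than forcing the variance to equal the mean, could look like the sketch below (the all-predictors formula is an assumption for illustration).

# Sketch only: quasi-Poisson regression on all predictors (not used in the rest of the post)
model_qp <- glm(quality ~ ., data = data, family = quasipoisson())
# The estimated dispersion parameter should come out well below 1,
# consistent with the underdispersion noted above
summary(model_qp)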
# Standardize the data
# Scale everything except the response variable (last column)
data2 <- data
data2[, -ncol(data)] <- scale(data[, -ncol(data)])

# Split the data into training and testing data
seed <- 23
set.seed(seed)
ind <- sample(1:nrow(data2), floor(0.8 * nrow(data2)), replace = FALSE)

# Training dataset
data_train <- data2[ind, ] |> as.data.frame()

# Testing dataset
data_test <- data2[-ind, ] |> as.data.frame()
data_test |> dim()
[1] 320 12
Now we fit a linear regression model with quality as the response and volatile acidity, chlorides, total sulfur dioxide, pH, sulphates and alcohol as predictors. These variables were selected after trying multiple combinations of predictor variables. Once we build the model and make predictions, we round the predicted response and build a predicted vs. actual table. This will help us understand how many values were predicted correctly.
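The chunk that fits this model is not shown above, but the `model` object is used in the prediction step below. A minimal sketch consistent with the description, assuming the dot-separated column names that read.csv produces for this file:

# Sketch of the fitting step (assumed, since the original chunk is not shown);
# column names such as volatile.acidity are assumed from read.csv's name mangling
model <- lm(quality ~ volatile.acidity + chlorides + total.sulfur.dioxide +
              pH + sulphates + alcohol, data = data_train)
summary(model)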
# Prediction on the testing dataset
y_pred <- predict(model, data_test, type = "response")

# # Create an observed vs. predicted plot
# ggplot(NULL, aes(y_pred, data_test$quality)) + geom_point() +
#   labs(y = "Observed", x = "Predicted") + theme_minimal() + geom_abline()

table_lm <- table(Predicted = as.factor(round(y_pred)),
                  Actual = as.factor(data_test$quality))
table_lm
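To summarize the table with a single number, we can compute the proportion of test wines whose rounded prediction matches the actual quality; this check is not in the original post.

# Proportion of exactly matched quality ratings (sketch)
mean(round(y_pred) == data_test$quality)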
Since there is some multicollinearity in the data, we can try L2 regularization, which is also called ridge regression. This shrinks the coefficients of the model towards zero.
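Concretely, ridge regression minimizes the residual sum of squares plus a squared penalty on the coefficients,

$$\hat{\beta}^{\text{ridge}} = \arg\min_{\beta} \; \sum_{i=1}^{n}\left(y_i - x_i^\top \beta\right)^2 + \lambda \sum_{j=1}^{p} \beta_j^2,$$

where the tuning parameter $\lambda$ controls the amount of shrinkage and is chosen by cross-validation below.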
library(glmnet)
Loading required package: Matrix
Loaded glmnet 4.1-8
model_l2_cv <- cv.glmnet(as.matrix(data_train[, -ncol(data_train)]),
                         as.matrix(data_train[, ncol(data_train)]),
                         alpha = 0) # alpha = 0 corresponds to ridge (L2) regression

# Find the optimal lambda value that minimizes the cross-validated MSE
best_lambda <- model_l2_cv$lambda.min
best_lambda

# Fit the ridge regression model at the selected lambda
model_l2 <- glmnet(as.matrix(data_train[, -ncol(data_train)]),
                   as.matrix(data_train[, ncol(data_train)]),
                   alpha = 0, lambda = best_lambda)
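A quick way to see how the cross-validated error behaves along the lambda path (not shown in the post) is the built-in plot method for cv.glmnet objects.

# Cross-validated MSE as a function of log(lambda) (sketch)
plot(model_l2_cv)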
# Prediction on the testing dataset
y_pred_l2 <- predict(model_l2, s = best_lambda,
                     newx = as.matrix(data_test[, -ncol(data_test)]))

table_l2 <- table(Predicted = as.factor(round(y_pred_l2)),
                  Actual = as.factor(data_test$quality))
table_l2
Now, we will try an XGBoost (eXtreme Gradient Boosting) model.
# XGBoost
library(xgboost)

# Convert to matrix (xgboost requires the data in matrix or DMatrix format)
data_train_matrix <- as.matrix(data_train[, -ncol(data_train)])
target <- data_train[, ncol(data_train)]

# Create a DMatrix object (this is how xgboost stores and handles data)
dtrain <- xgb.DMatrix(data = data_train_matrix, label = target)

params <- list(
  objective = "reg:squarederror", # Objective function (regression)
  eta = 0.1,                      # Learning rate
  max_depth = 6,                  # Maximum depth of trees
  colsample_bytree = 0.8,         # Subsample fraction of features
  subsample = 0.8,                # Subsample fraction of data
  alpha = 0.1                     # L1 regularization on weights
)

xgb_model <- xgb.train(
  params = params,
  data = dtrain,
  nrounds = 100,                    # Number of boosting rounds (iterations)
  watchlist = list(train = dtrain), # To monitor the training process
  print_every_n = 10                # Print every 10 iterations
)
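Before scoring the test set, it can also be informative to look at which predictors the boosted trees rely on most; this step is not in the original post.

# Feature importance from the trained booster (sketch)
importance <- xgb.importance(feature_names = colnames(data_train_matrix),
                             model = xgb_model)
importance
xgb.plot.importance(importance)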
# Make predictions on the testing data
data_test_matrix <- as.matrix(data_test[, -ncol(data_test)])
predictions <- predict(xgb_model, data_test_matrix)
table(round(predictions), data_test$quality)
Based on the multiple models, the eXtreme Gradient Boosting (xgboost) model achieves the highest accuracy in our case. It predicted the quality correctly for 230 out of 320 cases in the testing dataset, which translates to an accuracy of around 71.88%.
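As a quick check on these numbers, the exact-match accuracy of each model on the test set can be computed directly from the rounded predictions. This summary is a sketch and is not part of the original post.

# Exact-match accuracy on the test set for each model (sketch)
c(
  lm      = mean(round(y_pred) == data_test$quality),
  ridge   = mean(round(y_pred_l2) == data_test$quality),
  xgboost = mean(round(predictions) == data_test$quality)
)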