Predicting fuel consumption (MPG) of cars

I found this dataset on UCI machine learning repository which gives the dataset regarding the car features along with fuel consumption. The goal is to predict the fuel consumption indicated by the variable mpg based on other features of the car like horespower, displacement, weight, etc. of car. We will compare multiple Machine Learning models for the same.

library(reshape2)
library(ggplot2)
library(ggh4x)
library(ggcorrplot)
library(GGally) # for pairs plot using ggplot framework
library(dplyr)
library(glmnet)
library(knitr)

# Get cars data from github repo
path <- "https://raw.githubusercontent.com/adityaranade/portfolio/refs/heads/main/cars/autompg.data"
data0 <- read.table(path, fill = TRUE, header = FALSE)

colnames(data0) <- c("mpg","cylinders","displacement",
                     "horsepower","weight","acceleration",
                     "model_year","origin","car_name")

# Check the type of data
data0 |> str()

'data.frame':   398 obs. of  9 variables:
 $ mpg         : num  18 15 18 16 17 15 14 14 14 15 ...
 $ cylinders   : int  8 8 8 8 8 8 8 8 8 8 ...
 $ displacement: num  307 350 318 304 302 429 454 440 455 390 ...
 $ horsepower  : chr  "130.0" "165.0" "150.0" "150.0" ...
 $ weight      : num  3504 3693 3436 3433 3449 ...
 $ acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
 $ model_year  : int  70 70 70 70 70 70 70 70 70 70 ...
 $ origin      : int  1 1 1 1 1 1 1 1 1 1 ...
 $ car_name    : chr  "chevrolet chevelle malibu" "buick skylark 320" "plymouth satellite" "amc rebel sst" ...

# Convert horsepower to numeric
data0$horsepower <- as.numeric(data0$horsepower)

# Check the type of data again
data0 |> str()

'data.frame':   398 obs. of  9 variables:
 $ mpg         : num  18 15 18 16 17 15 14 14 14 15 ...
 $ cylinders   : int  8 8 8 8 8 8 8 8 8 8 ...
 $ displacement: num  307 350 318 304 302 429 454 440 455 390 ...
 $ horsepower  : num  130 165 150 150 140 198 220 215 225 190 ...
 $ weight      : num  3504 3693 3436 3433 3449 ...
 $ acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
 $ model_year  : int  70 70 70 70 70 70 70 70 70 70 ...
 $ origin      : int  1 1 1 1 1 1 1 1 1 1 ...
 $ car_name    : chr  "chevrolet chevelle malibu" "buick skylark 320" "plymouth satellite" "amc rebel sst" ...

# Check the rows which do not have any entries
sum(is.na(data0)) # 6 NA values

[1] 6

# Exclude the rows with missing information
data1 <- na.omit(data0)
sum(is.na(data1)) # no NA values

[1] 0

# Check the first 6 rows of the dataset
data1 |> head()

  mpg cylinders displacement horsepower weight acceleration model_year origin
1  18         8          307        130   3504         12.0         70      1
2  15         8          350        165   3693         11.5         70      1
3  18         8          318        150   3436         11.0         70      1
4  16         8          304        150   3433         12.0         70      1
5  17         8          302        140   3449         10.5         70      1
6  15         8          429        198   4341         10.0         70      1
                   car_name
1 chevrolet chevelle malibu
2         buick skylark 320
3        plymouth satellite
4             amc rebel sst
5               ford torino
6          ford galaxie 500

The distributions of the continuous variables on the original scale indicates some non linear relationships between the response variable mpg and the other variables. So we convert the data to log scale and the relationships become close to linear. Hence we will use the data on log scale for predictions. The distribution of the data on the log scale is as follows

# Transform the data to log scale
# exclude the last column which is car name
data <- data1[,-ncol(data1)]

# Pairs plot between the explanatory variables to 
# check correlation between each pair of the variables
ggpairs(data)

The response variable, mpg is correlated with all the variables which is good. However, the explanatory variables are correlated within themselves which is not a good indication. This indicates there is some multicollinearity. This means two variables give similar information about the response variable. One way to mitigate the effect is to consider the principal components and then use the principal components for the models. Another way is to use some regularization to mitigate the effect of multicollinearity.

# Transform the data to log scale
data <- data1[,-ncol(data1)] |> log()

# Pairs plot between the explanatory variables to 
# check correlation between each pair of the variables
ggpairs(data)

# split the data into training and testing data
seed <- 23
set.seed(seed)

ind <- sample(floor(0.8*nrow(data)),
              replace = FALSE)

# Training dataset
data_train <- data[ind,-ncol(data)]
# Testing dataset
data_test <- data[-ind,-ncol(data)]

First, we will look at a multiple linear regression model

# Fit a multiple linear regression model
model_lm <- glm(mpg ~ ., data = data_train)

# Check the summary of the model
model_lm |> summary()


Call:
glm(formula = mpg ~ ., data = data_train)

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)   1.49694    0.74566   2.008 0.045573 *  
cylinders    -0.06673    0.06434  -1.037 0.300499    
displacement -0.04162    0.05169  -0.805 0.421340    
horsepower   -0.25381    0.05798  -4.378 1.65e-05 ***
weight       -0.59676    0.08376  -7.125 7.51e-12 ***
acceleration -0.22290    0.06073  -3.670 0.000286 ***
model_year    1.94813    0.16281  11.966  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 0.01097963)

    Null deviance: 30.4993  on 312  degrees of freedom
Residual deviance:  3.3598  on 306  degrees of freedom
AIC: -514.99

Number of Fisher Scoring iterations: 2

# Prediction on the testing dataset
y_pred_lm <- predict(model_lm, data_test)

# Data frame for observed vs predicted
df_pred_mlr <- data.frame(predicted = y_pred_lm, 
                    observed = data_test$mpg)
df_pred_mlr$model <- "mlr"

# Evaluation metrics
rmse_lm <- (data_test$mpg-y_pred_lm) |> mean() |> sqrt()
mae_lm <- (data_test$mpg-y_pred_lm) |> abs() |> mean()
r2_lm   <- 1 - sum((data_test$mpg - y_pred_lm)^2) / sum((data_test$mpg - mean(data_test$mpg))^2)

Next, we will try the lasso regression which uses the \(L^1\) penalty.

# Lasso regression (L1 penalty)
model_l1_cv <- cv.glmnet(as.matrix(data_train[,-1]),
                         as.matrix(data_train[,1]),
                         alpha = 0)

#find optimal lambda value that minimizes test MSE
best_lambda_l1 <- model_l1_cv$lambda.min
best_lambda_l1

[1] 0.02813141

model_l1 <- glmnet(as.matrix(data_train[,-1]),
                   as.matrix(data_train[,1]),
                   alpha = 0, 
                   lambda = best_lambda_l1)

# Coefficients of the lasso regression model 
coef(model_l1)

7 x 1 sparse Matrix of class "dgCMatrix"
                     s0
(Intercept)   1.1562832
cylinders    -0.1119517
displacement -0.1014371
horsepower   -0.2373315
weight       -0.4181040
acceleration -0.1806215
model_year    1.7415423

# Prediction on the testing dataset
y_pred_l1 <- predict(model_l1,  s = best_lambda_l1,
                     newx= as.matrix(data_test[,-1]))

# Data frame for observed vs predicted
df_pred_l1 <- data.frame(predicted = as.vector(y_pred_l1), 
                    observed = data_test$mpg)
df_pred_l1$model <- "lasso"

# Evaluation metrics
rmse_l1 <- (data_test$mpg-y_pred_l1)^2 |> mean() |> sqrt()
mae_l1 <- (data_test$mpg-y_pred_l1) |> abs() |> mean()
r2_l1   <- 1 - sum((data_test$mpg - y_pred_l1)^2) / sum((data_test$mpg - mean(data_test$mpg))^2)

Next, we will try the ridge regression which uses the \(L^2\) penalty.

# Ridge regression (L2 penalty)
model_l2_cv <- cv.glmnet(as.matrix(data_train[,-1]),
                         as.matrix(data_train[,1]),
                         alpha = 1)

#find optimal lambda value that minimizes test MSE
best_lambda <- model_l2_cv$lambda.min
best_lambda

[1] 0.0006060729

model_l2 <- glmnet(as.matrix(data_train[,-1]),
                   as.matrix(data_train[,1]),
                   alpha = 1, 
                   lambda = best_lambda)

# Coefficients of the ridge regression model 
coef(model_l2)

7 x 1 sparse Matrix of class "dgCMatrix"
                      s0
(Intercept)   1.53082771
cylinders    -0.06263596
displacement -0.04075770
horsepower   -0.24240969
weight       -0.60741542
acceleration -0.20551149
model_year    1.93416769

# Prediction on the testing dataset
y_pred_l2 <- predict(model_l2,  s = best_lambda,
                     newx= as.matrix(data_test[,-1]))

# Data frame for observed vs predicted
df_pred_l2 <- data.frame(predicted = as.vector(y_pred_l2), 
                    observed = data_test$mpg)
df_pred_l2$model <- "ridge"

# Evaluation metrics
rmse_l2 <- (data_test$mpg-y_pred_l2)^2 |> mean() |> sqrt()
mae_l2 <- (data_test$mpg-y_pred_l2) |> abs() |> mean()
r2_l2   <- 1 - sum((data_test$mpg - y_pred_l2)^2) / sum((data_test$mpg - mean(data_test$mpg))^2)

Next, we will try the elastic net regression which is a combination of lasso (\(L^1\) penalty) and ridge (\(L^2\) penalty) regression.

# Elastic net
model_en_cv <- cv.glmnet(as.matrix(data_train[,-1]),
                         as.matrix(data_train[,1]),
                         alpha = 0.5)

#find optimal lambda value that minimizes test MSE
best_lambda_en <- model_en_cv$lambda.min
best_lambda_en

[1] 0.001212146

model_en <- glmnet(as.matrix(data_train[,-1]),
                   as.matrix(data_train[,1]),
                   alpha = 0.5, 
                   lambda = best_lambda_en)
coef(model_en)

7 x 1 sparse Matrix of class "dgCMatrix"
                      s0
(Intercept)   1.51253268
cylinders    -0.06214793
displacement -0.04589088
horsepower   -0.24457980
weight       -0.59592496
acceleration -0.20734991
model_year    1.92659723

model_en |> summary()

          Length Class     Mode   
a0        1      -none-    numeric
beta      6      dgCMatrix S4     
df        1      -none-    numeric
dim       2      -none-    numeric
lambda    1      -none-    numeric
dev.ratio 1      -none-    numeric
nulldev   1      -none-    numeric
npasses   1      -none-    numeric
jerr      1      -none-    numeric
offset    1      -none-    logical
call      5      -none-    call   
nobs      1      -none-    numeric

# Prediction on the testing dataset
y_pred_en <- predict(model_en,  s = best_lambda_en,
                     newx= as.matrix(data_test[,-1]))

# Data frame for observed vs predicted
df_pred_en <- data.frame(predicted = as.vector(y_pred_en), 
                    observed = data_test$mpg)
df_pred_en$model <- "elastic_net"

# Evaluation metrics
rmse_en <- (data_test$mpg-y_pred_en)^2 |> mean() |> sqrt()
mae_en <- (data_test$mpg-y_pred_en) |> abs() |> mean()
r2_en   <- 1 - sum((data_test$mpg - y_pred_en)^2) / sum((data_test$mpg - mean(data_test$mpg))^2)

Next, we will try the tree based approach.

# Tree approach
library(rpart)
library(rpart.plot)

# Fit regression tree
model_tree <- rpart(mpg ~ ., data = data_train, method = "anova")

# summary(model_tree)

# Plot
rpart.plot(model_tree, type = 3, extra = 101, fallen.leaves = TRUE)

y_pred_tree <- predict(model_tree, data_test)

# Data frame for observed vs predicted
df_pred_tree <- data.frame(predicted = y_pred_tree, 
                    observed = data_test$mpg)
df_pred_tree$model <- "tree"

# Evaluation metrics
rmse_tree <- (data_test$mpg-y_pred_tree)^2 |> mean() |> sqrt()
mae_tree <- (data_test$mpg-y_pred_tree) |> abs() |> mean()
r2_tree   <- 1 - sum((data_test$mpg - y_pred_tree)^2) / sum((data_test$mpg - mean(data_test$mpg))^2)

Next, we will try the random forest approach. In random forest approach, we build multiple trees and then average the predictions of all the trees.

# Random forest
library(randomForest)
model_rf <- randomForest(mpg ~ ., data = data_train)

y_pred_rf <- predict(model_rf, data_test)

# Data frame for observed vs predicted
df_pred_rf <- data.frame(predicted = y_pred_rf, 
                    observed = data_test$mpg)
df_pred_rf$model <- "random_forest"

# Evaluation metrics
rmse_rf <- (data_test$mpg-y_pred_rf)^2 |> mean() |> sqrt()
mae_rf <- (data_test$mpg-y_pred_rf) |> abs() |> mean()
r2_rf   <- 1 - sum((data_test$mpg - y_pred_rf)^2) / sum((data_test$mpg - mean(data_test$mpg))^2)

Next, we will try the support vector machine (SVM) approach.

library(e1071)
model_svm <- svm(mpg ~ ., data = data_train, 
                 kernel = "radial", cost = 10, 
                 gamma = 0.1)

# Predict on test data
y_pred_svm <- predict(model_svm, data_test)

# Data frame for observed vs predicted
df_pred_svm <- data.frame(predicted = y_pred_svm, 
                    observed = data_test$mpg)
df_pred_svm$model <- "svm"

# Evaluation metrics
rmse_svm <- (data_test$mpg-y_pred_svm)^2 |> mean() |> sqrt()
mae_svm <- (data_test$mpg-y_pred_svm) |> abs() |> mean()
r2_svm   <- 1 - sum((data_test$mpg - y_pred_svm)^2) / sum((data_test$mpg - mean(data_test$mpg))^2)

Next, we will try the extreme gradiant boosting (xgboost) approach.

# xgboost
library(xgboost)
model_xgb <- xgboost(as.matrix(data_train[,-1]),
                     as.matrix(data_train[,1]),
                     objective = "reg:squarederror",
                     nrounds = 100,
                     verbose = 0)

# Predict on test data
y_pred_xgb <- predict(model_xgb, as.matrix(data_test[,-1]))

# Data frame for observed vs predicted
df_pred_xgb <- data.frame(predicted = y_pred_xgb, 
                    observed = data_test$mpg)
df_pred_xgb$model <- "xgboost"

# Evaluation metrics
rmse_xgb <- (data_test$mpg-y_pred_xgb)^2 |> mean() |> sqrt()
mae_xgb <- (data_test$mpg-y_pred_xgb) |> abs() |> mean()
r2_xgb   <- 1 - sum((data_test$mpg - y_pred_xgb)^2) / sum((data_test$mpg - mean(data_test$mpg))^2)

The observed vs. predicted for all the models side by side can be seen in the plot below

# Plot observed vs. predicted for all the models
df_pred <- rbind(df_pred_mlr,df_pred_l1,df_pred_l2,
                 df_pred_en, df_pred_tree, df_pred_rf,
                 df_pred_svm,df_pred_xgb)

# Create a observed vs. predicted plot combined for all the models
ggplot(df_pred,aes(predicted,observed))+geom_point()+
  lims(x = c(2.5,4) , y = c(2.5,4))+
  labs(y = "Observed", x="Predicted")+
  facet_grid(~model, scales="free")+
  geom_abline()+
  theme_bw(base_size = 15)

Combined evaluation metrics to compare all the models can be seen in the table below.

# Evaluation metrics for all the models
metrics_all <- data.frame(Model = c("linear_model", "lasso", "ridge",
                                 "elastic_net", "tree", "random_forest",
                                 "svm","xgboost"),
                       RMSE = c(rmse_lm,rmse_l1, rmse_l2, rmse_en, 
                                rmse_tree, rmse_rf, rmse_svm, rmse_xgb),
                       MAE = c(mae_lm,mae_l1, mae_l2, mae_en, 
                                mae_tree, mae_rf, mae_svm, mae_xgb),
                       R_squared = c(r2_lm, r2_l1, r2_l2, r2_en, 
                                r2_tree, r2_rf, r2_svm, r2_xgb))

# Print evaluation metrics
kable(metrics_all, digits = 2, caption = "Model Performance Metrics")

Model Performance Metrics
Model	RMSE	MAE	R_squared
linear_model	0.26	0.12	0.38
lasso	0.16	0.12	0.33
ridge	0.16	0.12	0.38
elastic_net	0.16	0.12	0.38
tree	0.24	0.19	-0.52
random_forest	0.18	0.14	0.17
svm	0.15	0.11	0.45
xgboost	0.16	0.12	0.32

Visualization of the combined evaluation metrics can be seen in the plot below.

# To plot RMSE and MAE on the same plot
metrics_long <- melt(metrics_all, id.vars = "Model", 
                     variable.name = "Metric", 
                     value.name = "Value")

ggplot(metrics_long, aes(x = Model, y = Value, fill = Model)) +
  geom_col(show.legend = FALSE) +
  geom_text(aes(label = round(Value, 2)), 
            #position = position_dodge(width = 0.8), 
            size = 3) +
  coord_flip(clip = "off") +  # horizontal bars, no clipping of text
  facet_grid2(~Metric, scales="free")+
  labs(title = "Model Performance Comparison", 
       y = "Error Value", x = "Model") +
  theme_bw(base_size = 10) +
  theme(plot.title = element_text(face = "bold", hjust = 0.5))

Based on the evaluation metrics and observed vs. predicted plot, support vector machine model seems to be the best model.