Predicting productivity in garment factory

I found this dataset on UCI machine learning repository which gives the dataset regarding the productivity in a garment factory. One variable measures the target productivity which has been set by the management and the variable of interest is the actual productivity. The goal is to predict the actual predictivity using the other variables.

library(reshape2)
library(ggplot2)
library(ggh4x)
library(ggcorrplot)
library(GGally) # for pairs plot using ggplot framework
library(dplyr)
library(glmnet)
library(knitr)

# Get cars data from github repo
path <- "https://raw.githubusercontent.com/adityaranade/portfolio/refs/heads/main/productivity/productivity.csv"
data0 <- read.table(path, fill = TRUE, header = FALSE)

data0 <- read.csv(path, header = TRUE)

head(data0)

      date  quarter department      day team targeted_productivity   smv  wip
1 1/1/2015 Quarter1     sweing Thursday    8                  0.80 26.16 1108
2 1/1/2015 Quarter1 finishing  Thursday    1                  0.75  3.94   NA
3 1/1/2015 Quarter1     sweing Thursday   11                  0.80 11.41  968
4 1/1/2015 Quarter1     sweing Thursday   12                  0.80 11.41  968
5 1/1/2015 Quarter1     sweing Thursday    6                  0.80 25.90 1170
6 1/1/2015 Quarter1     sweing Thursday    7                  0.80 25.90  984
  over_time incentive idle_time idle_men no_of_style_change no_of_workers
1      7080        98         0        0                  0          59.0
2       960         0         0        0                  0           8.0
3      3660        50         0        0                  0          30.5
4      3660        50         0        0                  0          30.5
5      1920        50         0        0                  0          56.0
6      6720        38         0        0                  0          56.0
  actual_productivity
1           0.9407254
2           0.8865000
3           0.8005705
4           0.8005705
5           0.8003819
6           0.8001250

# Check the type of data
data0 |> str()

'data.frame':   1197 obs. of  15 variables:
 $ date                 : chr  "1/1/2015" "1/1/2015" "1/1/2015" "1/1/2015" ...
 $ quarter              : chr  "Quarter1" "Quarter1" "Quarter1" "Quarter1" ...
 $ department           : chr  "sweing" "finishing " "sweing" "sweing" ...
 $ day                  : chr  "Thursday" "Thursday" "Thursday" "Thursday" ...
 $ team                 : int  8 1 11 12 6 7 2 3 2 1 ...
 $ targeted_productivity: num  0.8 0.75 0.8 0.8 0.8 0.8 0.75 0.75 0.75 0.75 ...
 $ smv                  : num  26.16 3.94 11.41 11.41 25.9 ...
 $ wip                  : int  1108 NA 968 968 1170 984 NA 795 733 681 ...
 $ over_time            : int  7080 960 3660 3660 1920 6720 960 6900 6000 6900 ...
 $ incentive            : int  98 0 50 50 50 38 0 45 34 45 ...
 $ idle_time            : num  0 0 0 0 0 0 0 0 0 0 ...
 $ idle_men             : int  0 0 0 0 0 0 0 0 0 0 ...
 $ no_of_style_change   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ no_of_workers        : num  59 8 30.5 30.5 56 56 8 57.5 55 57.5 ...
 $ actual_productivity  : num  0.941 0.886 0.801 0.801 0.8 ...

# Check the rows which do not have any entries
sum(is.na(data0)) # 506 NA values

[1] 506

# Delete the NA values
data1 <- na.omit(data0)
sum(is.na(data1)) # no NA values

[1] 0

The distributions of the continuous variables on the original scale indicates some non linear relationships between the response variable actual_productivity and the other variables. So we convert the data to log scale and the relationships become close to linear. Hence we will use the data on log scale for predictions. The distribution of the data on the log scale is as follows

# Pairs plot between the explanatory variables to 
# check correlation between each pair of the 
# continuous variables
ggpairs(data1[,sapply(data1,is.numeric)])

The response variable, actual_productivity is correlated with all the variables which is good. However, the explanatory variables are correlated within themselves which is not a good indication. This indicates there is some multicollinearity. This means two variables give similar information about the response variable. One way to mitigate the effect is to consider the principal components and then use the principal components for the models. Another way is to use some regularization to mitigate the effect of multicollinearity.

# split the data into training and testing data
dataa <- data1 |> dplyr::select(targeted_productivity, incentive, 
                        over_time, actual_productivity)

# If considering original scale
data <- dataa
# # If considering log scale
# dataa$over_time <- dataa$over_time + min(dataa$over_time[dataa$over_time > 0])
# dataa$incentive <- dataa$incentive + min(dataa$incentive[dataa$incentive > 0])
# data <- dataa |> log()

# Split data in training and testing set
seed <- 55
set.seed(seed)

ind <- sample(floor(0.8*nrow(data)),
              replace = FALSE)

# Training dataset
data_train <- data[ind,]
# Testing dataset
data_test <- data[-ind,]

Based on the correlation between pairs of variables and the relationship of each variable, it seems difficult to identify which variables will be useful in the regression model. So, we will look at variable which logically should impact the productivity. The variables we consider for the analysis are targeted productivity, incentive and overtime. First, we will look at a multiple linear regression model

# Fit a multiple linear regression model
model_lm <- glm(actual_productivity ~ ., data = data_train)

# Check the summary of the model
model_lm |> summary()


Call:
glm(formula = actual_productivity ~ ., data = data_train)

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)    
(Intercept)            1.377e-01  2.624e-02   5.250 2.18e-07 ***
targeted_productivity  6.146e-01  3.641e-02  16.883  < 2e-16 ***
incentive              3.480e-03  1.287e-04  27.045  < 2e-16 ***
over_time             -2.763e-06  1.107e-06  -2.497   0.0128 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 0.005828163)

    Null deviance: 14.5158  on 551  degrees of freedom
Residual deviance:  3.1938  on 548  degrees of freedom
AIC: -1267.6

Number of Fisher Scoring iterations: 2

# Prediction on the testing dataset
y_pred_lm <- predict(model_lm, data_test)

# Data frame for observed vs predicted
df_pred_mlr <- data.frame(predicted = y_pred_lm, 
                    observed = data_test$actual_productivity)
df_pred_mlr$model <- "mlr"

# Evaluation metrics
rmse_lm <- sqrt(mean((data_test$actual_productivity-y_pred_lm)^2))
mae_lm <- mean(abs(data_test$actual_productivity-y_pred_lm))
r2_lm   <- 1 - sum((data_test$actual_productivity - y_pred_lm)^2) / sum((data_test$actual_productivity - mean(data_test$actual_productivity))^2)

Next, we will try the lasso regression which uses the \(L^1\) penalty.

# Lasso regression (L1 penalty)
model_l1_cv <- cv.glmnet(as.matrix(data_train[,-1]),
                         as.matrix(data_train[,1]),
                         alpha = 0)

#find optimal lambda value that minimizes test MSE
best_lambda_l1 <- model_l1_cv$lambda.min
best_lambda_l1

[1] 0.007170258

model_l1 <- glmnet(as.matrix(data_train[,-1]),
                   as.matrix(data_train[,1]),
                   alpha = 0, 
                   lambda = best_lambda_l1)

# Coefficients of the lasso regression model 
coef(model_l1)

4 x 1 sparse Matrix of class "dgCMatrix"
                               s0
(Intercept)          4.224381e-01
incentive           -2.986123e-04
over_time           -1.624768e-06
actual_productivity  4.535232e-01

# Prediction on the testing dataset
y_pred_l1 <- predict(model_l1,  s = best_lambda_l1,
                     newx= as.matrix(data_test[,-1]))

# Data frame for observed vs predicted
df_pred_l1 <- data.frame(predicted = as.vector(y_pred_l1), 
                    observed = data_test$actual_productivity)
df_pred_l1$model <- "lasso"

# Evaluation metrics
rmse_l1 <- sqrt(mean((data_test$actual_productivity-y_pred_l1)^2))
mae_l1 <- mean(abs(data_test$actual_productivity-y_pred_l1))
r2_l1   <- 1 - sum((data_test$actual_productivity - y_pred_l1)^2) / sum((data_test$actual_productivity - mean(data_test$actual_productivity))^2)

Next, we will try the ridge regression which uses the \(L^2\) penalty.

# Ridge regression (L2 penalty)
model_l2_cv <- cv.glmnet(as.matrix(data_train[,-1]),
                         as.matrix(data_train[,1]),
                         alpha = 1)

#find optimal lambda value that minimizes test MSE
best_lambda <- model_l2_cv$lambda.min
best_lambda

[1] 0.0001544785

model_l2 <- glmnet(as.matrix(data_train[,-1]),
                   as.matrix(data_train[,1]),
                   alpha = 1, 
                   lambda = best_lambda)

# Coefficients of the ridge regression model 
coef(model_l2)

4 x 1 sparse Matrix of class "dgCMatrix"
                               s0
(Intercept)          3.692767e-01
incentive           -7.601171e-04
over_time           -1.152732e-06
actual_productivity  5.515953e-01

# Prediction on the testing dataset
y_pred_l2 <- predict(model_l2,  s = best_lambda,
                     newx= as.matrix(data_test[,-1]))

# Data frame for observed vs predicted
df_pred_l2 <- data.frame(predicted = as.vector(y_pred_l2), 
                    observed = data_test$actual_productivity)
df_pred_l2$model <- "ridge"

# Evaluation metrics
rmse_l2 <- sqrt(mean((data_test$actual_productivity-y_pred_l2)^2))
mae_l2 <- mean(abs(data_test$actual_productivity-y_pred_l2))
r2_l2   <- 1 - sum((data_test$actual_productivity - y_pred_l2)^2) / sum((data_test$actual_productivity - mean(data_test$actual_productivity))^2)

Next, we will try the elastic net regression which is a combination of lasso (\(L^1\) penalty) and ridge (\(L^2\) penalty) regression.

# Elastic net
model_en_cv <- cv.glmnet(as.matrix(data_train[,-1]),
                         as.matrix(data_train[,1]),
                         alpha = 0.5)

#find optimal lambda value that minimizes test MSE
best_lambda_en <- model_en_cv$lambda.min
best_lambda_en

[1] 0.0002129521

model_en <- glmnet(as.matrix(data_train[,-1]),
                   as.matrix(data_train[,1]),
                   alpha = 0.5, 
                   lambda = best_lambda_en)

# Coefficients of the elastic net regression model 
coef(model_en)

4 x 1 sparse Matrix of class "dgCMatrix"
                               s0
(Intercept)          3.696915e-01
incentive           -7.585067e-04
over_time           -1.169685e-06
actual_productivity  5.510806e-01

# Prediction on the testing dataset
y_pred_en <- predict(model_en,  s = best_lambda_en,
                     newx= as.matrix(data_test[,-1]))

# Data frame for observed vs predicted
df_pred_en <- data.frame(predicted = as.vector(y_pred_en), 
                    observed = data_test$actual_productivity)
df_pred_en$model <- "elastic_net"

# Evaluation metrics
rmse_en <- sqrt(mean((data_test$actual_productivity-y_pred_en)^2))
mae_en <- mean(abs(data_test$actual_productivity-y_pred_en))
r2_en   <- 1 - sum((data_test$actual_productivity - y_pred_en)^2) / sum((data_test$actual_productivity - mean(data_test$actual_productivity))^2)

Next, we will try the tree based approach.

# Tree approach
library(rpart)
library(rpart.plot)

# Fit regression tree
model_tree <- rpart(actual_productivity ~ ., data = data_train, method = "anova")

# summary(model_tree)

# Plot
rpart.plot(model_tree, type = 3, extra = 101, fallen.leaves = TRUE)

y_pred_tree <- predict(model_tree, data_test)

# Data frame for observed vs predicted
df_pred_tree <- data.frame(predicted = y_pred_tree, 
                    observed = data_test$actual_productivity)
df_pred_tree$model <- "tree"

# Evaluation metrics
rmse_tree <- sqrt(mean((data_test$actual_productivity-y_pred_tree)^2))
mae_tree <- mean(abs(data_test$actual_productivity-y_pred_tree))
r2_tree   <- 1 - sum((data_test$actual_productivity - y_pred_tree)^2) / sum((data_test$actual_productivity - mean(data_test$actual_productivity))^2)

Next, we will try the random forest approach. In random forest approach, we build multiple trees and then average the predictions of all the trees.

# Random forest
library(randomForest)
model_rf <- randomForest(actual_productivity ~ ., data = data_train)

y_pred_rf <- predict(model_rf, data_test)

# Data frame for observed vs predicted
df_pred_rf <- data.frame(predicted = y_pred_rf, 
                    observed = data_test$actual_productivity)
df_pred_rf$model <- "random_forest"

# Evaluation metrics
rmse_rf <- sqrt(mean((data_test$actual_productivity-y_pred_rf)^2))
mae_rf <- mean(abs(data_test$actual_productivity-y_pred_rf))
r2_rf   <- 1 - sum((data_test$actual_productivity - y_pred_rf)^2) / sum((data_test$actual_productivity - mean(data_test$actual_productivity))^2)

Next, we will try the support vector machine (SVM) approach.

library(e1071)
model_svm <- svm(actual_productivity ~ ., data = data_train, 
                 kernel = "radial", cost = 10, 
                 gamma = 0.1)

# Predict on test data
y_pred_svm <- predict(model_svm, data_test)

# Data frame for observed vs predicted
df_pred_svm <- data.frame(predicted = y_pred_svm, 
                    observed = data_test$actual_productivity)
df_pred_svm$model <- "svm"

# Evaluation metrics
rmse_svm <- sqrt(mean((data_test$actual_productivity-y_pred_svm)^2))
mae_svm <- mean(abs(data_test$actual_productivity-y_pred_svm))
r2_svm   <- 1 - sum((data_test$actual_productivity - y_pred_svm)^2) / sum((data_test$actual_productivity - mean(data_test$actual_productivity))^2)

Next, we will try the xgboost (Extreme Gradient Boosting) which is a boosting algorithm which uses trains models sequentially.

# xgboost
library(xgboost)
model_xgb <- xgboost(as.matrix(data_train[,-ncol(data_train)]),
                     as.matrix(data_train[,ncol(data_train)]),
                     objective = "reg:squarederror",
                     nrounds = 100,
                     verbose = 0)

# Predict on test data
y_pred_xgb <- predict(model_xgb, as.matrix(data_test[,-ncol(data_test)]))

# Data frame for observed vs predicted
df_pred_xgb <- data.frame(predicted = y_pred_xgb, 
                    observed = data_test$actual_productivity)
df_pred_xgb$model <- "xgboost"

# Evaluation metrics
rmse_xgb <- sqrt(mean((data_test$actual_productivity-y_pred_xgb)^2))
mae_xgb <- mean(abs(data_test$actual_productivity-y_pred_xgb))
r2_xgb   <- 1 - sum((data_test$actual_productivity - y_pred_xgb)^2) / sum((data_test$actual_productivity - mean(data_test$actual_productivity))^2)

abc <- (data_test$actual_productivity-y_pred_xgb)
abcd <- abc^2
sum(abcd)/length(abcd)

[1] 0.009577971

mean(abcd)

[1] 0.009577971

sqrt(mean(abcd))

[1] 0.09786711

rmse_xgb

[1] 0.09786711

sqrt(mean(data_test$actual_productivity-y_pred_xgb)^2)

[1] 0.005350966

The observed vs. predicted for all the models side by side can be seen in the plot below

# Plot observed vs. predicted for all the models
df_pred <- rbind(df_pred_mlr,df_pred_l1,df_pred_l2,
                 df_pred_en, df_pred_tree, df_pred_rf,
                 df_pred_svm, df_pred_xgb)

# Create a observed vs. predicted plot combined for all the models
ggplot(df_pred,aes(predicted,observed))+geom_point()+
  lims(x = c(0,1.15) , y = c(0,1.15))+
  labs(y = "Observed", x="Predicted")+
  facet_grid(~model, scales="free")+
  geom_abline()+
  theme_bw(base_size = 10)

Combined evaluation metrics to compare all the models can be seen in the table below.

# Evaluation metrics for all the models
metrics_all <- data.frame(Model = c("linear_model", "lasso", "ridge",
                                 "elastic_net", "tree", "random_forest",
                                 "svm","xgboost"),
                       RMSE = c(rmse_lm,rmse_l1, rmse_l2, rmse_en, 
                                rmse_tree, rmse_rf, rmse_svm, rmse_xgb),
                       MAE = c(mae_lm,mae_l1, mae_l2, mae_en, 
                                mae_tree, mae_rf, mae_svm, mae_xgb),
                       R_squared = c(r2_lm, r2_l1, r2_l2, r2_en, 
                                r2_tree, r2_rf, r2_svm, r2_xgb))

# Print evaluation metrics
kable(metrics_all, digits = 4, caption = "Model Performance Metrics")

Model Performance Metrics
Model	RMSE	MAE	R_squared
linear_model	0.0644	0.0429	0.7117
lasso	0.0711	0.0484	0.6490
ridge	0.0670	0.0464	0.6880
elastic_net	0.0670	0.0465	0.6877
tree	0.0780	0.0492	0.5776
random_forest	0.0728	0.0428	0.6317
svm	0.0739	0.0408	0.6201
xgboost	0.0979	0.0526	0.3345

Visualization of the combined evaluation metrics can be seen in the plot below.

# To plot RMSE and MAE on the same plot
metrics_long <- melt(metrics_all, id.vars = "Model", 
                     variable.name = "Metric", 
                     value.name = "Value")

ggplot(metrics_long, aes(x = Model, y = Value, fill = Model)) +
  geom_col(show.legend = TRUE) +
  geom_text(aes(label = round(Value, 4)), 
            #position = position_dodge(width = 0.8), 
            size = 3) +
  coord_flip(clip = "off") +  # horizontal bars, no clipping of text
  facet_grid2(~Metric, scales="free")+
  labs(title = "Model Performance Comparison", 
       y = "Error Value", x = "Model") +
  theme_bw(base_size = 10) +
  theme(plot.title = element_text(face = "bold", hjust = 0.5))

Based on the evaluation metrics and observed vs. predicted plot, support vector machine and tree model seems to be the optimum models.