I found this dataset on the UCI Machine Learning Repository, which contains data for gallstone disease identification. We first perform exploratory data analysis and then build predictive models: logistic regression, a support vector classifier, and a k nearest neighbours classifier. Let us start by accessing and processing the data in Python.
```python
# Load libraries
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
from plotnine import *
import numpy as np  # linear algebra
# import statsmodels.api as sm
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt
import seaborn as sns
import random
import openpyxl

# Get gallstone data from github repo
path = "https://raw.githubusercontent.com/adityaranade/portfolio/refs/heads/main/gallstone/gallstone_dataset.csv"
df0 = pd.read_csv(path)
df0.head()
```
|   | Gallstone Status | Age | Gender | Comorbidity | Coronary Artery Disease (CAD) | Hypothyroidism | Hyperlipidemia | Diabetes Mellitus (DM) | Height | Weight | ... | High Density Lipoprotein (HDL) | Triglyceride | Aspartat Aminotransferaz (AST) | Alanin Aminotransferaz (ALT) | Alkaline Phosphatase (ALP) | Creatinine | Glomerular Filtration Rate (GFR) | C-Reactive Protein (CRP) | Hemoglobin (HGB) | Vitamin D |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 50 | 0 | 0 | 0 | 0 | 0 | 0 | 185 | 92.8 | ... | 40.0 | 134.0 | 20.0 | 22.0 | 87.0 | 0.82 | 112.47 | 0.0 | 16.0 | 33.0 |
| 1 | 0 | 47 | 0 | 1 | 0 | 0 | 0 | 0 | 176 | 94.5 | ... | 43.0 | 103.0 | 14.0 | 13.0 | 46.0 | 0.87 | 107.10 | 0.0 | 14.4 | 25.0 |
| 2 | 0 | 61 | 0 | 0 | 0 | 0 | 0 | 0 | 171 | 91.1 | ... | 43.0 | 69.0 | 18.0 | 14.0 | 66.0 | 1.25 | 65.51 | 0.0 | 16.2 | 30.2 |
| 3 | 0 | 41 | 0 | 0 | 0 | 0 | 0 | 0 | 168 | 67.7 | ... | 59.0 | 53.0 | 20.0 | 12.0 | 34.0 | 1.02 | 94.10 | 0.0 | 15.4 | 35.4 |
| 4 | 0 | 42 | 0 | 0 | 0 | 0 | 0 | 0 | 178 | 89.6 | ... | 30.0 | 326.0 | 27.0 | 54.0 | 71.0 | 0.82 | 112.47 | 0.0 | 16.8 | 40.6 |

5 rows × 39 columns
```python
# Select specific columns
df = df0[["Gallstone Status",
          "Vitamin D",
          "Total Body Water (TBW)",
          "Lean Mass (LM) (%)",
          "C-Reactive Protein (CRP)"]]
# df.head()
```
```python
# Use melt function to get the data in long format for the histograms
df2 = pd.melt(df, id_vars=['Gallstone Status'])
# df2.head()
```
Now that we have the data ready, let us look at the histogram of each variable, namely vitamin D, total body water, lean mass percentage, and C-reactive protein, faceted by gallstone status.
```python
p = (
    ggplot(df2, aes("value"))
    + geom_histogram(bins=20)
    + facet_grid("Gallstone Status ~ variable", scales="free")
    + theme_bw()
)
p.show()
```
The histograms do not reveal any problems: there are no extreme outliers or degenerate distributions in either group. Next, we will look at the correlation plot.
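The plotting code for the correlation plot is not shown in this excerpt; below is a minimal sketch using the seaborn library imported above. The styling choices are assumptions, not necessarily what produced the original figure.

```python
# Correlation among the four predictors (the response is in column 0)
corr = df.iloc[:, 1:].corr()

# Heatmap of the correlation matrix
plt.figure(figsize=(6, 5))
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Correlation between predictors")
plt.tight_layout()
plt.show()
```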
The correlation plot indicates weak association between all the variables, which suggests severe multicollinearity is unlikely and is not a cause for concern. Next we will look at the pairs plot, which shows the pairwise relationships.
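Again, the original plotting code is omitted here; one way to produce a lower-triangular pairs plot coloured by gallstone status is seaborn's `pairplot` (a sketch, not necessarily the original code):

```python
# Pairwise scatterplots (lower triangle only), coloured by gallstone status
sns.pairplot(df, hue="Gallstone Status", corner=True)
plt.show()
```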
The scatterplots of each pair of variables can be seen in the lower triangular panels, and none of them show a strong association between any two variables. We will now fit the different models: logistic regression, k nearest neighbours, and a support vector classifier. First we split the data into training (70%) and testing (30%) sets and standardize the predictors.
```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Split data into training (70%) and testing (30%) sets
df_train0, df_test0 = train_test_split(df, test_size=0.3, random_state=23)

# Scale the predictors (exclude the first column, the response)
scaler = StandardScaler()
df_train = df_train0.copy()
df_test = df_test0.copy()
df_train.iloc[:, 1:] = scaler.fit_transform(df_train0.iloc[:, 1:])
df_test.iloc[:, 1:] = scaler.transform(df_test0.iloc[:, 1:])

X_train = df_train.iloc[:, 1:]
y_train = df_train.iloc[:, 0]
X_test = df_test.iloc[:, 1:]
y_test = df_test.iloc[:, 0]
```
We start with the logistic regression model.
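The modelling code does not appear in this excerpt; below is a minimal sketch with scikit-learn. The hyperparameters (default logistic regression settings, k = 5 neighbours, an RBF-kernel SVC) are assumptions, not necessarily the values used in the original analysis.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Fit logistic regression on the standardized training data
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
print("Logistic regression accuracy:",
      accuracy_score(y_test, logreg.predict(X_test)))
```

The k nearest neighbours and support vector classifiers follow the same fit/predict pattern, and the test accuracies of all three models can be collected into a single table:

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# k nearest neighbours (k = 5 is an assumed default)
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

# Support vector classifier with the default RBF kernel (assumed)
svc = SVC().fit(X_train, y_train)

# Combine the test accuracy of all three models
results = pd.DataFrame({
    "model": ["logistic regression", "k nearest neighbours",
              "support vector classifier"],
    "accuracy": [accuracy_score(y_test, m.predict(X_test))
                 for m in (logreg, knn, svc)],
})
print(results)
```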
The combined evaluation metrics can be seen in the table above. The k nearest neighbours model has the highest accuracy, although the logistic regression and support vector classifier models are not far behind. The complete code for the analysis can be found here