Predicting Hospital Readmission Risk from Electronic Health Records: A Comparative Study of Classical and Ensemble Models

Georgia State University

Project Overview

A comprehensive machine learning pipeline for predicting 30-day hospital readmissions using the Diabetes 130-US Hospitals dataset.

Abstract

We present a complete, reproducible pipeline and empirical study for predicting hospital readmission within 30 days using the Diabetes 130-US Hospitals dataset. Our work focuses on practical preprocessing, robust imbalance handling, and a direct comparison of interpretable models (logistic regression), parametric neural models (multilayer perceptrons, MLP), and tree-ensemble approaches (XGBoost). The pipeline includes end-to-end engineering steps such as feature cleaning, encoding, scaling, oversampling via SMOTE, and feature selection using gradient-boosted gain scores. We evaluate models extensively through ROC and PR curves, loss curves, feature importance analyses, confusion matrices, and threshold tuning. We report empirical performance, highlight trade-offs between interpretability and predictive power, and discuss next steps involving temporal modeling, calibration, and causal evaluation.

Introduction

Hospital readmission prediction is a longstanding operational and clinical challenge. Identifying patients at elevated risk allows clinicians to intervene and potentially reduce both unnecessary hospitalizations and associated healthcare costs. Electronic Health Records (EHRs) provide a rich but heterogeneous representation of patients’ clinical histories, encompassing demographics, admission metadata, procedures, medications, and laboratory results. However, prediction from EHR data is complicated by missing values, noisy and high-cardinality diagnosis codes, and substantial class imbalance in short-term readmission events. In this study, we develop a reproducible pipeline for predicting 30-day hospital readmissions. We evaluate three modeling paradigms: logistic regression as a linear baseline, a feedforward neural network (MLP) representing nonlinear parametric methods, and gradient-boosted trees (XGBoost) as a strong ensemble learner for structured data. Our contributions are: (i) a fully engineered preprocessing pipeline tailored for EHRs, (ii) a comparative experimental evaluation of classical and ensemble models, and (iii) a discussion of interpretability, operational trade-offs, and future directions.

Methods

We compare three supervised learning paradigms that represent distinct modeling philosophies: (i) logistic regression as a linear and interpretable baseline, (ii) multilayer perceptron (MLP) as a flexible parametric neural network, and (iii) XGBoost, a non-parametric ensemble tree-boosting approach. This combination allows us to explore the trade-offs between interpretability, nonlinearity, and predictive performance in the context of healthcare readmission prediction.

Logistic Regression

Logistic regression (LogReg) is a generalized linear model that maps input features to a probability of readmission through a logit link function. It estimates the conditional probability of the positive class as:

$$P(Y=1 \mid x) = \sigma(w^\top x + b),$$

where \(\sigma(z) = \frac{1}{1+e^{-z}}\) is the sigmoid activation, \(w\) denotes the coefficient vector, and \(b\) is the intercept.

The model is trained by minimizing the regularized binary cross-entropy loss:

$$\mathcal{L}_{\text{logreg}} = - \frac{1}{N}\sum_{i=1}^N \left[ y_i \log \hat{y}_i + (1-y_i)\log(1-\hat{y}_i)\right] + \lambda \|w\|_2^2,$$

where \(\lambda\) controls the strength of L2 regularization to prevent overfitting.

Logistic regression was chosen as a baseline due to its transparency: coefficients \(w_j\) can be directly interpreted as log-odds ratios, providing clinicians with an intuitive explanation of how individual risk factors contribute to readmission probability.
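To make the baseline concrete, here is a minimal sketch using scikit-learn; `X` and `y` stand for the preprocessed feature matrix and binary readmission labels, and the split parameters and regularization strength `C` (the inverse of \(\lambda\)) are illustrative rather than our tuned values.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Stratified split preserves the ~11% positive rate in both partitions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# C is the inverse of the L2 penalty strength lambda in the loss above.
logreg = LogisticRegression(penalty="l2", C=1.0, max_iter=1000)
logreg.fit(X_train, y_train)

# Coefficients are log-odds contributions; exponentiating gives odds ratios.
odds_ratios = np.exp(logreg.coef_.ravel())
```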

Multilayer Perceptron

To capture nonlinear dependencies that linear models cannot, we implemented a feedforward multilayer perceptron (MLP). The MLP applies successive affine transformations followed by nonlinear activations:

$$h_1 = \text{ReLU}(W_1x+b_1), \quad h_2 = \text{ReLU}(W_2h_1+b_2), \quad \hat{y} = \sigma(W_3h_2+b_3).$$

Here, \(h_1\) and \(h_2\) are hidden representations, \(\text{ReLU}(z) = \max(0,z)\) is the activation function, and the final output is squashed into \([0,1]\) using the sigmoid.

We employed two hidden layers of size 64 and 32, balancing model expressivity with computational tractability. Training was performed using stochastic gradient descent with backpropagation and early stopping to avoid overfitting in the presence of limited minority-class data. Unlike LogReg, MLPs are less interpretable, but they can model complex interactions among heterogeneous EHR features (e.g., nonlinear effects of age, comorbidity, and prior visits).
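A minimal sketch of this architecture, assuming scikit-learn; the (64, 32) layer sizes follow the text, while the remaining settings are illustrative.

```python
from sklearn.neural_network import MLPClassifier

mlp = MLPClassifier(
    hidden_layer_sizes=(64, 32),  # h1 and h2 from the equations above
    activation="relu",
    solver="sgd",                 # stochastic gradient descent with backprop
    early_stopping=True,          # hold out a validation split to stop early
    validation_fraction=0.1,
    max_iter=500,
    random_state=42,
)
mlp.fit(X_train, y_train)
proba = mlp.predict_proba(X_test)[:, 1]  # sigmoid output in [0, 1]
```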

XGBoost

XGBoost is a scalable and regularized gradient boosting algorithm for decision trees. It constructs an additive model:

$$\hat{y}_i = \sum_{k=1}^K f_k(x_i), \quad f_k \in \mathcal{F},$$

where \(\mathcal{F}\) is the space of regression trees, and \(K\) is the number of boosting rounds. Each tree \(f_k\) is trained to minimize the following regularized objective:

$$\mathcal{L} = \sum_{i=1}^N l(y_i, \hat{y}_i) + \sum_{k=1}^K \Omega(f_k),$$

with \(\Omega(f) = \gamma T + \tfrac{1}{2}\lambda \|w\|^2\) penalizing tree complexity through the number of leaves \(T\) and leaf weights \(w\).

By sequentially fitting trees to the pseudo-residuals (negative gradients) of the current ensemble, XGBoost effectively captures nonlinear feature interactions and hierarchical effects common in structured healthcare data. It also includes built-in handling of missing values and a class-weighting mechanism, making it well suited to clinical prediction tasks with imbalanced data.
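A sketch of the boosted-tree model, assuming the xgboost package; the hyperparameter values shown are illustrative, not our tuned configuration.

```python
from xgboost import XGBClassifier

xgb = XGBClassifier(
    n_estimators=300,   # K boosting rounds
    max_depth=4,
    learning_rate=0.1,
    reg_lambda=1.0,     # L2 penalty on leaf weights (lambda in Omega)
    gamma=0.0,          # per-leaf complexity penalty (gamma in Omega)
    eval_metric="aucpr",
)
xgb.fit(X_train, y_train)
```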

Imbalance Handling

The dataset exhibits severe imbalance: only about 11% of admissions correspond to a 30-day readmission event. To address this, we employed the Synthetic Minority Over-sampling Technique (SMOTE), which generates synthetic minority-class samples via convex combinations of existing instances. For a minority sample \(x_i\) and one of its k nearest minority-class neighbors \(x_j\), a new synthetic point is created as:

$$x_{\text{new}} = x_i + \alpha (x_j - x_i), \quad \alpha \sim U(0,1).$$

This balances the training set, allowing classifiers to learn more robust decision boundaries. In addition, for XGBoost we set the scale_pos_weight hyperparameter to:

$$\text{scale\_pos\_weight} = \frac{\#\{Y=0\}}{\#\{Y=1\}},$$

ensuring the loss function penalizes false negatives more heavily, thus improving sensitivity to readmission cases.
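A minimal sketch of both mechanisms, assuming the imbalanced-learn (imblearn) package; note that SMOTE is fit on the training split only, never on held-out data.

```python
import numpy as np
from imblearn.over_sampling import SMOTE

# Oversample the minority class in the training split only.
smote = SMOTE(random_state=42)
X_res, y_res = smote.fit_resample(X_train, y_train)

# Class-weighting alternative for XGBoost, computed from the raw training labels.
scale_pos_weight = np.sum(y_train == 0) / np.sum(y_train == 1)
```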

Results

The figure below shows ROC curves comparing Logistic Regression, Multilayer Perceptron (MLP), and XGBoost on the held-out test set. Consistent with expectations for tabular EHR data, XGBoost achieved the strongest overall discriminative ability, yielding the highest ROC AUC (0.62) and PR AUC (0.42).

ROC curves comparing models
Figure 1: ROC curves comparing Logistic Regression, MLP, and XGBoost on the test set. XGBoost achieves the strongest discriminative performance (highest AUC).

The table below summarizes the primary evaluation metrics. Logistic Regression and MLP both reached the highest F1 score (0.93) with nearly identical profiles: at the same precision (0.89), MLP attains marginally higher recall (0.98 vs. 0.97). XGBoost, while superior in ROC/PR AUC, demonstrates a sharp trade-off: it excels in precision (0.93) but at a much lower recall (0.40), yielding a reduced F1 score (0.56).

These differences highlight that model choice depends not only on discriminative metrics such as AUC, but also on operational trade-offs between sensitivity and specificity in deployment contexts.

Table 1: Primary evaluation metrics on the 30-day readmission task.

| Model               | ROC AUC | PR AUC | Precision | Recall | F1   |
|---------------------|---------|--------|-----------|--------|------|
| Logistic Regression | 0.56    | 0.35   | 0.89      | 0.97   | 0.93 |
| MLP                 | 0.59    | 0.38   | 0.89      | 0.98   | 0.93 |
| XGBoost             | 0.62    | 0.42   | 0.93      | 0.40   | 0.56 |
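For reference, metrics of this kind can be computed with scikit-learn as sketched below; `proba` is assumed to hold a model's predicted positive-class probabilities on the test set, and 0.5 is the default decision threshold (threshold tuning shifts the precision/recall balance).

```python
from sklearn.metrics import (
    roc_auc_score, average_precision_score,
    precision_score, recall_score, f1_score,
)

y_pred = (proba >= 0.5).astype(int)  # default decision threshold
print("ROC AUC  :", roc_auc_score(y_test, proba))
print("PR AUC   :", average_precision_score(y_test, proba))
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
print("F1       :", f1_score(y_test, y_pred))
```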

Cross-Validation Performance

We employed a nested cross-validation scheme with 5 outer folds and 3 inner folds for hyperparameter tuning. This approach yields approximately unbiased performance estimates and prevents data leakage: in particular, SMOTE was applied only within each training fold, never to the corresponding validation data. The table below presents mean metrics and standard deviations across the five outer-fold evaluations.

Table 2: Cross-validation results (5-fold outer loop, mean ± std dev). Results reflect unbiased estimates of generalization performance with proper separation of training and validation data.

| Model   | ROC AUC       | PR AUC        | Precision     | Recall        | F1            | Accuracy      |
|---------|---------------|---------------|---------------|---------------|---------------|---------------|
| LogReg  | 0.561 ± 0.018 | 0.352 ± 0.015 | 0.889 ± 0.023 | 0.972 ± 0.010 | 0.928 ± 0.012 | 0.867 ± 0.019 |
| MLP     | 0.588 ± 0.025 | 0.378 ± 0.021 | 0.890 ± 0.028 | 0.981 ± 0.008 | 0.933 ± 0.016 | 0.874 ± 0.022 |
| XGBoost | 0.621 ± 0.031 | 0.421 ± 0.027 | 0.926 ± 0.019 | 0.403 ± 0.042 | 0.560 ± 0.038 | 0.783 ± 0.035 |

Validation Insights: Cross-validation results demonstrate the consistency of our findings across different data partitions. The modest standard deviations (ranging from 0.008 to 0.042) indicate stable model behavior, suggesting our conclusions are robust and not artifacts of a single test split. XGBoost shows the largest variance across folds (std = 0.031 for ROC AUC, 0.042 for Recall), reflecting its tendency to fit sharper decision boundaries compared to linear and neural models. The correspondence between cross-validation estimates and final test performance (differences < 0.02 for all models) validates our nested cross-validation strategy and confirms good generalization to unseen data.
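A minimal sketch of this protocol, assuming imbalanced-learn's Pipeline so that SMOTE is re-fit inside each training fold and never touches validation data; the logistic-regression grid shown is illustrative.

```python
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

# SMOTE runs only on the training portion of each fold via the pipeline.
pipe = Pipeline([("smote", SMOTE(random_state=42)),
                 ("clf", LogisticRegression(max_iter=1000))])

# Inner loop (3 folds): hyperparameter tuning.
inner = GridSearchCV(pipe, {"clf__C": [0.01, 0.1, 1.0]},
                     cv=StratifiedKFold(3, shuffle=True, random_state=42),
                     scoring="roc_auc")

# Outer loop (5 folds): unbiased generalization estimate.
outer_scores = cross_val_score(inner, X, y,
                               cv=StratifiedKFold(5, shuffle=True, random_state=42),
                               scoring="roc_auc")
print(outer_scores.mean(), outer_scores.std())
```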

Feature Correlation Analysis

Understanding feature relationships is essential for interpreting model behavior and identifying potential multicollinearity. The following visualizations reveal the correlation structure among top-100 selected features used in our models.

Feature Correlation Heatmap
Figure 2: Pearson correlation heatmap of top-100 selected features with the 30-day readmission target. Warmer colors indicate stronger positive correlations; cooler colors indicate negative correlations. The sparse pattern of strong correlations reflects the heterogeneous nature of EHR data and suggests complex, non-linear feature interactions underlie readmission risk.
Feature Correlation Cluster Map
Figure 3: Hierarchical clustering dendrogram of feature correlations. Features are grouped by similarity, revealing natural clusters: (1) demographics and admission characteristics, (2) prior visit history and utilization patterns, and (3) medication and laboratory features. This structure informs feature engineering and highlights potential redundancy patterns across feature groups.
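Plots of this kind can be produced as sketched below, assuming pandas and seaborn; `df_top` is an illustrative DataFrame holding the top-100 selected features plus the binary target.

```python
import seaborn as sns
import matplotlib.pyplot as plt

corr = df_top.corr(method="pearson")

# Figure 2: plain correlation heatmap, diverging palette centered at zero.
sns.heatmap(corr, cmap="coolwarm", center=0)
plt.show()

# Figure 3: hierarchically clustered heatmap grouping similar features.
sns.clustermap(corr, cmap="coolwarm", center=0)
plt.show()
```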

SMOTE Effect on Class Imbalance

The dataset exhibits severe class imbalance with only ~11% positive readmission cases. SMOTE (Synthetic Minority Over-sampling Technique) generates synthetic samples to balance the training set by creating convex combinations of existing minority-class instances. This visualization demonstrates how SMOTE transforms the feature space to improve model learning.

SMOTE Effect Visualization
Figure 4: SMOTE effect visualization: (Left) Original imbalanced distribution showing readmission cases (orange) heavily outnumbered by non-readmissions (blue). (Right) After SMOTE oversampling, synthetic minority samples (light orange) fill gaps between real minority instances while preserving neighborhood structure, increasing the effective minority sample size from ~11,000 to ~78,000. Because synthetic points lie on segments between real minority neighbors, local geometry is preserved, which limits the overfitting that plain duplication of minority samples would induce.

Classification Reports

The following figures show detailed classification reports for each model:

Logistic Regression Classification Report
Figure 5: Classification report for Logistic Regression.
MLP Classification Report
Figure 6: Classification report for MLP.
XGBoost Classification Report
Figure 7: Classification report for XGBoost.
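Per-class reports of this kind can be generated as sketched here, assuming scikit-learn; `models` is an illustrative dict mapping names to the fitted estimators from earlier.

```python
from sklearn.metrics import classification_report

for name, model in models.items():
    y_pred = model.predict(X_test)
    print(name)
    print(classification_report(y_test, y_pred, digits=2))
```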

Model Interpretation and Feature Importance

Beyond predictive performance, understanding which features drive readmission risk is crucial for clinical decision-making. The figure below shows the top 30 most influential features from the logistic regression model, ranked by their coefficient magnitudes. These coefficients represent the log-odds contribution of each feature to the readmission probability.

Top 30 Logistic Regression Coefficients
Figure 8: Top 30 feature coefficients from logistic regression, showing the most influential predictors of hospital readmission. Positive coefficients indicate increased readmission risk.
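A sketch of the coefficient ranking behind Figure 8, assuming the fitted `logreg` from earlier and an illustrative `feature_names` list aligned with the training matrix columns.

```python
import pandas as pd

coefs = pd.Series(logreg.coef_.ravel(), index=feature_names)

# Rank by absolute magnitude; positive values raise the log-odds of readmission.
top30 = coefs.loc[coefs.abs().sort_values(ascending=False).index][:30]
print(top30)
```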

Discussion

Our comparative analysis reveals important trade-offs in model selection for hospital readmission prediction. While XGBoost achieves superior discriminative performance (ROC AUC: 0.62), logistic regression and MLP models provide better balance between precision and recall (F1: 0.93). The choice of model should align with deployment priorities: if the goal is to identify high-risk patients for targeted interventions, XGBoost's higher precision may be preferable despite lower recall. Conversely, if the objective is to ensure no at-risk patients are missed, the higher recall of MLP models may be more appropriate.

The SMOTE oversampling technique proved effective in addressing class imbalance, though future work could explore alternative approaches such as cost-sensitive learning or focal loss. Feature engineering and selection significantly impacted model performance, suggesting that domain expertise in identifying clinically relevant predictors is as important as algorithm choice.

Future Directions

Several promising directions for future research include:

  • Temporal Modeling: Incorporating time-series patterns and sequential patient histories using recurrent neural networks (RNNs) or transformers.
  • Model Calibration: Ensuring predicted probabilities accurately reflect true readmission risks through calibration techniques like Platt scaling or isotonic regression (a sketch follows this list).
  • Causal Inference: Moving beyond correlation to identify causal relationships using propensity score matching or instrumental variables.
  • Deployment Considerations: Addressing fairness across demographic groups, model interpretability for clinical staff, and integration with existing EHR systems.
  • External Validation: Testing model generalization on datasets from different hospitals and patient populations.
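As a concrete example of the calibration direction above, a minimal sketch assuming scikit-learn; `xgb` is the fitted classifier from earlier, `method="sigmoid"` corresponds to Platt scaling and `method="isotonic"` to isotonic regression.

```python
from sklearn.calibration import CalibratedClassifierCV

# Wrap the classifier; cross-validated calibration avoids refitting on test data.
calibrated = CalibratedClassifierCV(xgb, method="sigmoid", cv=5)
calibrated.fit(X_train, y_train)
calibrated_proba = calibrated.predict_proba(X_test)[:, 1]
```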

Acknowledgments

This project was completed as part of the Data Mining course at Georgia State University. We thank our instructors and peers for their valuable feedback and support throughout this work. The dataset used in this study is publicly available from the UCI Machine Learning Repository.