We present a complete, reproducible pipeline and empirical study for predicting hospital readmission within 30 days using the Diabetes 130-US Hospitals dataset. Our work focuses on practical preprocessing, robust imbalance handling, and a direct comparison of interpretable models (logistic regression), parametric neural models (multilayer perceptrons, MLP), and tree-ensemble approaches (XGBoost). The pipeline includes end-to-end engineering steps such as feature cleaning, encoding, scaling, oversampling via SMOTE, and feature selection using gradient-boosted gain scores. We evaluate models extensively through ROC and PR curves, loss curves, feature importance analyses, confusion matrices, and threshold tuning. We report empirical performance, highlight trade-offs between interpretability and predictive power, and discuss next steps involving temporal modeling, calibration, and causal evaluation.
Hospital readmission prediction is a longstanding operational and clinical challenge. Identifying patients at elevated risk allows clinicians to intervene and potentially reduce both unnecessary hospitalizations and associated healthcare costs. Electronic Health Records (EHRs) provide a rich but heterogeneous representation of patients’ clinical histories, encompassing demographics, admission metadata, procedures, medications, and laboratory results. However, prediction from EHR data is complicated by missing values, noisy and high-cardinality diagnosis codes, and substantial class imbalance in short-term readmission events. In this study, we develop a reproducible pipeline for predicting 30-day hospital readmissions. We evaluate three modeling paradigms: logistic regression as a linear baseline, a feedforward neural network (MLP) representing nonlinear parametric methods, and gradient-boosted trees (XGBoost) as a strong ensemble learner for structured data. Our contributions are: (i) a fully engineered preprocessing pipeline tailored for EHRs, (ii) a comparative experimental evaluation of classical and ensemble models, and (iii) a discussion of interpretability, operational trade-offs, and future directions.
We compare three supervised learning paradigms that represent distinct modeling philosophies: (i) logistic regression as a linear and interpretable baseline, (ii) multilayer perceptron (MLP) as a flexible parametric neural network, and (iii) XGBoost, a non-parametric ensemble tree-boosting approach. This combination allows us to explore the trade-offs between interpretability, nonlinearity, and predictive performance in the context of healthcare readmission prediction.
Logistic regression (LogReg) is a generalized linear model that maps input features to a probability of readmission through a logit link function. It estimates the conditional probability of the positive class as:
$$P(Y=1 \mid x) = \sigma(w^\top x + b),$$
where \(\sigma(z) = \frac{1}{1+e^{-z}}\) is the sigmoid activation, \(w\) denotes the coefficient vector, and \(b\) is the intercept.
The model is trained by minimizing the regularized binary cross-entropy loss:
$$\mathcal{L}_{\text{logreg}} = - \frac{1}{N}\sum_{i=1}^N \left[ y_i \log \hat{y}_i + (1-y_i)\log(1-\hat{y}_i)\right] + \lambda \|w\|_2^2,$$
where \(\lambda\) controls the strength of L2 regularization to prevent overfitting.
Logistic regression was chosen as a baseline due to its transparency: each coefficient \(w_j\) gives the change in log-odds per unit change in feature \(j\) (equivalently, \(e^{w_j}\) is an odds ratio), providing clinicians with an intuitive explanation of how individual risk factors contribute to readmission probability.
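As a concrete illustration, the snippet below sketches how such a baseline could be fit with scikit-learn. The variables `X_train`, `y_train`, and `X_test` are placeholders for the preprocessed splits, and the regularization strength `C` (the inverse of \(\lambda\)) is an assumed, untuned value rather than our exact setting.

```python
# Sketch of the L2-regularized logistic regression baseline (scikit-learn).
# X_train/y_train/X_test are placeholders for the preprocessed splits;
# C = 1/lambda is an illustrative default, not a tuned value.
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

logreg = make_pipeline(
    StandardScaler(),  # scale features so coefficient magnitudes are comparable
    LogisticRegression(penalty="l2", C=1.0, max_iter=1000),
)
logreg.fit(X_train, y_train)

# Predicted probabilities P(Y=1 | x) for the held-out set.
p_hat = logreg.predict_proba(X_test)[:, 1]
```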
To capture nonlinear dependencies that linear models cannot, we implemented a feedforward multilayer perceptron (MLP). The MLP applies successive affine transformations followed by nonlinear activations:
$$h_1 = \text{ReLU}(W_1x+b_1), \quad h_2 = \text{ReLU}(W_2h_1+b_2), \quad \hat{y} = \sigma(W_3h_2+b_3).$$
Here, \(h_1\) and \(h_2\) are hidden representations, \(\text{ReLU}(z) = \max(0,z)\) is the activation function, and the final output is squashed into \([0,1]\) using the sigmoid.
We employed two hidden layers of size 64 and 32, balancing model expressivity with computational tractability. Training was performed using stochastic gradient descent with backpropagation, with early stopping to avoid overfitting in the presence of limited minority-class data. Unlike LogReg, MLPs are less interpretable, but they allow complex interactions between heterogeneous EHR features (e.g., nonlinear effects of age, comorbidity, and prior visits) to be modeled effectively.
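A minimal sketch of this architecture using scikit-learn's `MLPClassifier` is shown below. The hidden-layer sizes and early stopping match the description above; the remaining hyperparameters (learning rate, patience, iteration budget) are illustrative assumptions, not the exact values used in our experiments.

```python
# Sketch of the two-hidden-layer MLP (64 -> 32 -> sigmoid output).
# Architecture and early stopping follow the text; other hyperparameters
# are illustrative assumptions.
from sklearn.neural_network import MLPClassifier

mlp = MLPClassifier(
    hidden_layer_sizes=(64, 32),  # h1 and h2 from the equations above
    activation="relu",
    solver="sgd",                 # stochastic gradient descent with backprop
    learning_rate_init=0.01,
    early_stopping=True,          # holds out a validation fraction internally
    n_iter_no_change=10,
    max_iter=500,
    random_state=42,
)
mlp.fit(X_train, y_train)
p_hat = mlp.predict_proba(X_test)[:, 1]
```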
XGBoost is a scalable and regularized gradient boosting algorithm for decision trees. It constructs an additive model:
$$\hat{y}_i = \sum_{k=1}^K f_k(x_i), \quad f_k \in \mathcal{F},$$
where \(\mathcal{F}\) is the space of regression trees, and \(K\) is the number of boosting rounds. Each tree \(f_k\) is trained to minimize the following regularized objective:
$$\mathcal{L} = \sum_{i=1}^N l(y_i, \hat{y}_i) + \sum_{k=1}^K \Omega(f_k),$$
with \(\Omega(f) = \gamma T + \tfrac{1}{2}\lambda \|w\|^2\) penalizing tree complexity through the number of leaves \(T\) and leaf weights \(w\).
By sequentially fitting trees to the residuals of previous models, XGBoost effectively captures nonlinear feature interactions and hierarchical effects common in structured healthcare data. It also includes built-in handling of missing values and a class-weighting mechanism, making it well suited to clinical prediction tasks with imbalanced data.
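A corresponding `XGBClassifier` configuration might look like the sketch below; `gamma` and `reg_lambda` map onto the \(\gamma T\) and \(\tfrac{1}{2}\lambda\|w\|^2\) terms of \(\Omega(f)\). All numeric values here are illustrative assumptions, not our tuned settings.

```python
# Sketch of an XGBoost configuration; gamma and reg_lambda correspond to
# the gamma*T and (lambda/2)*||w||^2 terms in Omega(f). Values shown are
# illustrative, not the tuned hyperparameters from our experiments.
from xgboost import XGBClassifier

xgb = XGBClassifier(
    n_estimators=300,     # K boosting rounds
    max_depth=6,
    learning_rate=0.1,
    gamma=1.0,            # minimum loss reduction per leaf (penalizes T)
    reg_lambda=1.0,       # L2 penalty on leaf weights w
    eval_metric="aucpr",  # PR AUC suits the imbalanced target
)
xgb.fit(X_train, y_train)
```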
The dataset exhibits severe imbalance: only about 11% of admissions correspond to a 30-day readmission event. To address this, we employed the Synthetic Minority Over-sampling Technique (SMOTE), which generates synthetic minority-class samples via convex combinations of existing instances. For a minority sample \(x_i\) and one of its nearest minority-class neighbors \(x_j\), a new synthetic point is created as:
$$x_{\text{new}} = x_i + \alpha (x_j - x_i), \quad \alpha \sim U(0,1).$$
This balances the training set, allowing classifiers to learn more robust decision boundaries. In addition, for XGBoost we set the `scale_pos_weight` hyperparameter to:
$$\text{scale\_pos\_weight} = \frac{\#\{Y=0\}}{\#\{Y=1\}},$$
ensuring the loss function penalizes false negatives more heavily, thus improving sensitivity to readmission cases.
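Combining the two mechanisms might look like the sketch below, which reuses the `xgb` classifier from the earlier snippet. `imbalanced-learn`'s `SMOTE` implements the convex-combination interpolation above; as in our pipeline, oversampling is applied to the training split only, and the weight is computed from the original (pre-SMOTE) class counts.

```python
# Sketch: SMOTE oversampling of the training split plus class weighting
# for XGBoost. Variable names are placeholders for the preprocessed splits.
import numpy as np
from imblearn.over_sampling import SMOTE

# Resample the training split only; the test set stays untouched.
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)

# scale_pos_weight = #negatives / #positives, from the original counts.
spw = np.sum(y_train == 0) / np.sum(y_train == 1)
xgb.set_params(scale_pos_weight=spw)
xgb.fit(X_res, y_res)
```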
The figure below shows ROC curves comparing Logistic Regression, Multilayer Perceptron (MLP), and XGBoost on the held-out test set. Consistent with expectations for tabular EHR data, XGBoost achieved the strongest overall discriminative ability, yielding the highest ROC AUC (0.62) and PR AUC (0.42).
The table below summarizes the primary evaluation metrics. Logistic Regression and MLP both reached the highest F1 score (0.93) with identical precision (0.89); MLP attains marginally higher recall (0.98 vs. 0.97). XGBoost, while superior in ROC/PR AUC, demonstrates a sharp trade-off: it excels in precision (0.93) but recall drops to 0.40, reducing its F1 score to 0.56.
These differences highlight that model choice depends not only on discriminative metrics such as AUC, but also on operational trade-offs between sensitivity and specificity in deployment contexts.
| Model | ROC AUC | PR AUC | Precision | Recall | F1 |
|---|---|---|---|---|---|
| Logistic Regression | 0.56 | 0.35 | 0.89 | 0.97 | 0.93 |
| MLP | 0.59 | 0.38 | 0.89 | 0.98 | 0.93 |
| XGBoost | 0.62 | 0.42 | 0.93 | 0.40 | 0.56 |
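To make this sensitivity/specificity trade-off explicit at deployment time, the decision threshold can be swept over the precision-recall curve rather than fixed at 0.5. The sketch below picks the threshold maximizing F1 on a validation split (`y_val` and `p_val` are placeholders for labels and predicted probabilities); in a setting where missed readmissions are costlier, a recall floor could be substituted for the F1 criterion.

```python
# Sketch of threshold tuning on the PR curve: choose the cutoff that
# maximizes F1 on a validation split. y_val/p_val are placeholders for
# validation labels and predicted probabilities.
import numpy as np
from sklearn.metrics import precision_recall_curve

precision, recall, thresholds = precision_recall_curve(y_val, p_val)
f1 = 2 * precision * recall / (precision + recall + 1e-12)
best = np.argmax(f1[:-1])  # the final PR point has no associated threshold
print(f"threshold={thresholds[best]:.3f}, F1={f1[best]:.3f}")

y_pred = (p_val >= thresholds[best]).astype(int)
```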
The following figures show detailed classification reports for each model:
Beyond predictive performance, understanding which features drive readmission risk is crucial for clinical decision-making. The figure below shows the top 30 most influential features from the logistic regression model, ranked by their coefficient magnitudes. These coefficients represent the log-odds contribution of each feature to the readmission probability.
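The ranking behind that figure can be reproduced in a few lines. The sketch below assumes the fitted `logreg` pipeline from the earlier snippet and a hypothetical `feature_names` list aligned with the model's input columns.

```python
# Sketch: rank logistic regression features by |coefficient| (log-odds
# magnitude). Assumes the fitted `logreg` pipeline from above and a
# `feature_names` list aligned with the model's input columns.
import numpy as np
import pandas as pd

coefs = logreg.named_steps["logisticregression"].coef_.ravel()
top = (
    pd.DataFrame({"feature": feature_names, "coef": coefs})
    .assign(abs_coef=lambda d: d["coef"].abs())
    .nlargest(30, "abs_coef")
)
print(top[["feature", "coef"]])
```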
Our comparative analysis reveals important trade-offs in model selection for hospital readmission prediction. While XGBoost achieves superior discriminative performance (ROC AUC: 0.62), logistic regression and MLP models provide better balance between precision and recall (F1: 0.93). The choice of model should align with deployment priorities: if the goal is to identify high-risk patients for targeted interventions, XGBoost's higher precision may be preferable despite lower recall. Conversely, if the objective is to ensure no at-risk patients are missed, the higher recall of MLP models may be more appropriate.
The SMOTE oversampling technique proved effective in addressing class imbalance, though future work could explore alternative approaches such as cost-sensitive learning or focal loss. Feature engineering and selection significantly impacted model performance, suggesting that domain expertise in identifying clinically relevant predictors is as important as algorithm choice.
Several promising directions for future research include:

- Temporal modeling of longitudinal EHR sequences rather than single-admission snapshots.
- Probability calibration, so that predicted risks can be used directly for clinical triage.
- Causal evaluation of the interventions triggered by model predictions.
- Alternative imbalance strategies such as cost-sensitive learning or focal loss.
This project was completed as part of the Data Mining course at Georgia State University. We thank our instructors and peers for their valuable feedback and support throughout this work. The dataset used in this study is publicly available from the UCI Machine Learning Repository.