Peer-Reviewed · MDPI · Algorithms · 2025

A Big Data Pipeline Approach for Predicting Real-Time Pandemic Hospitalization Risk

Vishnu S. Pendyala, Mayank Kapadia, Basanth Periyapatna Roopa Kumar, Manav Anandani, Nischitha Nagendran

Department of Applied Data Science, College of Information, Data, and Society, San Jose State University

DOI: 10.3390/a18120730

Abstract

Pandemics emphasize the importance of real-time, interpretable clinical decision-support systems for identifying high-risk patients and assisting with prompt triage, particularly in data-intensive healthcare systems. This paper describes a novel dual big-data pipeline that includes (i) a streaming module for real-time epidemiological hospitalization risk prediction and (ii) a supplementary imaging-based detection and reasoning module for chest X-rays, with COVID-19 as an example.

The first pipeline uses state-of-the-art machine learning algorithms to estimate patient-level hospitalization risk based on data from the Centers for Disease Control and Prevention's (CDC) COVID-19 Case Surveillance dataset. A Bloom filter accelerated triage by constant-time pre-screening of high-risk profiles. XGBoost was selected after significant experimentation and optimization, achieving the best minority-class F1-score (0.76) and recall (0.80), outperforming baseline models.

The second pipeline focuses on diagnostic imaging: a convolutional neural network (EfficientNet-B0) classifies chest X-rays, with Grad-CAM providing visual explanations. A lightweight GPT-based reasoning layer converts model predictions into auditable triage comments (ALERT/FLAG/LOG). CTGAN generates synthetic tabular data for streaming stress-tests.

The result is a scalable, explainable, near-real-time framework that provides a foundation for future multimodal and genomic advancements in public health readiness.

XGBoost Minority F1: 0.76
Hospitalization Recall: 0.80
Chest X-Ray Accuracy: 99.5%
External Test Accuracy: 99.3%
CDC Records Processed: 3M+
Bloom Filter Latency Gain: 3–6%

System Architecture

The proposed method integrates two complementary real-time operations into a single pandemic-triage architecture. The epidemiological pipeline combines machine learning models with Bloom filter pre-screening to estimate hospitalization risk quickly from tabular case-surveillance data, enabling large-scale screening and prioritization. The imaging-based pipeline classifies chest X-rays to validate diagnoses, uses Grad-CAM for visual explanation, and passes the results to a GPT-based reasoning layer that produces auditable triage summaries (ALERT, FLAG, or LOG).

Fig 3 — Overall Dual Big-Data Pipeline Architecture

Part 1: Epidemiological Pipeline
CDC Case Surveillance (Tabular): 3M+ patient records
Pre-Process & Encode: standardize, balance classes
Bloom Filter: O(1) high-risk pre-screen
XGBoost Classifier: Grid Search CV, F1 = 0.76
Kafka–Spark Stream: Apache Kafka 3.9, Spark UDFs
Outputs: Low Risk (no prioritization) or High Risk (requires triage)
CTGAN Synthetic Data: 300/500/1000-epoch stress test

Part 2: Imaging & Reasoning Pipeline
TCIA COVID-19 Archive: DICOM → PNG
Kaggle Radiography DB: 3,867 COVID, 10,192 Normal
Preprocessing & Augmentation: resize, normalize, flip, rotation
EfficientNet-B0: fine-tuned binary classifier
Grad-CAM Visualization: penultimate conv layer heatmaps
GPT-Based Agentic Reasoning: GPT-3.5-turbo, T = 0.2, 300 tokens
Outputs: ALERT (immediate referral), FLAG (manual review), or LOG (normal, archive)

Part 1 — Epidemiological Risk Prediction

Dataset: CDC COVID-19 Case Surveillance

The tabular component uses the CDC COVID-19 Case Surveillance Public Use dataset, retrieved directly through the CDC Socrata API in batches of 10,000 rows per request, up to 4,000,000 records. Demographic attributes (sex, age_group, race_ethnicity_combined), clinical attributes (hosp_yn, icu_yn, death_yn, medcond_yn), and case metadata (cdc_case_earliest_dt, current_status) were selected for analysis.
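The batched retrieval described above can be sketched as a simple paging loop. This is a minimal illustration, not the authors' code: the dataset identifier in `BASE_URL` is a placeholder that must be replaced with the real one, `fetch_page` and `iter_records` are hypothetical names, and the `$limit`/`$offset`/`$order` query parameters follow the general Socrata paging convention.

```python
import json
import urllib.parse
import urllib.request
from typing import Callable, Iterator

# Placeholder: substitute the actual dataset identifier (not given in the text).
BASE_URL = "https://data.cdc.gov/resource/DATASET-ID.json"


def fetch_page(offset: int, limit: int) -> list:
    """Fetch one page of rows using Socrata-style $limit/$offset paging."""
    query = urllib.parse.urlencode(
        {"$limit": limit, "$offset": offset, "$order": ":id"}
    )
    with urllib.request.urlopen(f"{BASE_URL}?{query}", timeout=60) as resp:
        return json.load(resp)


def iter_records(
    fetch: Callable[[int, int], list],
    batch_size: int = 10_000,
    max_records: int = 4_000_000,
) -> Iterator[dict]:
    """Yield rows batch by batch until the cap is reached or the source runs dry."""
    offset = 0
    while offset < max_records:
        limit = min(batch_size, max_records - offset)
        rows = fetch(offset, limit)
        if not rows:
            break
        yield from rows
        if len(rows) < limit:
            break  # a short page means there is no more data
        offset += len(rows)


# Usage (performs network requests once BASE_URL is filled in):
# records = list(iter_records(fetch_page))
```

Passing the fetch function in as an argument keeps the paging logic testable without network access.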

The dataset exhibited a significant class imbalance: hospitalized cases were substantially outnumbered by non-hospitalized ones. Random undersampling of the majority class produced a 1:1 ratio, and the resulting balanced dataset was used for all model training and evaluation.
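Random undersampling to a 1:1 ratio could look roughly like the sketch below. It assumes records are dictionaries keyed by the column names listed earlier, with `hosp_yn` as the label; the function name `undersample` and the fixed seed are illustrative choices, not details from the paper.

```python
import random
from collections import defaultdict


def undersample(rows, label_key="hosp_yn", seed=42):
    """Randomly drop rows from larger classes until every class matches the smallest."""
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[label_key]].append(row)

    target = min(len(group) for group in by_class.values())
    rng = random.Random(seed)  # seeded so the sample is reproducible

    balanced = []
    for group in by_class.values():
        balanced.extend(rng.sample(group, target))
    rng.shuffle(balanced)  # avoid leaving the classes in contiguous blocks
    return balanced


# Toy example: 90 non-hospitalized rows versus 10 hospitalized rows.
rows = [{"hosp_yn": "No"}] * 90 + [{"hosp_yn": "Yes"}] * 10
balanced = undersample(rows)
```

Undersampling discards majority-class information, so it is usually applied to the training split only when the evaluation should reflect the original class ratio.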

XGBoost Classifier with Grid Search

Several machine-learning models were compared, including Extreme Gradient Boosting (XGBoost), Random Forest, and Logistic Regression; Table 2 also reports LightGBM and CatBoost. XGBoost was chosen as the main classifier for further optimization because it handles heterogeneous feature types, captures nonlinear interactions, and copes well with class imbalance. GridSearchCV with threefold cross-validation tuned the hyperparameters n_estimators, max_depth, learning_rate, subsample, and colsample_bytree.

The best configuration found used max_depth=3 and learning_rate=0.05, producing a cross-validated F1-score of 0.759. The ten most important features were led by age_group_80_years, age_group_70_79_years, and medcond_yn_No, indicating that age and comorbidity status are the strongest predictors of hospitalization risk.

Table 2 — Model performance comparison across classifiers

Model	Accuracy	F1 (No)	F1 (Yes)	Mean F1
Logistic Regression	0.75	0.75	0.75	0.75
Random Forest	0.75	0.76	0.74	0.75
XGBoost	0.75	0.75	0.76	0.75
LightGBM	0.75	0.76	0.75	0.75
CatBoost	0.75	0.75	0.74	0.75

Bloom Filter Pre-Screening

A Bloom filter was built from high-risk rows in the dataset, using prior knowledge of hospitalization characteristics. Bloom filters answer membership queries in O(1) time with a predictable false-positive rate and no false negatives, which makes them well suited to quick triage in high-throughput environments. During inference, each incoming record was first checked against the filter: if flagged as possibly high-risk, it was forwarded to XGBoost for thorough evaluation; otherwise it was routed as low-risk without invoking the model. This reduced end-to-end latency by 3–6% across test cycles of 100,000 records per iteration.
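The pre-screening step can be illustrated with a small self-contained Bloom filter. This is a sketch under stated assumptions: the fields used to form a profile key (`age_group`, `sex`, `medcond_yn`), the helper names `profile_key` and `route`, and the sizing parameters are all hypothetical, since the text does not say how profiles were encoded.

```python
import hashlib
import math


class BloomFilter:
    """Fixed-size Bloom filter using double hashing over a SHA-256 digest."""

    def __init__(self, capacity: int, error_rate: float = 0.01):
        # Standard sizing: m = -n ln(p) / (ln 2)^2 bits, k = (m / n) ln 2 hashes.
        self.size = max(8, math.ceil(-capacity * math.log(error_rate) / math.log(2) ** 2))
        self.num_hashes = max(1, round(self.size / capacity * math.log(2)))
        self.bits = bytearray((self.size + 7) // 8)

    def _positions(self, key: str):
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # odd step avoids a zero stride
        return [(h1 + i * h2) % self.size for i in range(self.num_hashes)]

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))


def profile_key(record, fields=("age_group", "sex", "medcond_yn")):
    """Hypothetical profile encoding: join selected attributes into one string."""
    return "|".join(str(record.get(f, "")) for f in fields)


def route(record, bloom, classify):
    """Send possibly-high-risk records to the model; short-circuit the rest."""
    if profile_key(record) in bloom:
        return classify(record)  # may still come back low-risk (filter false positive)
    return "Low Risk"
```

Because a Bloom filter never returns a false negative, every profile that was added is guaranteed to reach the classifier; the only cost of a false positive is one extra model call.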

Fig 4 — Tabular Hospitalization Risk Prediction Pipeline

CDC Surveillance (Tabular)
Pre-Process & Encode: missing values, categorical encoding
Synthetic Data Gen: Random, Weighted, CTGAN
Model Selection & Grid Search: XGBoost / RF / LR
Train Bloom Filter: high-risk profiles
Kafka Integration & Spark Stream: real-time patient record intake
Bloom filter match? Yes: Trained ML Model (XGBoost, F1 = 0.76)
High Risk: requires prioritization
Low Risk: no prioritization needed

Kafka–Spark Streaming Integration

Apache Spark and Apache Kafka version 3.9.0 (Scala 2.12 build) were integrated to form an end-to-end, near-real-time pipeline. Kafka served as the message broker for continuous intake of both real and generated data streams. Spark Streaming applied user-defined functions (UDFs) that called the trained Bloom filter and the XGBoost classifier in sequence. Results were formatted as structured JSON and saved for further examination.
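The per-record logic that such a UDF might wrap can be sketched in plain Python. Everything here is an assumption made for illustration: the function name `score_message`, the `case_id` field, the 0.5 threshold, and the output schema are not taken from the paper. In PySpark, a function of this shape could be registered with `pyspark.sql.functions.udf` and applied to the Kafka message value column.

```python
import json


def score_message(raw, in_bloom, predict_proba, threshold=0.5):
    """Score one JSON-encoded record and return a JSON result string.

    in_bloom(record) -> bool      : Bloom filter pre-screen (hypothetical callable)
    predict_proba(record) -> float: model probability of hospitalization
    """
    record = json.loads(raw)
    if not in_bloom(record):
        # The pre-screen says "definitely not a known high-risk profile".
        result = {"risk": "Low Risk", "stage": "bloom", "probability": None}
    else:
        p = float(predict_proba(record))
        result = {
            "risk": "High Risk" if p >= threshold else "Low Risk",
            "stage": "model",
            "probability": round(p, 4),
        }
    result["case_id"] = record.get("case_id")  # hypothetical identifier field
    return json.dumps(result)


# Example with stand-in callables instead of a real filter and model.
message = json.dumps({"case_id": 17, "age_group": "80+ Years"})
output = score_message(message, in_bloom=lambda r: True, predict_proba=lambda r: 0.91)
```

Recording which stage produced the decision keeps the output auditable and makes it easy to measure how often the model is skipped.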

Part 2 — Chest X-Ray Classification

Dataset Composition

Table 1 — Chest X-ray dataset composition before and after augmentation

Category	After Merge (TCIA + Kaggle)	After Augmentation (Train)	External Test Set
Normal Images	10,192	3,000	317
COVID Images	3,867	3,000	116
Total	14,059	6,000	433

EfficientNet-B0 Fine-Tuning

The pre-trained EfficientNet-B0 model (trained originally on ImageNet) was adapted for binary classification by replacing its final fully connected layer with a new binary classification head (COVID-19 vs. Normal). The model was fine-tuned on the balanced dataset of 3,000 normal and 3,000 COVID-positive images, with input images resized to 224×224 and normalized. Data augmentation included random horizontal flip, rotation, brightness/contrast jitter, and affine transforms.

EfficientNet-B0 was selected after comparison with several other architectures (a custom baseline CNN, MobileNetV2, and ResNet-18). It showed the most consistent and stable performance across training runs, with the best COVID-19 sensitivity and overall F1-score. Internal accuracy reached 99.5% and external test accuracy reached 99.3%.

Grad-CAM Visual Explanation

Gradient-weighted Class Activation Mapping (Grad-CAM) was applied to the trained EfficientNet-B0 model to highlight the image regions that contribute most to each prediction. Grad-CAM takes the feature maps of the last convolutional layer, weights each map by the gradients of the predicted class score, and combines them into a heatmap that is overlaid on the original chest X-ray. Red and yellow areas indicate high model attention; blue indicates lower attention.

The model tends to focus on the lower or peripheral regions of the lung in cases of COVID-positive predictions, consistent with documented radiographic patterns for COVID-19 pneumonia (bilateral lower-lobe consolidation).
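The weighting step behind these heatmaps can be shown on toy data. The sketch below follows the standard Grad-CAM formulation (channel weights are the spatial average of the gradients, followed by a weighted sum and a ReLU); it operates on plain nested lists rather than real network tensors, and the function name `grad_cam` is illustrative.

```python
def grad_cam(activations, gradients):
    """Compute a normalized Grad-CAM map.

    activations, gradients: lists of K feature maps, each an H x W nested list,
    taken from the chosen convolutional layer for the predicted class.
    """
    # Channel weight alpha_k: global average of that channel's gradient map.
    weights = [sum(map(sum, g)) / (len(g) * len(g[0])) for g in gradients]

    height, width = len(activations[0]), len(activations[0][0])
    # Weighted sum over channels, then ReLU to keep only positive evidence.
    cam = [
        [
            max(0.0, sum(w * a[i][j] for w, a in zip(weights, activations)))
            for j in range(width)
        ]
        for i in range(height)
    ]

    # Scale to [0, 1] so the map can be rendered as a color overlay.
    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[v / peak for v in row] for row in cam]
    return cam


# Two 2x2 channels: the first supports the class, the second opposes it.
activations = [[[1, 0], [0, 0]], [[0, 0], [0, 1]]]
gradients = [[[1, 1], [1, 1]], [[-1, -1], [-1, -1]]]
heatmap = grad_cam(activations, gradients)  # [[1.0, 0.0], [0.0, 0.0]]
```

In practice the low-resolution map is upsampled to the input image size before it is blended with the X-ray.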

GPT-Based Agentic Reasoning

A GPT-based agentic reasoning module was integrated to simulate independent, clinically significant triage decision-making. The module accepts structured input from the EfficientNet-B0 prediction, confidence score, and Grad-CAM interpretation, and produces a clinical-style decision and rationale. GPT-3.5-turbo was used via the OpenAI API with temperature=0.2 and max_tokens=300. Parsing the model's textual answer yielded three standardized triage outcomes — ALERT, FLAG, or LOG — along with their justifications.

Sample GPT-generated decision — LOG response

Decision:   LOG
Reasoning:  Model predicted Normal with very high confidence (99.99%), and
            the Grad-CAM heatmap is in agreement with the expected normal
            lung appearance. There is no abnormal activation sign, and the
            output aligns with the ground truth. Logging the case is
            appropriate in this scenario.

Sample GPT-generated decision — ALERT response

Decision:   ALERT
Reasoning:  The model is calling COVID-19 with 100% certainty, and the
            Grad-CAM heatmap is showing very dense bilateral activation
            in the lower lung fields — a classic appearance of COVID-19
            pneumonia. The ground truth is COVID-19. This case should be
            referred to a physician for immediate follow-up.
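Responses in the format shown above can be parsed into a standardized outcome with a few lines of code. This is a minimal sketch, not the authors' parser: the names `parse_triage` and `Triage` are hypothetical, and falling back to FLAG (manual review) when no valid decision is found is an assumption chosen as the conservative default.

```python
import re
from typing import NamedTuple

VALID_DECISIONS = {"ALERT", "FLAG", "LOG"}

_DECISION = re.compile(r"Decision:\s*([A-Za-z]+)", re.IGNORECASE)
_REASONING = re.compile(r"Reasoning:\s*(.+)", re.IGNORECASE | re.DOTALL)


class Triage(NamedTuple):
    decision: str
    reasoning: str


def parse_triage(text: str) -> Triage:
    """Extract the decision label and its justification from a model response."""
    match = _DECISION.search(text)
    decision = match.group(1).upper() if match else ""
    if decision not in VALID_DECISIONS:
        # Assumption: anything unparseable is routed to manual review.
        decision = "FLAG"

    match = _REASONING.search(text)
    # Collapse the wrapped, indented lines into a single paragraph.
    reasoning = " ".join(match.group(1).split()) if match else ""
    return Triage(decision, reasoning)


sample = """Decision:   LOG
Reasoning:  Model predicted Normal with very high confidence,
            and the heatmap shows no abnormal activation."""
result = parse_triage(sample)
```

Validating the label against a fixed set keeps free-form model output from introducing an unexpected triage state downstream.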

Results & Discussion

Tabular Model Performance

Although overall accuracies were equivalent across models (~0.75), XGBoost was preferred because it achieved the highest recall and F1-score for hospitalized cases, reducing false negatives in this clinically critical class. Its ROC curve lies above the others across most false-positive rates, indicating stronger discriminative capacity.
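To make the relationship between accuracy, recall, and minority-class F1 concrete, the sketch below computes them from confusion-matrix counts. The counts in the example are hypothetical and chosen only to illustrate how a model with about 0.75 accuracy can still reach a recall of 0.80; they are not the paper's confusion matrix.

```python
def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Precision, recall, F1 for the positive class, plus overall accuracy."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Hypothetical balanced test set: 100 hospitalized, 100 non-hospitalized.
m = binary_metrics(tp=80, fp=30, fn=20, tn=70)
print({name: round(value, 3) for name, value in m.items()})
```

With these illustrative counts, 20 of 100 hospitalized patients are missed. Two models with the same accuracy can differ in exactly this number, which is why recall on the hospitalized class was used to choose between them.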

Chest X-Ray Performance

EfficientNet-B0 achieved 99.5% internal accuracy and 99.3% external accuracy on the held-out test set of 433 images. The external test set produced no false-positive predictions; an accuracy of 99.3% on 433 images is consistent with three misclassified images, which, in the absence of false positives, would be missed COVID-19 cases. Grad-CAM activations showed 87% overlap with ground-truth lung regions in true-positive (TP) cases.

Synthetic Data (CTGAN) Fidelity

The CTGAN-generated synthetic data was evaluated at 300, 500, and 1000 training epochs. Compared with random and weighted distribution sampling, CTGAN produced the most realistic samples, preserving both class proportions and inter-feature relationships. This supports its suitability for stress-testing the streaming pipeline.
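One simple way to check whether synthetic data preserves class proportions and pairwise feature relationships is the total variation distance between categorical distributions. The paper does not state which fidelity metric was used, so the sketch below is only one possible check; the function names and the toy rows are hypothetical.

```python
from collections import Counter


def total_variation(real, synthetic):
    """Total variation distance between two categorical samples (0 = identical)."""
    p_real, p_syn = Counter(real), Counter(synthetic)
    n_real, n_syn = len(real), len(synthetic)
    categories = set(p_real) | set(p_syn)
    return 0.5 * sum(
        abs(p_real[c] / n_real - p_syn[c] / n_syn) for c in categories
    )


def joint(rows, col_a, col_b):
    """Pair up two columns so their joint distribution can be compared."""
    return [(row[col_a], row[col_b]) for row in rows]


# Toy rows standing in for real and generated records.
real_rows = [
    {"age_group": "80+", "hosp_yn": "Yes"},
    {"age_group": "80+", "hosp_yn": "Yes"},
    {"age_group": "20-29", "hosp_yn": "No"},
    {"age_group": "20-29", "hosp_yn": "No"},
]
synthetic_rows = [
    {"age_group": "80+", "hosp_yn": "Yes"},
    {"age_group": "80+", "hosp_yn": "No"},
    {"age_group": "20-29", "hosp_yn": "Yes"},
    {"age_group": "20-29", "hosp_yn": "No"},
]

# Marginal class proportions match exactly here...
class_gap = total_variation(
    [r["hosp_yn"] for r in real_rows], [r["hosp_yn"] for r in synthetic_rows]
)
# ...but the age/hospitalization relationship does not.
joint_gap = total_variation(
    joint(real_rows, "age_group", "hosp_yn"),
    joint(synthetic_rows, "age_group", "hosp_yn"),
)
```

The toy example shows why both checks matter: a generator can reproduce the class balance perfectly while losing the dependence between age and hospitalization.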

Conclusion

This paper presents a scalable, dual big-data pipeline for real-time pandemic triage that addresses the lag between rapidly rising patient risk and the slower response of clinical decision-support systems. By combining XGBoost epidemiological risk modeling, Bloom filter pre-screening, EfficientNet-B0 chest X-ray classification, Grad-CAM explainability, and GPT-based auditable reasoning within a Kafka–Spark streaming infrastructure, the system provides dependable and auditable triage support.

The framework provides a foundation for future multimodal and genomic advancements in public health readiness. Future work will address multi-modal fusion of tabular and imaging streams, integration with electronic health records (EHR), and expansion to other respiratory conditions beyond COVID-19.

Keywords

hospitalization risk prediction; real-time streaming analytics; chest X-ray classification; explainable AI (XAI); Bloom filter; CTGAN; Kafka–Spark streaming; clinical decision support
