A Big Data Pipeline Approach for Predicting Real-Time Pandemic Hospitalization Risk
Vishnu S. Pendyala, Mayank Kapadia, Basanth Periyapatna Roopa Kumar, Manav Anandani, Nischitha Nagendran
Department of Applied Data Science, College of Information, Data, and Society, San Jose State University
Abstract
Pandemics emphasize the importance of real-time, interpretable clinical decision-support systems for identifying high-risk patients and assisting with prompt triage, particularly in data-intensive healthcare systems. This paper describes a novel dual big-data pipeline that includes (i) a streaming module for real-time epidemiological hospitalization risk prediction and (ii) a supplementary imaging-based detection and reasoning module for chest X-rays, with COVID-19 as an example.
The first pipeline uses state-of-the-art machine learning algorithms to estimate patient-level hospitalization risk from the Centers for Disease Control and Prevention's (CDC) COVID-19 Case Surveillance dataset. A Bloom filter accelerates triage through constant-time pre-screening of high-risk profiles. XGBoost was selected after extensive experimentation and optimization, achieving the best minority-class F1-score (0.76) and recall (0.80) among the models compared.
The second pipeline focuses on diagnostic imaging: a convolutional neural network (EfficientNet-B0) classifies chest X-rays, with Grad-CAM providing visual explanations. A lightweight GPT-based reasoning layer converts model predictions into auditable triage comments (ALERT/FLAG/LOG). CTGAN generates synthetic tabular data for streaming stress-tests.
The result is a scalable, explainable, near-real-time framework that provides a foundation for future multimodal and genomic advancements in public health readiness.
XGBoost Minority F1: 0.76
Hospitalization Recall: 0.80
Chest X-Ray Accuracy: 99.5%
External Test Accuracy: 99.3%
CDC Records Processed: 3M+
Bloom Filter Latency Gain: 3–6%
System Architecture
The proposed method integrates two complementary real-time operations into a single pandemic-triage architecture. The epidemiological pipeline uses machine learning models and Bloom filter pre-screening to estimate hospitalization risk quickly from tabular case-surveillance data, enabling large-scale screening and prioritization. The imaging-based pipeline classifies chest X-rays to validate diagnoses, uses Grad-CAM for visual explanation, and passes the results to a GPT-based reasoning layer that creates auditable triage summaries (ALERT, FLAG, and LOG).
Fig 3 — Overall Dual Big-Data Pipeline Architecture
Part 1 — Epidemiological Pipeline
CDC Case Surveillance (Tabular): 3M+ patient records
Pre-Process & Encode: Standardize · Balance classes
Bloom Filter: O(1) high-risk pre-screen
XGBoost Classifier: Grid Search CV · F1=0.76
Kafka–Spark Stream: Apache Kafka 3.9 · Spark UDFs
Low Risk: No prioritization
High Risk: Requires triage
CTGAN Synthetic Data: 300/500/1000 epoch stress test
Part 2 — Imaging & Reasoning Pipeline
TCIA COVID-19 Archive: DICOM → PNG
Kaggle Radiography DB: 3,867 COVID · 10,192 Normal
Preprocessing & Augmentation: Resize · Normalize · Flip · Rotation
EfficientNet-B0: Fine-tuned · Binary classifier
Grad-CAM Visualization: Penultimate conv layer heatmaps
GPT-Based Agentic Reasoning: GPT-3.5-turbo · T=0.2 · 300 tokens
ALERT: Immediate referral
FLAG: Manual review
LOG: Normal, archive
Part 1 — Epidemiological Risk Prediction
Dataset: CDC COVID-19 Case Surveillance
The tabular component uses the CDC COVID-19 Case Surveillance Public Use dataset, retrieved directly through the CDC Socrata API in batches of 10,000 rows per request, up to 4,000,000 records. Demographic attributes (sex, age_group, race_ethnicity_combined), clinical attributes (hosp_yn, icu_yn, death_yn, medcond_yn), and case metadata (cdc_case_earliest_dt, current_status) were selected for analysis.
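The batched retrieval described above can be sketched as a simple paging loop. This is a minimal illustration, not the authors' code: the resource URL, the `$order` value, and the function names `fetch_page_http` and `iter_rows` are assumptions, while `$limit` and `$offset` are the standard Socrata (SODA) paging parameters.

```python
from typing import Callable, Dict, Iterator, List

Row = Dict[str, str]

# Assumed endpoint for illustration only; verify the resource ID on the
# CDC data portal before use.
ASSUMED_URL = "https://data.cdc.gov/resource/vbim-akqf.json"


def fetch_page_http(url: str, limit: int, offset: int) -> List[Row]:
    """Fetch one page over HTTP using the SODA $limit/$offset parameters."""
    import json
    import urllib.parse
    import urllib.request

    # A fixed sort order keeps pages stable between requests.
    query = urllib.parse.urlencode(
        {"$limit": limit, "$offset": offset, "$order": ":id"}
    )
    with urllib.request.urlopen(f"{url}?{query}", timeout=60) as resp:
        return json.load(resp)


def iter_rows(fetch_page: Callable[[int, int], List[Row]],
              batch_size: int = 10_000,
              max_rows: int = 4_000_000) -> Iterator[Row]:
    """Yield rows page by page until the source is empty or max_rows is hit."""
    offset = 0
    while offset < max_rows:
        limit = min(batch_size, max_rows - offset)
        page = fetch_page(limit, offset)
        if not page:
            break
        yield from page
        offset += len(page)
        if len(page) < limit:
            break  # a short page means the source has no more rows
```

A caller would bind the URL, for example `iter_rows(lambda limit, offset: fetch_page_http(ASSUMED_URL, limit, offset))`. Passing the fetcher in as a parameter keeps the paging logic testable without network access.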
The dataset exhibited a significant class imbalance — hospitalized cases were substantially outnumbered by non-hospitalized ones. Random undersampling of the majority class produced a 1:1 ratio, yielding a final balanced dataset used for all model training and evaluation.
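Random undersampling to a 1:1 ratio can be illustrated with the standard library alone. The sketch below assumes that each record is a dictionary, that `hosp_yn` is the label, and that rows with labels other than Yes/No were filtered out beforehand. The function name `undersample` and the fixed seed are hypothetical choices, not details from the paper.

```python
import random
from collections import defaultdict
from typing import Dict, Hashable, List, Optional


def undersample(rows: List[dict], label_key: str = "hosp_yn",
                seed: Optional[int] = 42) -> List[dict]:
    """Randomly drop rows from larger classes until all classes match the smallest."""
    by_class: Dict[Hashable, List[dict]] = defaultdict(list)
    for row in rows:
        by_class[row[label_key]].append(row)

    # Size of the minority class; raises ValueError if rows is empty.
    n_min = min(len(group) for group in by_class.values())

    rng = random.Random(seed)  # seeded for reproducible splits
    balanced: List[dict] = []
    for group in by_class.values():
        balanced.extend(rng.sample(group, n_min))
    rng.shuffle(balanced)  # avoid class-ordered output
    return balanced
```

Undersampling discards majority-class information, so the balanced set is only as large as twice the minority class.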
XGBoost Classifier with Grid Search
Three machine-learning models were tested: Extreme Gradient Boosting (XGBoost), Random Forest, and Logistic Regression. XGBoost was chosen as the main classifier for further optimization because it handles heterogeneous feature types, captures nonlinear interactions, and copes well with class imbalance. GridSearchCV with threefold cross-validation optimized hyperparameters including n_estimators, max_depth, learning_rate, subsample, and colsample_bytree.
The optimal configuration was max_depth=3 and learning_rate=0.05, producing a cross-validated F1-score of 0.759. The top 10 features by importance were dominated by age_group_80_years, age_group_70_79_years, and medcond_yn_No, indicating that age and comorbidity status are the strongest predictors of hospitalization risk in this model.
Table 2 — Model performance comparison across classifiers
| Model | Accuracy | F1 (No) | F1 (Yes) | Mean F1 |
|---|---|---|---|---|
| Logistic Regression | 0.75 | 0.75 | 0.75 | 0.75 |
| Random Forest | 0.75 | 0.76 | 0.74 | 0.75 |
| XGBoost | 0.75 | 0.75 | 0.76 | 0.75 |
| LightGBM | 0.75 | 0.76 | 0.75 | 0.75 |
| CatBoost | 0.75 | 0.75 | 0.74 | 0.75 |
Bloom Filter Pre-Screening
A Bloom filter was populated with high-risk rows from the dataset, using prior knowledge of hospitalization characteristics. Bloom filters support O(1) membership queries with a tunable false-positive rate and no false negatives for inserted items, making them well suited to quick triage in high-throughput environments. During inference, each incoming record was first checked against the filter. If flagged as possibly high-risk, it was forwarded to XGBoost for thorough evaluation; otherwise it was routed as low-risk. This reduced end-to-end latency by 3–6% across test cycles of 100,000 records per iteration.
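To make the pre-screening step concrete, here is a minimal Bloom filter sketch using only the standard library. It is not the authors' implementation: the double-hashing scheme, the sizing defaults, and the `profile_key` helper (including its choice of fields) are assumptions made for illustration.

```python
import hashlib
import math
from typing import Iterable, Tuple


class BloomFilter:
    """Bit-array Bloom filter with k positions derived by double hashing."""

    def __init__(self, capacity: int, error_rate: float = 0.01) -> None:
        # Standard sizing: m = -n ln(p) / (ln 2)^2 bits, k = (m / n) ln 2 hashes.
        self.num_bits = max(
            8, math.ceil(-capacity * math.log(error_rate) / math.log(2) ** 2)
        )
        self.num_hashes = max(1, round(self.num_bits / capacity * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, key: str) -> Iterable[int]:
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # odd step, never zero
        return ((h1 + i * h2) % self.num_bits for i in range(self.num_hashes))

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key: str) -> bool:
        # True means "possibly present"; False means "definitely absent".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key)
        )


def profile_key(record: dict,
                fields: Tuple[str, ...] = ("age_group", "sex", "medcond_yn")) -> str:
    """Hypothetical key: join selected categorical fields into one string."""
    return "|".join(str(record.get(f, "")) for f in fields)
```

Each lookup touches a fixed number of bits regardless of how many profiles were inserted, which is where the constant-time behavior comes from. A negative answer is always correct, so only records that might be high-risk incur the cost of the full classifier.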
Fig 4 — Tabular Hospitalization Risk Prediction Pipeline
CDC Surveillance (Tabular)
Pre-Process & Encode: Missing values · categorical encoding
Synthetic Data Gen: Random · Weighted · CTGAN
Model Selection & Grid Search: XGBoost / RF / LR
Train Bloom Filter: High-risk profiles
Kafka Integration & Spark Stream: Real-time patient record intake
Yes: Trained ML Model (XGBoost F1=0.76)
High Risk: Requires prioritization
No: Low Risk (No prioritization needed)
Kafka–Spark Streaming Integration
Apache Spark and Apache Kafka version 3.9.0 (Scala 2.12 build) were integrated to form an end-to-end near-real-time pipeline. Kafka served as the message broker for continuous intake of both real and generated data streams. Spark Streaming deployed user-defined functions (UDFs) that called the trained Bloom filter first and then the XGBoost classifier for records the filter flagged. Results were formatted as structured JSON and saved for further examination.
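The per-record logic that such a UDF might wrap can be sketched in plain Python. This is an assumption-laden illustration rather than the deployed code: the function name `triage_record`, the output field names, and the 0.5 threshold are hypothetical, and the Bloom filter check and classifier are passed in as callables so the sketch runs without Spark or Kafka.

```python
import json
from typing import Callable


def triage_record(raw_message: str,
                  maybe_high_risk: Callable[[dict], bool],
                  predict_hospitalization: Callable[[dict], float],
                  threshold: float = 0.5) -> str:
    """Route one JSON-encoded record: Bloom filter first, classifier second."""
    record = json.loads(raw_message)

    if not maybe_high_risk(record):
        # The filter has no false negatives, so the classifier can be skipped.
        result = {"risk": "low", "stage": "bloom_filter", "score": None}
    else:
        score = float(predict_hospitalization(record))
        result = {
            "risk": "high" if score >= threshold else "low",
            "stage": "classifier",
            "score": score,
        }

    # Emit the original fields plus the triage outcome as structured JSON.
    return json.dumps({**record, **result})
```

In a Spark job, a function of this shape could be registered as a string-returning UDF and applied to the message value column of the Kafka stream.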
Part 2 — Chest X-Ray Classification
Dataset Composition
Table 1 — Chest X-ray dataset composition before and after augmentation
| Category | After Merge (TCIA + Kaggle) | After Augmentation (Train) | External Test Set |
|---|---|---|---|
| Normal Images | 10,192 | 3,000 | 317 |
| COVID Images | 3,867 | 3,000 | 116 |
| Total | 14,059 | 6,000 | 433 |
EfficientNet-B0 Fine-Tuning
The pre-trained EfficientNet-B0 model (trained originally on ImageNet) was adapted for binary classification by replacing its final fully connected layer with a new binary classification head (COVID-19 vs. Normal). The model was fine-tuned on the balanced dataset of 3,000 normal and 3,000 COVID-positive images, with input images resized to 224×224 and normalized. Data augmentation included random horizontal flip, rotation, brightness/contrast jitter, and affine transforms.
EfficientNet-B0 was selected after comparison with several other architectures (a custom baseline CNN, MobileNetV2, and ResNet-18). It gave the most stable performance across training runs and the best COVID-19 sensitivity and overall F1-score. Internal accuracy reached 99.5% and external test accuracy reached 99.3%.
Grad-CAM Visual Explanation
Gradient-weighted Class Activation Mapping (Grad-CAM) was applied to the trained EfficientNet-B0 model to highlight the regions of each input image that contribute most to the model's prediction. Grad-CAM takes the feature maps of the last convolutional layer and weights them by the gradients of the predicted class score, producing a heatmap that is overlaid on the original chest X-ray. Red/yellow areas indicate high model attention; blue indicates lower attention.
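For reference, the standard Grad-CAM formulation can be written as follows. The symbols are notation introduced here, not taken from the paper: A^k is the k-th feature map of the chosen convolutional layer, y^c is the score for class c before the softmax, and Z is the number of spatial positions in the feature map.

```latex
\alpha_k^{c} = \frac{1}{Z} \sum_{i} \sum_{j} \frac{\partial y^{c}}{\partial A^{k}_{ij}},
\qquad
L^{c}_{\text{Grad-CAM}} = \operatorname{ReLU}\!\left( \sum_{k} \alpha_k^{c} \, A^{k} \right)
```

The weights average the gradients over each feature map, and the ReLU keeps only regions that push the class score upward. The resulting coarse map is then upsampled to the input resolution before being overlaid on the image.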
The model tends to focus on the lower or peripheral regions of the lung in cases of COVID-positive predictions, consistent with documented radiographic patterns for COVID-19 pneumonia (bilateral lower-lobe consolidation).
GPT-Based Agentic Reasoning
A GPT-based agentic reasoning module was integrated to simulate independent, clinically significant triage decision-making. The module accepts structured input from the EfficientNet-B0 prediction, confidence score, and Grad-CAM interpretation, and produces a clinical-style decision and rationale. GPT-3.5-turbo was used via the OpenAI API with temperature=0.2 and max_tokens=300. Parsing the model's textual answer yielded three standardized triage outcomes — ALERT, FLAG, or LOG — along with their justifications.
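Parsing the textual answer into a standardized outcome could look like the sketch below. It assumes responses follow the "Decision: ... Reasoning: ..." shape of the samples in this section. The names `TriageDecision` and `parse_triage`, and the fallback to FLAG (manual review) for unparseable output, are hypothetical design choices rather than details confirmed by the paper.

```python
import re
from typing import NamedTuple

VALID_DECISIONS = ("ALERT", "FLAG", "LOG")


class TriageDecision(NamedTuple):
    decision: str
    reasoning: str


# Matches "Decision: <word>" followed by "Reasoning: <free text to end>".
_PATTERN = re.compile(
    r"Decision:\s*(\w+)\s*Reasoning:\s*(.+)", re.IGNORECASE | re.DOTALL
)


def parse_triage(text: str) -> TriageDecision:
    """Extract the decision label and its justification from a model response."""
    match = _PATTERN.search(text)
    if match is None:
        # Assumed fallback: anything unparseable goes to manual review.
        return TriageDecision("FLAG", "Unparseable model response; routed to manual review.")

    decision = match.group(1).upper()
    reasoning = " ".join(match.group(2).split())  # collapse wrapped lines
    if decision not in VALID_DECISIONS:
        return TriageDecision("FLAG", f"Unknown decision {decision!r}; routed to manual review.")
    return TriageDecision(decision, reasoning)
```

Defaulting to manual review on malformed output keeps a language-model formatting error from silently producing a LOG or ALERT.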
Sample GPT-generated decision — LOG response
Decision: LOG
Reasoning: Model predicted Normal with very high confidence (99.99%), and
the Grad-CAM heatmap is in agreement with the expected normal
lung appearance. There is no abnormal activation sign, and the
output aligns with the ground truth. Logging the case is
appropriate in this scenario.
Sample GPT-generated decision — ALERT response
Decision: ALERT
Reasoning: The model is calling COVID-19 with 100% certainty, and the
Grad-CAM heatmap is showing very dense bilateral activation
in the lower lung fields — a classic appearance of COVID-19
pneumonia. The ground truth is COVID-19. This case should be
referred to a physician for immediate follow-up.
Results & Discussion
Tabular Model Performance
Although overall accuracies were equivalent across models (~0.75), XGBoost was preferred because it achieved the highest recall and F1-score for hospitalized cases, reducing false negatives in this clinically critical class. Its ROC curve lies above the others at most false-positive rates, demonstrating a stronger capacity for discrimination.
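As a rough consistency check, the reported minority-class F1-score and recall imply a precision that the text does not state. The value below is derived here from the rounded figures (F1 = 0.76, R = 0.80) and is only approximate.

```latex
F_1 = \frac{2PR}{P + R}
\;\;\Longrightarrow\;\;
P = \frac{F_1 \, R}{2R - F_1}
  = \frac{0.76 \times 0.80}{2(0.80) - 0.76}
  = \frac{0.608}{0.84}
  \approx 0.72
```

Under these assumptions, roughly 72% of the cases the model flags as hospitalizations are true hospitalizations, while it captures about 80% of the actual ones. That trade-off favors recall, which matches the stated goal of reducing false negatives.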
Chest X-Ray Performance
EfficientNet-B0 achieved 99.5% internal accuracy and 99.3% accuracy on the held-out external test set of 433 images. The model produced no false-positive predictions on the external test set; at 99.3% accuracy, this implies that the roughly three misclassified images were COVID-19 cases predicted as Normal. Grad-CAM activations showed 87% overlap with ground-truth lung regions in true-positive cases.
Synthetic Data (CTGAN) Fidelity
The CTGAN-generated synthetic data was evaluated at 300, 500, and 1000 epochs. CTGAN produced the most realistic samples by maintaining both class proportions and inter-feature relationships, outperforming random and weighted distribution sampling strategies. This confirmed its suitability for streaming pipeline stress-testing.
Conclusion
This paper presents a scalable, dual big-data pipeline for real-time pandemic triage that addresses the lag between rapidly rising patient risk and the slower response of clinical decision-making systems. By combining XGBoost epidemiological risk modeling, Bloom filter pre-screening, EfficientNet-B0 chest X-ray classification, Grad-CAM explainability, and GPT-based auditable reasoning within a Kafka-Spark streaming infrastructure, the system provides dependable and auditable triage support.
The framework provides a foundation for future multimodal and genomic advancements in public health readiness. Future work will address multimodal fusion of tabular and imaging streams, integration with electronic health records (EHR), and expansion to other respiratory conditions beyond COVID-19.