A Multimodal Generative AI Framework for Cancer Pathology Classification
Mayank Kapadia, Basanth Periyapatna Roopa Kumar, Nischitha Nagendran
Department of Applied Data Science · San Jose State University
Abstract
We propose a multimodal generative AI framework for cancer pathology that combines three capabilities: (i) histopathology image classification, (ii) clinical note classification, and (iii) prompt-driven clinical captioning using retrieval-augmented generation (RAG). The approach establishes vision baselines on PatchCamelyon (PCam) and builds a complementary text pipeline by curating and labeling TCGA BRCA clinical notes via two independent annotation routes, so that the supervision is more reliable.
We investigate caption utility by generating image-based descriptions and training a lightweight caption-to-label classifier to assess RAG's downstream value. Together, these components establish a unified data and model foundation, define interfaces for combining image and text representations, and provide an evidence-based approach to short, clinically grounded summaries.
The main objective is a practical, auditable procedure that improves diagnostic support by aligning visual findings with structured language, while still allowing parameter-efficient fine-tuning and eventual integration into clinical decision tools.
| Metric | Value |
|---|---|
| ViT-Base/16 ROC-AUC | 0.9601 |
| ViT-Base/16 F1 | 0.8852 |
| ClinicalBERT LoRA F1 | 0.94 ± 0.05 |
| PCam Training Images | 262,144 |
| BRCA Clinical Notes | 2,380 |
| Tree-of-Thought ROUGE-L | 0.320 |
System Architecture
System Architecture — Multimodal Generative AI Framework
Branch 1 — Histopathology Images
PatchCamelyon (PCam)
262,144 train · 32,768 val · 32,768 test
Tensor Conversion & Normalization
Standard preprocessing — no augmentation
EfficientNet-B0
Acc 0.8760 · ROC-AUC 0.9508
ResNet-50
Acc 0.8871 · ROC-AUC 0.9500
ViT-Base/16
Acc 0.8898 · ROC-AUC 0.9601
BLIP-2 (FLAN-T5-XL) Captioning
Caption quality via CLIP similarity
Caption → Label Classifier (FLAN-T5)
Caption signal analysis
Branch 2 — Clinical Notes (TCGA BRCA)
TCGA Pathology Reports (~48K)
Filter: BRCA cohort → 1,105 reports
PyTesseract OCR
Keyword + Negation Rules
TCGA Metadata
Morphological codes
Class Imbalance (1,190 M vs 89 B)
Upsample benign → 1:1 ratio
Final Balanced Dataset
2,380 reports · Train 1,428 / Val 476 / Test 476
ClinicalBERT Classifier
Pretrained on clinical notes
SFT → TAPT → LoRA Fine-Tuning
Best: rank r=4 · α=8 · LR 2e-3 · F1=0.94
Unified — RAG Caption Generation
GPT-4o Ground Truth Captions
Selected via DeepSeek-as-Judge from 300 stratified notes
LiquidAI RAG Pipeline
Fine-tuned with LoRA · 4 prompting strategies
Zero-Shot
Self-Reflect
Self-Ask
Tree-of-Thought
Datasets
Histopathology Images — PatchCamelyon (PCam)
Experiments use the PatchCamelyon (PCam) histopathology patch dataset via the torchvision PCAM interface. Each example is a color image patch paired with a binary label (benign vs. malignant). Official splits are preserved: 262,144 training, 32,768 validation, and 32,768 testing images. No relabeling, rebalancing, or re-splitting was performed. All reported metrics are computed on the fixed splits for strict reproducibility.
Clinical Notes — TCGA BRCA
Clinical notes were obtained from The Cancer Genome Atlas (TCGA) portal, focusing on the Breast Invasive Carcinoma (BRCA) cohort. From roughly 48K pathology reports spanning multiple cancer types, 1,105 BRCA-specific reports were extracted. Two annotation approaches were used: Approach 1 (PyTesseract OCR + keyword/negation rules) and Approach 2 (TCGA metadata morphological codes). The labeled data showed considerable class imbalance (1,190 malignant vs. 89 benign), which was addressed by randomly upsampling the benign class to a 1:1 ratio. The final balanced dataset of 2,380 reports was split into train 1,428 / val 476 / test 476.
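The balancing and splitting arithmetic above can be sketched in a few lines. This is a minimal illustration, not the project's actual code: the function names, the placeholder report IDs, the seed, and the choice to upsample before splitting (following the order in which the steps are described) are all assumptions.

```python
import random

def upsample_to_balance(majority, minority, seed=0):
    """Resample the minority class with replacement until both classes match in size."""
    rng = random.Random(seed)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return majority + minority + extra

def split_60_20_20(items, seed=0):
    """Shuffle and split into train/val/test using integer arithmetic (60/20/20)."""
    rng = random.Random(seed)
    shuffled = items[:]
    rng.shuffle(shuffled)
    n_train = len(shuffled) * 6 // 10
    n_val = len(shuffled) * 2 // 10
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test

# Placeholder records (hypothetical IDs); label 1 = malignant, 0 = benign.
malignant = [("report_m%d" % i, 1) for i in range(1190)]
benign = [("report_b%d" % i, 0) for i in range(89)]

balanced = upsample_to_balance(malignant, benign)
train, val, test = split_60_20_20(balanced)
print(len(balanced), len(train), len(val), len(test))
```

With 1,190 malignant and 89 benign reports, the sketch yields 2,380 balanced records and a 1,428 / 476 / 476 split, matching the counts reported above.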
Methodology
Histopathology Classification
Three image backbones (EfficientNet-B0, ResNet-50, and ViT-Base/16) were benchmarked on PCam using a unified classification pipeline. Each backbone was instantiated from standard libraries and adapted to binary prediction with a single linear classification head. Decision thresholds for ResNet-50 (τ=0.22) and ViT-B/16 (τ=0.36) were selected on the validation set and then fixed for test-time reporting; EfficientNet-B0 was evaluated at a threshold of 0.50 without tuning.
PCam Test Performance of Image Classifiers
| Model | Threshold | Accuracy | F1 | ROC-AUC |
|---|---|---|---|---|
| EfficientNet-B0 | 0.50 | 0.8760 | 0.8627 | 0.9508 |
| ResNet-50 | 0.22 | 0.8871 | 0.8818 | 0.9500 |
| ViT-Base/16 | 0.36 | 0.8898 | 0.8852 | 0.9601 |
ViT-Base/16 scores highest on all three metrics. ResNet-50 is a close second on accuracy and F1, although its ROC-AUC (0.9500) is marginally below that of EfficientNet-B0 (0.9508), which remains a strong lightweight baseline even without threshold tuning. These fixed numbers serve as image baselines that are held unchanged in subsequent fusion and caption-aware experiments.
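Validation-set threshold selection of the kind described above can be sketched as a simple grid search. The selection criterion is not stated here, so maximizing F1 is an assumption, as are the function names, the 0.01 grid step, and the toy probabilities.

```python
def f1_at_threshold(probs, labels, tau):
    """F1 for the positive class when predicting 1 whenever prob >= tau."""
    tp = fp = fn = 0
    for p, y in zip(probs, labels):
        pred = 1 if p >= tau else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 1:
            fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def select_threshold(val_probs, val_labels, grid=None):
    """Return the grid threshold with the best validation F1 (first one wins on ties)."""
    if grid is None:
        grid = [i / 100 for i in range(1, 100)]
    return max(grid, key=lambda tau: f1_at_threshold(val_probs, val_labels, tau))

# Toy validation outputs (made up for illustration only).
val_probs = [0.05, 0.10, 0.30, 0.40, 0.60, 0.90]
val_labels = [0, 0, 1, 1, 1, 1]

tau = select_threshold(val_probs, val_labels)
print(tau, f1_at_threshold(val_probs, val_labels, tau))
```

Once chosen, the threshold is frozen and applied unchanged to the test split, which is what keeps the test metrics free of tuning on test data.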
Clinical Note Classification — ClinicalBERT Fine-Tuning
The balanced clinical notes are the input to a ClinicalBERT model for benign vs. malignant classification. We first evaluate the pretrained model as a baseline, then sequentially apply Supervised Fine-Tuning (SFT) and Task-Adaptive Pretraining (TAPT) to improve domain adaptation, followed by LoRA for parameter-efficient fine-tuning.
ClinicalBERT Sequential Training Pipeline (Fig 3)
Processed Clinical Notes
2,380 BRCA pathology reports
ClinicalBERT Classifier
Pretrained base — evaluation
Supervised Fine-Tuning (SFT)
Standard labeled training
Task-Adaptive Pretraining (TAPT)
Domain text adaptation
Low-Rank Adaptation (LoRA)
r ∈ {4,8,16} · α ∈ {8,16,32,64}
Predictions & Metrics
Accuracy · Precision · Recall · F1 · ROC-AUC
The best LoRA configuration (rank r=4, α=8, learning rate 2e-3, 10 epochs) achieved the highest stable validation F1-score (0.94 ± 0.05) and was selected for clinical note classification in the multimodal fusion experiments. Experiments were run in two environments: the SJSU GPU Lab (RTX 5090) and Google Colab (A100 GPU).
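To give a sense of what the rank and α settings mean, the sketch below works through standard LoRA arithmetic: the low-rank update is scaled by α/r, and adapting a d_out × d_in weight matrix adds r·(d_in + d_out) trainable parameters. The hidden size of 768 (BERT-base) and the choice of a single square projection matrix are assumptions for illustration; which layers were actually adapted is not stated here.

```python
def lora_scaling(rank, alpha):
    """Scale factor applied to the low-rank update B @ A in standard LoRA."""
    return alpha / rank

def lora_trainable_params(d_out, d_in, rank):
    """Parameters in A (rank x d_in) plus B (d_out x rank)."""
    return rank * (d_in + d_out)

HIDDEN = 768  # assumed BERT-base hidden size, not confirmed by the source
full_params = HIDDEN * HIDDEN  # one dense projection matrix

for rank in (4, 8, 16):  # the rank grid searched above
    added = lora_trainable_params(HIDDEN, HIDDEN, rank)
    print(rank, added, round(100 * added / full_params, 2))

print(lora_scaling(4, 8))  # scale for the selected r=4, alpha=8 setting
```

Under these assumptions, the selected r=4 setting trains 6,144 parameters per adapted matrix instead of 589,824, with the update scaled by a factor of 2.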
Caption Generation — LiquidAI RAG + Prompting Strategies
Caption generation addresses limited caption supervision by first comparing three zero-shot LLMs (GPT-4o, GPT-3.5-turbo, Claude 3.5 Sonnet) on 300 stratified notes. A fourth model (DeepSeek) served as an independent judge and selected GPT-4o as the best performer; its captions were adopted as ground truth for the LiquidAI RAG pipeline. The RAG model was then fine-tuned with LoRA (rank r, scale α), with the output token limit and input context length tuned alongside. Four prompting strategies were evaluated within the RAG setup under identical hyperparameters: Zero-Shot, Self-Reflection, Self-Ask, and Tree-of-Thought.
Table XI — Best Performance Comparison Across Prompting Techniques (LoRA-RAG)
| Technique | ROUGE-1 | ROUGE-2 | ROUGE-L | BLEU | BERTScore-F1 |
|---|---|---|---|---|---|
| Zero-Shot | 0.363 | 0.180 | 0.276 | 0.106 | 0.843 |
| Self-Reflection | 0.247 | 0.100 | 0.187 | 0.060 | 0.833 |
| Self-Ask | 0.227 | 0.105 | 0.178 | 0.063 | 0.635 |
| Tree-of-Thought | 0.409 | 0.214 | 0.320 | 0.130 | 0.862 |
Tree-of-Thought (ToT) achieved the highest score on every reported metric (R1=0.409, RL=0.320, BF1=0.862) under the LoRA-RAG setup (r=16, α=64, LR=7e-6, max_len=2048), which suggests that structured reasoning improves lexical and semantic agreement with the reference captions in medical captioning.
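For readers unfamiliar with the ROUGE-L column, the sketch below computes a sentence-level ROUGE-L F1 from the longest common subsequence (LCS) of tokens. It is a simplified illustration that assumes whitespace tokenization, lowercasing, and an unweighted F1; the scoring library actually used for the table is not stated, and packaged implementations may differ in stemming and weighting. The two example captions are made up.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, 1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_f1(reference, candidate):
    """Unweighted F1 of LCS-based precision and recall."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

# Hypothetical reference and generated captions.
reference = "invasive ductal carcinoma of the left breast"
candidate = "invasive carcinoma in the left breast"
score = rouge_l_f1(reference, candidate)
print(round(score, 3))
```

Here the LCS is five tokens ("invasive carcinoma the left breast"), giving precision 5/6, recall 5/7, and an F1 of 10/13. Because the measure rewards in-order token overlap, it captures lexical agreement with the reference rather than clinical correctness.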
Conclusion
Together, the PCam image baselines, ClinicalBERT fine-tuning pipeline, BLIP-2 caption analysis, and LoRA-RAG captioning system establish a unified data and model foundation for multimodal cancer pathology classification. The framework provides an evidence-based approach to short, clinically grounded summaries while remaining parameter-efficient and extensible to future fusion architectures.
Tech Stack