Generative AI · SJSU · ADS · 2025

A Multimodal Generative AI Framework for Cancer Pathology Classification

Mayank Kapadia, Basanth Periyapatna Roopa Kumar, Nischitha Nagendran

Department of Applied Data Science · San Jose State University

Abstract

We propose a multimodal generative AI framework for cancer pathology that combines three capabilities: (i) histopathology image classification, (ii) clinical note classification, and (iii) prompt-driven clinical captioning using retrieval-augmented generation (RAG). The approach establishes accurate vision baselines on PatchCamelyon (PCam) and builds a complementary text pipeline by curating and labeling TCGA BRCA clinical notes via two independent annotation routes to ensure reliable supervision.

We assess caption utility by generating image-based descriptions and training a lightweight caption-to-label classifier that measures the downstream value of RAG. Together, these components establish a unified data and model foundation, define interfaces for combining image and text representations, and provide an evidence-based approach to short, clinically grounded summaries.

The main objective is to provide a practical, auditable procedure that improves diagnostic support by aligning visual findings with structured language, while still allowing parameter-efficient fine-tuning and eventual integration into clinical decision tools.

ViT-Base/16 ROC-AUC: 0.9601

ViT-Base/16 F1: 0.8852

ClinicalBERT LoRA F1: 0.94 ± 0.05

PCam Training Images: 262,144

BRCA Clinical Notes: 2,380

Tree-of-Thought ROUGE-L: 0.320

System Architecture

System Architecture — Multimodal Generative AI Framework

Branch 1: Histopathology Images

PatchCamelyon (PCam): 262,144 train · 32,768 val · 32,768 test

Tensor Conversion & Normalization: standard preprocessing, no augmentation

EfficientNet-B0: Acc 0.8760 · ROC-AUC 0.9508

ResNet-50: Acc 0.8871 · ROC-AUC 0.9500

ViT-Base/16: Acc 0.8898 · ROC-AUC 0.9601

BLIP-2 (FLAN-T5-XL) Captioning: caption quality via CLIP similarity

Caption → Label Classifier (FLAN-T5): caption signal analysis

Branch 2: Clinical Notes (TCGA BRCA)

TCGA Pathology Reports (~48K): filter to BRCA cohort → 1,105 reports

Annotation Approach 1: PyTesseract OCR, keyword + negation rules

Annotation Approach 2: TCGA metadata, morphological codes

Class Imbalance (1,190 malignant vs 89 benign): upsample benign → 1:1 ratio

Final Balanced Dataset: 2,380 reports · Train 1,428 / Val 476 / Test 476

ClinicalBERT Classifier: pretrained on clinical notes

SFT → TAPT → LoRA Fine-Tuning: best at rank r=4 · α=8 · LR 2e-3 · F1=0.94

Unified: RAG Caption Generation

GPT-4o Ground Truth Captions: selected via DeepSeek-as-Judge from 300 stratified notes

LiquidAI RAG Pipeline: fine-tuned with LoRA · 4 prompting strategies (Zero-Shot, Self-Reflection, Self-Ask, Tree-of-Thought)

Datasets

Histopathology Images — PatchCamelyon (PCam)

Experiments use the PatchCamelyon (PCam) histopathology patch dataset via the torchvision PCAM interface. Each example is a color image patch paired with a binary label (benign vs. malignant). Official splits are preserved: 262,144 training, 32,768 validation, and 32,768 testing images. No relabeling, rebalancing, or re-splitting was performed. All reported metrics are computed on the fixed splits for strict reproducibility.

Clinical Notes — TCGA BRCA

Clinical notes were obtained from The Cancer Genome Atlas (TCGA) portal, focusing on the Breast Invasive Carcinoma (BRCA) cohort. From roughly 48K reports covering various cancer types, 1,105 BRCA-specific pathology reports were extracted. Two annotation approaches were used: Approach 1 (PyTesseract OCR + keyword/negation rules) and Approach 2 (TCGA metadata morphological codes). The dataset exhibited considerable class imbalance (1,190 malignant vs. 89 benign), which was addressed by randomly upsampling the benign class to a 1:1 ratio. The final balanced dataset of 2,380 reports was split into train 1,428 / val 476 / test 476.
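The balancing and splitting arithmetic above can be reproduced with a minimal sketch. It assumes upsampling with replacement followed by a plain shuffled 60/20/20 split, which matches the reported counts (1,190 + 1,190 = 2,380; 1,428 / 476 / 476) but is not confirmed as the authors' exact procedure. The function names, the `label` field, and the seed are hypothetical.

```python
import random


def upsample_to_balance(records, label_key="label", seed=0):
    """Resample smaller classes with replacement until each matches the largest class."""
    rng = random.Random(seed)
    by_label = {}
    for rec in records:
        by_label.setdefault(rec[label_key], []).append(rec)
    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        # Draw the shortfall from the same class (k is 0 for the largest class).
        balanced.extend(rng.choices(group, k=target - len(group)))
    rng.shuffle(balanced)
    return balanced


def split_60_20_20(records, seed=0):
    """Shuffle and cut into train/val/test using integer arithmetic (assumed ratios)."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_train = len(shuffled) * 3 // 5
    n_val = len(shuffled) // 5
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


# Placeholder records with the class counts reported in the text.
notes = [{"id": i, "label": "malignant"} for i in range(1190)]
notes += [{"id": 1190 + i, "label": "benign"} for i in range(89)]

balanced = upsample_to_balance(notes)
train, val, test = split_60_20_20(balanced)
print(len(balanced), len(train), len(val), len(test))
```

Because the upsampling here happens before the split, copies of the same benign report can land in more than one partition; splitting first and upsampling only the training partition is a common alternative when that overlap matters.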

Methodology

Histopathology Classification

Three modern image backbones were benchmarked on PCam using a unified classification pipeline. Each backbone was instantiated from standard libraries and adapted to binary prediction with a single linear classification head. Decision thresholds were selected on the validation set (ResNet-50 τ=0.22, ViT-B/16 τ=0.36) and then fixed for test-time reporting; EfficientNet-B0 was evaluated at the default τ=0.50.
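Validation-set threshold selection can be sketched as a simple grid search. The selection criterion is not stated in the text, so maximizing F1 is an assumption here, as are the function names, the 0.01 grid step, and the toy probabilities.

```python
def f1_at_threshold(probs, labels, threshold):
    """F1 for the positive class when predicting 1 wherever prob >= threshold."""
    tp = fp = fn = 0
    for p, y in zip(probs, labels):
        pred = 1 if p >= threshold else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 1:
            fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def select_threshold(val_probs, val_labels, grid=None):
    """Return the first grid threshold with the highest validation F1 (assumed criterion)."""
    if grid is None:
        grid = [i / 100 for i in range(1, 100)]
    return max(grid, key=lambda t: f1_at_threshold(val_probs, val_labels, t))


# Toy validation scores: positives are scored low, so 0.50 misses two of them.
val_probs = [0.10, 0.20, 0.30, 0.40, 0.60, 0.90]
val_labels = [0, 0, 1, 1, 1, 1]

tau = select_threshold(val_probs, val_labels)
print(tau, f1_at_threshold(val_probs, val_labels, tau))
print(0.5, f1_at_threshold(val_probs, val_labels, 0.5))
```

The chosen threshold is then frozen and applied unchanged to the test split, which is what keeps the test numbers free of threshold tuning on test data.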

PCam Test Performance of Image Classifiers

Model	Threshold	Accuracy	F1	ROC-AUC
EfficientNet-B0	0.50	0.8760	0.8627	0.9508
ResNet-50	0.22	0.8871	0.8818	0.9500
ViT-Base/16	0.36	0.8898	0.8852	0.9601

ViT-Base/16 achieves the highest accuracy, F1, and ROC-AUC; ResNet-50 is a close second on accuracy and F1; EfficientNet-B0 remains a strong lightweight baseline even without threshold tuning. These numbers serve as fixed image baselines and are held unchanged in subsequent fusion and caption-aware experiments.
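ROC-AUC is independent of the decision threshold, which is why the models can be compared on it even though they use different values of τ. A minimal sketch of the pairwise definition follows: the probability that a randomly chosen positive scores above a randomly chosen negative. The function name and toy scores are illustrative, and the quadratic loop is for clarity only, not for data of PCam's size.

```python
def roc_auc(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


scores = [0.10, 0.40, 0.35, 0.80]
labels = [0, 0, 1, 1]
auc = roc_auc(scores, labels)

# Any strictly increasing rescaling of the scores leaves the ranking, and so the AUC, unchanged.
auc_rescaled = roc_auc([s ** 2 for s in scores], labels)
print(auc, auc_rescaled)
```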

Clinical Note Classification — ClinicalBERT Fine-Tuning

The balanced clinical notes are used as input to a ClinicalBERT model for benign vs. malignant classification. We evaluate the pretrained model as a baseline, then sequentially apply Supervised Fine-Tuning (SFT) and Task-Adaptive Pretraining (TAPT) to improve domain adaptation, followed by Low-Rank Adaptation (LoRA) for parameter-efficient fine-tuning.

ClinicalBERT Sequential Training Pipeline (Fig 3)

Processed Clinical Notes: 2,380 BRCA pathology reports

ClinicalBERT Classifier: pretrained base, evaluated as baseline

Supervised Fine-Tuning (SFT): standard labeled training

Task-Adaptive Pretraining (TAPT): domain text adaptation

Low-Rank Adaptation (LoRA): r ∈ {4,8,16} · α ∈ {8,16,32,64}

Predictions & Metrics: Accuracy · Precision · Recall · F1 · ROC-AUC

The best LoRA configuration (rank r=4, α=8, learning rate 2e-3, 10 epochs) achieved the highest stable validation F1-score (0.94 ± 0.05) and was chosen for clinical note classification in the multimodal fusion experiments. The experiments were conducted in two environments: SJSU GPU Lab (RTX 5090) and Google Colab (A100 GPU).
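To illustrate why small ranks keep LoRA cheap, the sketch below enumerates the searched grid and counts the trainable parameters one adapter adds to a single square weight matrix. It uses the standard LoRA formulation, in which a frozen weight W receives an update (α/r)·BA with A of shape r × d_in and B of shape d_out × r. The hidden size of 768 is an assumption (BERT-base dimensions); which layers were adapted is not stated in the text, so only a single matrix is counted.

```python
from itertools import product


def lora_extra_params(d_in, d_out, rank):
    """Trainable parameters added by one adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


def lora_scaling(alpha, rank):
    """Multiplier applied to the low-rank update B @ A in the standard formulation."""
    return alpha / rank


RANKS = (4, 8, 16)
ALPHAS = (8, 16, 32, 64)
GRID = list(product(RANKS, ALPHAS))  # every (rank, alpha) pair in the search space

HIDDEN = 768  # assumption: BERT-base hidden size, not stated in the text

for rank, alpha in GRID:
    extra = lora_extra_params(HIDDEN, HIDDEN, rank)
    share = extra / (HIDDEN * HIDDEN)
    print(f"r={rank:2d} alpha={alpha:2d} scale={lora_scaling(alpha, rank):5.2f} "
          f"extra={extra:6d} ({share:.2%} of the frozen matrix)")
```

For the selected configuration (r=4, α=8), the update is scaled by 2 and each adapted 768 × 768 matrix gains 6,144 trainable parameters against 589,824 frozen ones.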

Caption Generation — LiquidAI RAG + Prompting Strategies

Caption generation addresses limited caption supervision by first comparing three zero-shot LLMs (GPT-4o, GPT-3.5-turbo, Claude 3.5 Sonnet) on 300 stratified notes. A fourth model (DeepSeek) served as an independent judge and selected GPT-4o as the best performer; its captions were adopted as ground truth for the LiquidAI RAG pipeline. The RAG pipeline was then fine-tuned with LoRA (rank r, scaling factor α), with the output token limit and input context length tuned alongside. Four prompting strategies were evaluated within the RAG setup under identical hyperparameters: Zero-Shot, Self-Reflection, Self-Ask, and Tree-of-Thought.
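The general shape of a retrieve-then-prompt step can be sketched as follows. The text does not describe the retriever or the prompt wording, so everything here is hypothetical: a bag-of-words cosine retriever stands in for whatever the pipeline uses, and the four instruction strings are illustrative paraphrases of each strategy rather than the prompts from the experiments.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query, corpus, k=2):
    """Return the k corpus passages most similar to the query."""
    q = Counter(tokenize(query))
    ranked = sorted(corpus, key=lambda doc: cosine(q, Counter(tokenize(doc))), reverse=True)
    return ranked[:k]


# Hypothetical instructions, one per prompting strategy named in the text.
PROMPT_STYLES = {
    "zero_shot": "Summarize the pathology findings in two sentences.",
    "self_reflection": "Draft a summary, check it against the context, then give a corrected summary.",
    "self_ask": "List the sub-questions needed, answer each from the context, then summarize.",
    "tree_of_thought": "Propose three candidate summaries, evaluate each against the context, and return the best.",
}


def build_prompt(note, corpus, style, k=2):
    """Assemble retrieved context, the note, and a strategy-specific instruction."""
    context = "\n".join(retrieve(note, corpus, k))
    return f"Context:\n{context}\n\nNote:\n{note}\n\nInstruction: {PROMPT_STYLES[style]}"


corpus = [
    "Invasive ductal carcinoma, grade 2, margins negative.",
    "Benign fibroadenoma with no atypia.",
    "Lymph node biopsy shows metastatic carcinoma.",
]
note = "ductal carcinoma grade 3"
print(build_prompt(note, corpus, "tree_of_thought", k=1))
```

Holding the retrieved context fixed and swapping only the instruction is one way to compare strategies under identical hyperparameters, as the text describes.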

Table XI — Best Performance Comparison Across Prompting Techniques (LoRA-RAG)

Technique	ROUGE-1	ROUGE-2	ROUGE-L	BLEU	BERTScore-F1
Zero-Shot	0.363	0.180	0.276	0.106	0.843
Self-Reflection	0.247	0.100	0.187	0.060	0.833
Self-Ask	0.227	0.105	0.178	0.063	0.635
Tree-of-Thought	0.409	0.214	0.320	0.130	0.862

Tree-of-Thought (ToT) achieved the highest scores on all five metrics (R1=0.409, RL=0.320, BF1=0.862) under the LoRA-RAG setup (r=16, α=64, LR=7e-6, max_len=2048). Because these metrics measure lexical and semantic overlap with the GPT-4o reference captions, the result suggests that structured reasoning improves alignment with the reference summaries in medical captioning.
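For readers unfamiliar with ROUGE-L, it is the F-measure of the longest common subsequence (LCS) between the candidate and reference token sequences. A minimal sketch follows, assuming whitespace tokenization, lowercasing, and no stemming; scores from the evaluation library used in the experiments may differ slightly. The example captions are invented for illustration.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists (row-by-row DP)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L F1: harmonic mean of LCS-based precision and recall."""
    c = candidate.lower().split()
    r = reference.lower().split()
    if not c or not r:
        return 0.0
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision = lcs / len(c)
    recall = lcs / len(r)
    return 2 * precision * recall / (precision + recall)


candidate = "invasive ductal carcinoma grade 2"
reference = "invasive ductal carcinoma of the breast grade 2"
score = rouge_l_f1(candidate, reference)
print(round(score, 3))
```

Here all five candidate tokens appear in order in the eight-token reference, so precision is 1, recall is 5/8, and the F1 is 10/13.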

Conclusion

Together, the PCam image baselines, ClinicalBERT fine-tuning pipeline, BLIP-2 caption analysis, and LoRA-RAG captioning system establish a uniform data and model foundation for multimodal cancer pathology classification. The framework provides evidence-based approaches to short, clinically grounded summaries while remaining parameter-efficient and extensible to future fusion architectures.

Tech Stack

PyTorch v2.9.0 · Transformers v4.57.1 · NumPy · Pandas · scikit-learn · Datasets · PEFT · BLIP-2 · ClinicalBERT · LiquidAI RAG · GPT-4o · DeepSeek
