Beyond CLIP: Knowledge-Enhanced Multimodal Transformers for Cross-Modal Alignment in Diabetic Retinopathy Diagnosis
Summary
This paper introduces a novel knowledge-enhanced multimodal transformer framework designed to overcome the limitations of general-domain vision-language models like CLIP in medical applications, specifically for diabetic retinopathy (DR) diagnosis. By integrating retinal images, clinical text, and structured patient data through specialized encoders and a joint transformer, the model achieves superior cross-modal alignment and state-of-the-art DR classification performance. The framework dramatically improves medical image-text retrieval accuracy and exhibits strong zero-shot generalization.
Medical Relevance
This research is relevant to medical AI because it advances accurate automated diagnosis of diabetic retinopathy, a leading cause of preventable blindness. It addresses the domain specificity that medical data requires by integrating diverse patient information, enabling more precise interpretation and retrieval of multimodal clinical records.
AI Health Application
This paper proposes an AI system (knowledge-enhanced multimodal transformer) designed to diagnose and grade the severity of diabetic retinopathy. It serves as an automated diagnostic tool for ophthalmology, capable of cross-modal retrieval (e.g., finding relevant images from text descriptions) and classification based on medical standards (ICDR, SDRG). This application aims to improve the accuracy and efficiency of diagnosing a significant medical condition.
Key Points
- Addresses the critical gap in medical image-text alignment, where general-domain vision-language models (VLMs) like CLIP perform poorly, particularly for ophthalmological cross-modal retrieval.
- Proposes a novel knowledge-enhanced joint embedding framework utilizing a multimodal transformer architecture to integrate retinal fundus images, clinical text, and structured patient data.
- Employs modality-specific encoders: a Vision Transformer (ViT-B/16) for images, Bio-ClinicalBERT for clinical narratives, and a multilayer perceptron for structured patient features.
- Trains the model using a comprehensive set of objectives including contrastive losses between modality pairs, reconstruction losses for images and text, and classification losses for DR severity grading (ICDR and SDRG schemes).
- Achieves near-perfect text-to-image retrieval performance (99.94% Recall@1 on BRSET) and strong zero-shot generalization (93.95% Recall@1 on DeepEyeNet), far above fine-tuned CLIP (1.29% and 0.22% Recall@1, respectively).
- Maintains state-of-the-art classification accuracy for DR severity grading (97.05% for SDRG and 97.97% for ICDR).
- Demonstrates that the multimodal training effectively captures complex cross-modal relationships unique to the medical domain, leading to robust diagnostic capabilities.
Methodology
The proposed framework integrates three modalities via a multimodal transformer architecture. Retinal fundus images are processed by a Vision Transformer (ViT-B/16), clinical narratives by Bio-ClinicalBERT, and structured demographic/clinical features by a multilayer perceptron. The three encoder outputs are then fused and jointly processed by a transformer that uses modality-specific embeddings. The model is trained using multiple objectives: contrastive losses to align representations across modality pairs, reconstruction losses for images and text, and classification losses to predict DR severity grades according to the ICDR and SDRG schemes. Performance was evaluated on the BRSET dataset, and generalization was assessed with zero-shot evaluation on DeepEyeNet.
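The abstract does not give the exact form of the losses, so the following is only a minimal sketch of how pairwise contrastive alignment across three modalities could be combined with the other objectives. It assumes a CLIP-style symmetric InfoNCE loss; the function names, the temperature of 0.07, and the loss weights are illustrative assumptions, not values from the paper.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Scale each row to unit length so dot products are cosine similarities.
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)

def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def symmetric_contrastive_loss(a, b, temperature=0.07):
    # Assumed CLIP-style InfoNCE: row i of `a` is paired with row i of `b`.
    a, b = l2_normalize(a), l2_normalize(b)
    logits = a @ b.T / temperature
    idx = np.arange(len(a))
    loss_ab = -log_softmax(logits, axis=1)[idx, idx].mean()  # a -> b direction
    loss_ba = -log_softmax(logits, axis=0)[idx, idx].mean()  # b -> a direction
    return 0.5 * (loss_ab + loss_ba)

def total_loss(img, txt, tab, recon_loss=0.0, cls_loss=0.0,
               weights=(1.0, 1.0, 1.0)):
    # Hypothetical weighting of the three objective families named in the text:
    # contrastive (all modality pairs), reconstruction, and classification.
    w_con, w_rec, w_cls = weights
    contrastive = (symmetric_contrastive_loss(img, txt)
                   + symmetric_contrastive_loss(img, tab)
                   + symmetric_contrastive_loss(txt, tab))
    return w_con * contrastive + w_rec * recon_loss + w_cls * cls_loss
```

Here `img`, `txt`, and `tab` stand for batches of image, text, and structured-data embeddings of equal dimension. The reconstruction and classification terms are passed in as precomputed scalars because the summary does not describe how they are computed.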
Key Findings
The framework achieved near-perfect text-to-image retrieval with a Recall@1 of 99.94% on the BRSET dataset, vastly outperforming fine-tuned CLIP (1.29%). It also demonstrated strong zero-shot generalizability on the unseen DeepEyeNet dataset, achieving 93.95% Recall@1 compared to 0.22% for fine-tuned CLIP. Concurrently, it maintained state-of-the-art classification accuracy for DR severity grading, with 97.05% for SDRG and 97.97% for ICDR schemes.
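For readers unfamiliar with the metric, Recall@1 is the fraction of text queries whose top-ranked image is the correct one. The sketch below shows one common way to compute Recall@K from paired embeddings. It assumes cosine similarity and exactly one correct image per text query (the one at the same row index); the paper's actual evaluation protocol may differ, and the function name is illustrative.

```python
import numpy as np

def recall_at_k(text_emb, image_emb, k=1):
    # Normalize rows so the dot product is cosine similarity.
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    v = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    sims = t @ v.T                              # (num_texts, num_images)
    topk = np.argsort(-sims, axis=1)[:, :k]     # indices of the k most similar images
    targets = np.arange(len(t))[:, None]        # assumed: text i pairs with image i
    # A query counts as a hit if its paired image appears among the top k.
    return float((topk == targets).any(axis=1).mean())
```

Under this definition, a Recall@1 of 99.94% means that nearly every text query ranks its paired image first, while 1.29% means the paired image is ranked first for only about one query in 78.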
Clinical Impact
This research could improve the diagnosis and management of diabetic retinopathy by providing accurate automated diagnostic tools. The cross-modal retrieval capabilities could help clinicians quickly find relevant patient cases or educational materials from multimodal queries, support personalized treatment planning, and streamline medical research by making data exploration more efficient.
Limitations
The abstract does not explicitly state any limitations of the proposed method or the study.
Future Directions
The abstract does not explicitly mention specific future research directions, but the demonstrated generalizability suggests potential for broader application across other medical imaging and text domains.
Abstract
Diabetic retinopathy (DR) is a leading cause of preventable blindness worldwide, demanding accurate automated diagnostic systems. While general-domain vision-language models like Contrastive Language-Image Pre-Training (CLIP) perform well on natural image tasks, they struggle in medical domain applications, particularly in cross-modal retrieval for ophthalmological images. We propose a novel knowledge-enhanced joint embedding framework that integrates retinal fundus images, clinical text, and structured patient data through a multimodal transformer architecture to address the critical gap in medical image-text alignment. Our approach employs separate encoders for each modality: a Vision Transformer (ViT-B/16) for retinal images, Bio-ClinicalBERT for clinical narratives, and a multilayer perceptron for structured demographic and clinical features. These modalities are fused through a joint transformer with modality-specific embeddings, trained using multiple objectives including contrastive losses between modality pairs, reconstruction losses for images and text, and classification losses for DR severity grading according to ICDR and SDRG schemes. Experimental results on the Brazilian Multilabel Ophthalmological Dataset (BRSET) demonstrate significant improvements over baseline models. Our framework achieves near-perfect text-to-image retrieval performance with Recall@1 of 99.94% compared to fine-tuned CLIP's 1.29%, while maintaining state-of-the-art classification accuracy of 97.05% for SDRG and 97.97% for ICDR. Furthermore, zero-shot evaluation on the unseen DeepEyeNet dataset validates strong generalizability with 93.95% Recall@1 versus 0.22% for fine-tuned CLIP. These results demonstrate that our multimodal training approach effectively captures cross-modal relationships in the medical domain, establishing both superior retrieval capabilities and robust diagnostic performance.
Comments
14 pages, 14 figures