From SAM to DINOv2: Towards Distilling Foundation Models to Lightweight Baselines for Generalized Polyp Segmentation

Summary

The paper introduces Polyp-DiFoM, a distillation framework that transfers the rich representations of large vision foundation models (such as SAM and DINOv2) into lightweight segmentation baselines (U-Net, U-Net++) for generalized polyp segmentation. Across five benchmark datasets, the authors report that it significantly outperforms the respective baselines and also outperforms the state-of-the-art model, with nearly 9 times lower computational overhead.

Medical Relevance

Accurate and early detection of polyps during colonoscopy is crucial for preventing colorectal cancer, one of the leading causes of cancer-related deaths. This research targets more reliable and efficient automated polyp segmentation, which could strengthen diagnostic capabilities and support better patient outcomes through precise and timely intervention.

AI Health Application

AI-assisted detection and segmentation of polyps in colonoscopy images/videos, aiding clinicians in the early diagnosis and screening of colorectal cancer, thereby improving diagnostic accuracy and efficiency in clinical settings.

Key Points

  • Addresses the challenge of accurate polyp segmentation, which is difficult because polyps vary widely in size, shape, and color and are often camouflaged. Lightweight models struggle with these issues, while foundation models are hard to transfer directly to medical data.
  • Proposes Polyp-DiFoM, a novel knowledge distillation framework, to leverage the impressive generalization capabilities of large-scale vision foundation models.
  • Polyp-DiFoM transfers semantic priors and rich representations from foundation models (SAM, DINOv2, OneFormer, Mask2Former) into more efficient, canonical architectures like U-Net and U-Net++.
  • Enhances the distillation process by incorporating frequency domain encoding for improved transfer of knowledge and generalization capability.
  • Extensively evaluated across five diverse benchmark datasets: Kvasir-SEG, CVC-ClinicDB, ETIS, ColonDB, and CVC-300, demonstrating broad applicability.
  • Consistently and significantly outperforms the respective baseline models, and also outperforms the state-of-the-art model, with nearly 9 times lower computational overhead.
  • Facilitates the efficient and accurate deployment of advanced segmentation models in real-world clinical settings, addressing computational constraints and data scarcity in medical imaging.

Methodology

The authors propose Polyp-DiFoM, a knowledge distillation framework. This involves using large-scale vision foundation models (e.g., SAM, DINOv2, OneFormer, Mask2Former) as 'teachers' to infuse 'semantic priors' and 'rich representations' into lightweight 'student' segmentation baselines (e.g., U-Net, U-Net++). The distillation process is further enhanced by incorporating frequency domain encoding. The framework was extensively evaluated on five benchmark datasets: Kvasir-SEG, CVC-ClinicDB, ETIS, ColonDB, and CVC-300.
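The abstract does not specify the distillation objective, so the following is only a minimal sketch of how feature distillation with a frequency-domain term might look. It assumes the student features have already been projected to the teacher's shape, uses a log-amplitude 2D FFT as the "frequency domain encoding", and combines the terms with hypothetical weights `alpha` and `beta`. None of these choices are confirmed by the paper; the function names are illustrative.

```python
import numpy as np


def frequency_encode(features: np.ndarray) -> np.ndarray:
    """Log-amplitude spectrum of (C, H, W) feature maps.

    Assumption: a 2D FFT over the spatial axes stands in for the
    paper's frequency domain encoding, whose exact form is not given.
    """
    spectrum = np.fft.fft2(features, axes=(-2, -1), norm="ortho")
    return np.log1p(np.abs(spectrum))


def distillation_loss(
    student_feat: np.ndarray,
    teacher_feat: np.ndarray,
    seg_loss: float,
    alpha: float = 0.5,  # hypothetical weight of the spatial term
    beta: float = 0.5,   # hypothetical weight of the spectral term
) -> float:
    """Segmentation loss plus spatial and spectral feature-matching terms."""
    if student_feat.shape != teacher_feat.shape:
        raise ValueError("student features must be projected to the teacher's shape")
    # Mean squared error between raw feature maps.
    spatial = np.mean((student_feat - teacher_feat) ** 2)
    # Mean squared error between frequency encodings.
    spectral = np.mean(
        (frequency_encode(student_feat) - frequency_encode(teacher_feat)) ** 2
    )
    return float(seg_loss + alpha * spatial + beta * spectral)


# Toy usage with random arrays standing in for teacher and student features.
rng = np.random.default_rng(0)
teacher = rng.normal(size=(4, 8, 8))
student = teacher + 0.1 * rng.normal(size=(4, 8, 8))
print(distillation_loss(student, teacher, seg_loss=0.3))
```

In this sketch the teacher is only needed while training. At inference time the student runs alone, which is where a reduction in computational overhead would come from.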

Key Findings

According to the abstract, Polyp-DiFoM consistently and significantly outperforms the respective lightweight baselines (U-Net, U-Net++) and also outperforms the state-of-the-art model in experiments spanning five benchmark datasets. It does so with nearly 9 times lower computational overhead, which points to high efficiency and practical deployability.
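The abstract does not say which metrics underlie these comparisons. Polyp segmentation results are commonly reported as Dice and IoU overlap scores, so, as an assumption, a comparison between a baseline and a distilled model could be scored with a helper like the one below. The function name and the smoothing constant `eps` are illustrative.

```python
import numpy as np


def dice_and_iou(pred, target, eps: float = 1e-7) -> tuple[float, float]:
    """Dice and IoU between two binary masks of the same shape.

    `eps` avoids division by zero when both masks are empty.
    """
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    if pred.shape != target.shape:
        raise ValueError("masks must have the same shape")
    intersection = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    dice = (2 * intersection + eps) / (pred.sum() + target.sum() + eps)
    iou = (intersection + eps) / (union + eps)
    return float(dice), float(iou)


# Toy masks: the prediction covers the true pixel plus one false positive.
predicted_mask = [[1, 1], [0, 0]]
ground_truth = [[1, 0], [0, 0]]
print(dice_and_iou(predicted_mask, ground_truth))
```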

Clinical Impact

Polyp-DiFoM could improve the accuracy and efficiency of automated polyp detection during colonoscopy, supporting earlier and more reliable diagnosis of colorectal cancer and more effective treatment planning. Its reduced computational requirements make it a candidate for deployment in clinical environments with limited computational resources, which could broaden access to AI-driven tools for cancer screening.

Limitations

Not explicitly mentioned in the abstract.

Future Directions

Not explicitly mentioned in the abstract.

Medical Domains

Gastroenterology, Oncology, Diagnostic Imaging, Medical Artificial Intelligence

Keywords

Polyp segmentation, Colorectal cancer, Foundation models, Knowledge distillation, Deep learning, U-Net, SAM, DINOv2, Medical imaging

Abstract

Accurate polyp segmentation during colonoscopy is critical for the early detection of colorectal cancer, yet it remains challenging due to significant size, shape, and color variations and the camouflaged nature of polyps. While lightweight baseline models such as U-Net, U-Net++, and PraNet offer advantages in terms of easy deployment and low computational cost, they struggle to deal with the above issues, leading to limited segmentation performance. In contrast, large-scale vision foundation models such as SAM, DINOv2, OneFormer, and Mask2Former have exhibited impressive generalization performance across natural image domains. However, their direct transfer to medical imaging tasks (e.g., colonoscopic polyp segmentation) is not straightforward, primarily due to the scarcity of large-scale datasets and lack of domain-specific knowledge. To bridge this gap, we propose a novel distillation framework, Polyp-DiFoM, that transfers the rich representations of foundation models into lightweight segmentation baselines, allowing efficient and accurate deployment in clinical settings. In particular, we infuse semantic priors from the foundation models into canonical architectures such as U-Net and U-Net++ and further perform frequency domain encoding for enhanced distillation, corroborating their generalization capability. Extensive experiments are performed across five benchmark datasets: Kvasir-SEG, CVC-ClinicDB, ETIS, ColonDB, and CVC-300. Notably, Polyp-DiFoM consistently and significantly outperforms the respective baseline models, as well as the state-of-the-art model, with nearly 9 times lower computational overhead. The code is available at https://github.com/lostinrepo/PolypDiFoM.