Benchmarking Real-World Medical Image Classification with Noisy Labels: Challenges, Practice, and Outlook

arXiv ID: 2512.09315v1

Published: 2025-12-10

Authors: Yuan Ma, Junlin Hou, Chao Zhang, Yukun Zhou, Zongyuan Ge, Haoran Xie, Lie Ju

Categories: cs.CV

Relevance Score: 0.98 / 1.00

Summary

This paper introduces LNMBench, a comprehensive benchmark to systematically assess the robustness of learning with noisy labels (LNL) methods in medical image analysis. It evaluates 10 representative LNL methods across diverse medical datasets, imaging modalities, and noise patterns, revealing significant performance degradation under high and real-world noise. The study highlights persistent challenges like class imbalance and domain variability, and proposes an effective improvement strategy for enhanced model robustness.

Medical Relevance

Noisy labels are an inherent problem in medical imaging due to the complexity, subjectivity, and varying expertise in diagnoses, leading to inconsistent expert annotations. This benchmark directly addresses this clinical reality, ensuring that AI models for medical image analysis are robust and reliable even with imperfect real-world data, which is crucial for their safe and effective integration into clinical practice.

AI Health Application

This research is crucial for developing robust and reliable AI systems for medical diagnosis and image analysis. By benchmarking and improving algorithms that can handle noisy labels, it directly contributes to building AI tools that are more accurate and trustworthy in real-world clinical settings, thereby enhancing decision support for healthcare professionals and potentially improving patient outcomes.

Key Points

Identifies the critical challenge of noisy labels in medical image analysis, driven by expert knowledge demands, inter-observer variability, and inconsistent annotations.
Introduces LNMBench, a unified and reproducible benchmark designed for systematic evaluation of LNL method robustness in medical imaging.
LNMBench encompasses a broad scope, evaluating 10 representative LNL methods across 7 datasets, 6 imaging modalities, and 3 realistic noise patterns.
Demonstrates that existing LNL methods experience substantial performance degradation under high and real-world noisy label conditions.
Highlights class imbalance and domain variability as persistent and critical challenges severely impacting LNL method effectiveness in medical datasets.
Proposes a simple yet effective improvement strategy specifically designed to enhance model robustness in the presence of high-noise, real-world, and imbalanced medical data.
The LNMBench codebase is publicly released to facilitate standardized evaluation, promote reproducible research, and provide practical insights for future algorithm development.

Methodology

LNMBench is a comprehensive benchmark framework. It systematically evaluates 10 representative learning with noisy labels (LNL) methods on 7 distinct medical datasets, covering 6 different imaging modalities (e.g., CT, MRI, X-ray, histopathology). The evaluation simulates realistic conditions by employing 3 noise patterns. The framework assesses performance degradation under high and real-world noise, specifically analyzing the impact of class imbalance and domain variability. A simple yet effective improvement strategy is also proposed and tested within this benchmarking framework.
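To make the idea of a simulated noise pattern concrete, the following is a minimal sketch of symmetric (uniform) label noise, a pattern commonly used in the LNL literature. The summary does not name the three noise patterns used by LNMBench, so the choice of symmetric noise is an assumption, and the function names `inject_symmetric_noise` and `observed_noise_rate` are hypothetical and not taken from the LNMBench codebase.

```python
import random
from collections import Counter


def inject_symmetric_noise(labels, num_classes, noise_rate, seed=0):
    """Return a copy of `labels` where each label is flipped, with
    probability `noise_rate`, to a different class chosen uniformly.

    Illustrative only: this is an assumed noise model, not necessarily
    one of the patterns used in the benchmark.
    """
    if not 0.0 <= noise_rate <= 1.0:
        raise ValueError("noise_rate must be in [0, 1]")
    if num_classes < 2:
        raise ValueError("num_classes must be at least 2")
    rng = random.Random(seed)  # local RNG so results are reproducible
    noisy = []
    for y in labels:
        if rng.random() < noise_rate:
            # Choose uniformly among all classes except the true one.
            candidates = [c for c in range(num_classes) if c != y]
            noisy.append(rng.choice(candidates))
        else:
            noisy.append(y)
    return noisy


def observed_noise_rate(clean, noisy):
    """Fraction of positions where the noisy label differs from the clean one."""
    return sum(c != n for c, n in zip(clean, noisy)) / len(clean)


if __name__ == "__main__":
    # Toy, heavily imbalanced label set (hypothetical class sizes).
    clean = [0] * 900 + [1] * 90 + [2] * 10
    noisy = inject_symmetric_noise(clean, num_classes=3, noise_rate=0.4, seed=0)
    print("observed noise rate:", observed_noise_rate(clean, noisy))
    print("clean class counts:", Counter(clean))
    print("noisy class counts:", Counter(noisy))
```

On an imbalanced label set like the toy one above, labels flipped away from the majority class can outnumber the true minority-class samples. This is one way in which class imbalance and label noise interact.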

Key Findings

Existing LNL methods demonstrate substantial performance degradation when applied to medical imaging datasets under high and real-world noise conditions. This performance drop is particularly exacerbated by persistent challenges such as prevalent class imbalance and significant domain variability inherent in medical data, indicating that current methods lack sufficient robustness for practical deployment in complex clinical settings.

Clinical Impact

This research provides a critical foundation for developing more trustworthy and robust AI systems in medical diagnosis and analysis. By quantifying the limitations of current LNL methods and offering a standardized evaluation framework, it enables the creation of algorithms that can reliably perform despite the unavoidable noise and inconsistencies in real-world clinical data, potentially leading to more accurate diagnostic support, reduced errors, and safer patient care. The proposed improvement also offers a practical step towards achieving this goal.

Limitations

The abstract does not state limitations of LNMBench itself. The limitations it reports concern existing LNL methods: their performance degrades substantially under high and real-world noise, and class imbalance and domain variability in medical data remain persistent obstacles to their effectiveness and robustness.

Future Directions

Future research should focus on developing novel noise-resilient algorithms specifically designed to overcome the challenges of high and real-world noise, class imbalance, and domain variability in medical imaging. The LNMBench codebase is intended to serve as a foundational tool for standardized evaluation, promoting reproducible research and facilitating the iterative development and validation of improved LNL methods for clinical applications.

Medical Domains

Medical Image Analysis, Radiology, Pathology, Diagnostic Imaging

Keywords

noisy labels, medical image classification, benchmarking, deep learning, robustness, inter-observer variability, class imbalance, domain variability

Abstract

Learning from noisy labels remains a major challenge in medical image analysis, where annotation demands expert knowledge and substantial inter-observer variability often leads to inconsistent or erroneous labels. Despite extensive research on learning with noisy labels (LNL), the robustness of existing methods in medical imaging has not been systematically assessed. To address this gap, we introduce LNMBench, a comprehensive benchmark for Label Noise in Medical imaging. LNMBench encompasses 10 representative methods evaluated across 7 datasets, 6 imaging modalities, and 3 noise patterns, establishing a unified and reproducible framework for robustness evaluation under realistic conditions. Comprehensive experiments reveal that the performance of existing LNL methods degrades substantially under high and real-world noise, highlighting the persistent challenges of class imbalance and domain variability in medical data. Motivated by these findings, we further propose a simple yet effective improvement to enhance model robustness under such conditions. The LNMBench codebase is publicly released to facilitate standardized evaluation, promote reproducible research, and provide practical insights for developing noise-resilient algorithms in both research and real-world medical applications. The codebase is publicly available at https://github.com/myyy777/LNMBench.