NeuroABench: A Multimodal Evaluation Benchmark for Neurosurgical Anatomy Identification
Summary
This paper introduces NeuroABench, the first multimodal benchmark designed specifically to evaluate how well Multimodal Large Language Models (MLLMs) identify neurosurgical anatomy from video. The best of the state-of-the-art MLLMs tested reaches only 40.87% accuracy, which the authors describe as comparable to the lowest-scoring of four neurosurgical trainees and well below the trainees' 46.5% average.
Medical Relevance
Precise anatomical understanding is fundamental for safe and effective surgical practice. This research aims to advance AI capabilities in interpreting surgical videos to enhance surgical education, assist in pre-operative planning, and provide real-time intraoperative guidance, ultimately improving patient outcomes.
AI Health Application
The paper evaluates MLLMs on precise anatomical understanding in neurosurgical videos, with the aim of guiding their improvement. Models with this capability could support tools for surgical education and training (e.g., interactive learning platforms, automated assessment), real-time surgical assistance (e.g., anatomical landmark identification during surgery), and pre-operative planning and post-operative review.
Key Points
- Existing MLLMs for surgical video understanding primarily focus on procedures and workflows, neglecting critical anatomical comprehension required by surgeons.
- NeuroABench is introduced as the first multimodal benchmark for evaluating anatomical understanding in neurosurgical videos.
- The benchmark comprises 9 hours of annotated neurosurgical videos covering 89 distinct procedures and identifying 68 clinical anatomical structures.
- A novel multimodal annotation pipeline with multiple review cycles was employed to ensure the quality and accuracy of the NeuroABench dataset.
- Experiments on over 10 state-of-the-art MLLMs showed significant limitations, with the best model achieving only 40.87% accuracy in anatomical identification tasks.
- An informative test with four neurosurgical trainees revealed an average accuracy of 46.5%, with the best student at 56% and the lowest at 28%.
- The best MLLM performs comparably to the lowest-scoring student but lags significantly behind the group's average, showing that a substantial gap to human-level anatomical understanding remains.
Methodology
The study involved creating NeuroABench through a novel multimodal annotation pipeline with multiple review cycles, annotating 9 hours of neurosurgical video for 68 anatomical structures across 89 procedures. Model performance was evaluated by testing over 10 state-of-the-art MLLMs on anatomical identification tasks using this benchmark. Human performance was assessed by conducting an informative test on a subset of the data with four neurosurgical trainees for comparison.
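The accuracy figure used in the model evaluation can be illustrated with a minimal sketch. The summary does not describe the benchmark's exact scoring protocol, so this assumes a simple exact-match comparison between a predicted and an annotated structure name; the `Item` type, its field names, and the sample labels are hypothetical and are not taken from NeuroABench.

```python
from dataclasses import dataclass


@dataclass
class Item:
    """One evaluation item (hypothetical structure, not the benchmark's format)."""
    clip_id: str          # assumed identifier for a video clip or frame
    gold_label: str       # annotated anatomical structure
    predicted_label: str  # structure named by the model


def normalize(label: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences do not count as errors."""
    return " ".join(label.lower().split())


def accuracy(items: list[Item]) -> float:
    """Return the percentage of items whose prediction matches the annotation."""
    if not items:
        raise ValueError("cannot compute accuracy on an empty item list")
    correct = sum(
        normalize(item.gold_label) == normalize(item.predicted_label)
        for item in items
    )
    return 100.0 * correct / len(items)


# Illustrative data only; these labels are not drawn from the dataset.
sample = [
    Item("clip-1", "Optic nerve", "optic  nerve"),
    Item("clip-2", "Pituitary gland", "Optic chiasm"),
    Item("clip-3", "Internal carotid artery", "internal carotid artery"),
    Item("clip-4", "Basilar artery", "Vertebral artery"),
]
print(f"{accuracy(sample):.2f}%")
```

Exact matching is the strictest reasonable choice here; a real protocol might instead use multiple-choice answers or synonym lists, which would change the reported numbers.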
Key Findings
State-of-the-art MLLMs exhibit significant limitations in neurosurgical anatomical identification, with the best model achieving only 40.87% accuracy. This is below the 46.5% average accuracy of the four neurosurgical trainees tested on a subset of the data (range 28%-56%), indicating a substantial gap between current AI capabilities and human-level comprehension.
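The size of that gap can be worked out directly from the reported figures. The sketch below uses only the numbers given in the abstract; the variable and function names are illustrative. Because the trainees were tested on a subset of the dataset, the differences are indicative rather than a like-for-like comparison.

```python
# Accuracy figures (percent) as reported in the abstract.
best_mllm = 40.87
trainee_average = 46.5
trainee_best = 56.0
trainee_lowest = 28.0


def gap_points(human: float, model: float) -> float:
    """Difference in percentage points (positive means the human scored higher)."""
    return round(human - model, 2)


gaps = {
    "average trainee": gap_points(trainee_average, best_mllm),
    "best trainee": gap_points(trainee_best, best_mllm),
    "lowest trainee": gap_points(trainee_lowest, best_mllm),
}

# Shortfall of the best model relative to the trainee average, in percent.
relative_shortfall = round(100 * (trainee_average - best_mllm) / trainee_average, 1)

for name, gap in gaps.items():
    print(f"{name}: {gap:+.2f} points")
print(f"relative shortfall vs. average: {relative_shortfall}%")
```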
Clinical Impact
NeuroABench provides a crucial, standardized benchmark to drive the development of more accurate and robust MLLMs for neurosurgical applications. Improved anatomical identification by MLLMs can significantly enhance surgical education, offer advanced tools for surgical planning, and potentially provide real-time intraoperative assistance by highlighting critical structures, thereby contributing to increased surgical precision and patient safety.
Limitations
The main limitation identified lies in current MLLMs themselves: a significant performance gap separates them from human neurosurgical trainees in anatomical identification. The human comparison group was small (four trainees) and was tested only on a subset of the dataset, so it serves as an informative test rather than a large-scale validation. The benchmark is specific to neurosurgery, and its direct applicability to other surgical domains is not addressed.
Future Directions
Future research should focus on developing more sophisticated MLLM architectures and training methodologies to improve anatomical understanding, bridging the identified performance gap with human surgeons. Expansion of such benchmarks to other surgical specialties could also be a valuable next step, leading to broader applications in surgical AI.
Abstract
Multimodal Large Language Models (MLLMs) have shown significant potential in surgical video understanding. With improved zero-shot performance and more effective human-machine interaction, they provide a strong foundation for advancing surgical education and assistance. However, existing research and datasets primarily focus on understanding surgical procedures and workflows, while paying limited attention to the critical role of anatomical comprehension. In clinical practice, surgeons rely heavily on precise anatomical understanding to interpret, review, and learn from surgical videos. To fill this gap, we introduce the Neurosurgical Anatomy Benchmark (NeuroABench), the first multimodal benchmark explicitly created to evaluate anatomical understanding in the neurosurgical domain. NeuroABench consists of 9 hours of annotated neurosurgical videos covering 89 distinct procedures and is developed using a novel multimodal annotation pipeline with multiple review cycles. The benchmark evaluates the identification of 68 clinical anatomical structures, providing a rigorous and standardized framework for assessing model performance. Experiments on over 10 state-of-the-art MLLMs reveal significant limitations, with the best-performing model achieving only 40.87% accuracy in anatomical identification tasks. To further evaluate the benchmark, we extract a subset of the dataset and conduct an informative test with four neurosurgical trainees. The results show that the best-performing student achieves 56% accuracy, the lowest-scoring student achieves 28%, and the average score is 46.5%. While the best MLLM performs comparably to the lowest-scoring student, it still lags significantly behind the group's average performance. This comparison underscores both the progress of MLLMs in anatomical understanding and the substantial gap that remains in achieving human-level performance.
Comments
Accepted by IEEE ICIA 2025