Strong Reasoning Isn't Enough: Evaluating Evidence Elicitation in Interactive Diagnosis
Summary
This paper introduces an interactive evaluation framework for medical diagnostic agents that targets a gap in existing evaluations: they largely ignore how evidence is gathered. It proposes the Information Coverage Rate (ICR) metric and the EviMed benchmark to quantify and systematically study evidence elicitation. The core finding is that strong diagnostic reasoning does not by itself produce effective information collection, and that this shortfall is a primary performance bottleneck in interactive settings. The proposed REFINE strategy is designed to address it.
Medical Relevance
This research matters for AI-assisted clinical diagnosis because an automated agent must gather enough patient information, not just reason well over information it is handed. Better evidence elicitation could make diagnostic support systems more accurate and reliable in interactive consultations, which may in turn reduce misdiagnoses.
AI Health Application
The paper focuses on developing and evaluating AI agents (likely large language models or similar) to conduct effective interactive medical consultations. The goal is to improve the AI's ability to proactively gather necessary clinical evidence, leading to more accurate diagnoses. This could be applied in AI-powered diagnostic assistants for healthcare professionals, intelligent symptom checkers for patients, or training tools for medical students simulating patient interactions.
Key Points
- Identifies a critical gap in current evaluations of interactive medical diagnostic agents: the neglect of the evidence-gathering process.
- Proposes an interactive evaluation framework that models the consultation process explicitly, using a simulated patient and a simulated reporter that are both grounded in atomic evidence items.
- Introduces Information Coverage Rate (ICR) as a novel metric to quantify the completeness of evidence elicitation by an agent during interaction.
- Develops EviMed, an evidence-based benchmark dataset encompassing diverse medical conditions, from common complaints to rare diseases, for systematic study.
- Demonstrates, by evaluating 10 models with varying reasoning abilities, that strong diagnostic reasoning does not guarantee effective evidence elicitation, and identifies this shortfall as a primary performance bottleneck in interactive settings.
- Proposes REFINE, a strategy that leverages diagnostic verification to guide the agent in proactively resolving diagnostic uncertainties and improving evidence collection.
- Shows that REFINE consistently outperforms baselines, facilitates effective model collaboration, and enables smaller agents to achieve superior performance under strong reasoning supervision.
Methodology
The methodology is an interactive evaluation framework with a simulated patient and a simulated reporter, both grounded in atomic evidence items. On top of this representation, a new metric, Information Coverage Rate (ICR), quantifies how completely an agent uncovers the necessary evidence during the interaction. An evidence-based benchmark, EviMed, was built and used to evaluate 10 models with varying reasoning abilities. The proposed REFINE strategy uses diagnostic verification to guide evidence elicitation.
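The abstract does not give a formula for ICR, so the following is only a minimal sketch of one plausible reading: the share of required atomic evidence items that the agent managed to elicit. The function name, the set-based definition, and the example evidence labels are all assumptions made for illustration, not the authors' implementation.

```python
from typing import Iterable, Set


def information_coverage_rate(required: Iterable[str], elicited: Iterable[str]) -> float:
    """Fraction of required atomic evidence items that the agent uncovered.

    Assumed definition (not taken from the paper):
    |required ∩ elicited| / |required|.
    """
    required_set: Set[str] = set(required)
    if not required_set:
        raise ValueError("required evidence set must not be empty")
    # Elicited items outside the required set do not raise or lower the score.
    covered = required_set & set(elicited)
    return len(covered) / len(required_set)


# Hypothetical case: four atomic evidence items are needed for the diagnosis.
required_evidence = [
    "fever_3_days",
    "productive_cough",
    "pleuritic_chest_pain",
    "crackles_right_base",
]
# Items the agent actually elicited; "smoker" is not in the required set.
elicited_evidence = ["fever_3_days", "productive_cough", "smoker"]

icr = information_coverage_rate(required_evidence, elicited_evidence)
print(f"ICR = {icr:.2f}")  # 2 of the 4 required items were covered
```

Under this reading, an agent can reach a correct diagnosis with a low ICR, or a high ICR with a wrong diagnosis, which is why a coverage metric is reported separately from outcome accuracy.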
Key Findings
The primary finding is that strong diagnostic reasoning is not sufficient for effective evidence elicitation, and that this gap is a primary bottleneck on performance in interactive settings. The proposed REFINE strategy, which uses diagnostic verification to guide the agent toward resolving open uncertainties, consistently outperforms baselines across diverse datasets and supports effective model collaboration, allowing smaller agents to achieve superior performance under strong reasoning supervision.
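The abstract describes REFINE only at a high level, so the loop below is a hypothetical illustration of what "verification guiding elicitation" could look like, not the authors' algorithm. Every name here (`verification_guided_consultation`, `ask`, `propose`, `verify`) and the toy patient record are assumptions: a proposer suggests a diagnosis from the evidence so far, a verifier returns follow-up questions for anything still unresolved, and those questions are asked next.

```python
from typing import Callable, Dict, List, Optional, Set


def verification_guided_consultation(
    ask: Callable[[str], Optional[str]],
    propose: Callable[[Set[str]], str],
    verify: Callable[[str, Set[str]], List[str]],
    opening_questions: List[str],
    max_turns: int = 10,
) -> Dict[str, object]:
    """Ask questions until the verifier raises no new ones or turns run out."""
    evidence: Set[str] = set()
    queue: List[str] = list(opening_questions)
    asked: Set[str] = set()
    turns = 0
    diagnosis = propose(evidence)
    while queue and turns < max_turns:
        question = queue.pop(0)
        if question in asked:
            continue
        asked.add(question)
        turns += 1
        answer = ask(question)  # simulated patient; None means "no information"
        if answer is not None:
            evidence.add(answer)
        diagnosis = propose(evidence)
        # The verifier turns unresolved uncertainty into follow-up questions.
        for follow_up in verify(diagnosis, evidence):
            if follow_up not in asked and follow_up not in queue:
                queue.append(follow_up)
    return {"diagnosis": diagnosis, "evidence": evidence, "turns": turns}


# Toy stand-ins (all hypothetical) for the patient, proposer, and verifier.
patient_record = {
    "Do you have a fever?": "fever",
    "Do you have a cough?": "cough",
    "Does your chest hurt when you breathe?": "pleuritic_pain",
}


def toy_propose(evidence: Set[str]) -> str:
    return "pneumonia" if {"fever", "cough"} <= evidence else "undetermined"


def toy_verify(diagnosis: str, evidence: Set[str]) -> List[str]:
    if diagnosis == "pneumonia" and "pleuritic_pain" not in evidence:
        return ["Does your chest hurt when you breathe?"]
    return []


result = verification_guided_consultation(
    ask=patient_record.get,
    propose=toy_propose,
    verify=toy_verify,
    opening_questions=["Do you have a fever?", "Do you have a cough?"],
)
print(result["diagnosis"], result["turns"], sorted(result["evidence"]))
```

Keeping the proposer and verifier as separate callables mirrors the collaboration setting the summary mentions, where a smaller agent could ask the questions while a stronger model supplies the verification.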
Clinical Impact
This work has the potential to significantly enhance the clinical utility of AI-powered diagnostic tools by addressing a fundamental limitation: incomplete information gathering. By improving the ability of AI agents to proactively elicit necessary evidence, it could lead to more thorough and accurate diagnoses, fewer diagnostic errors, and more efficient clinical workflows in real-world medical consultations. It could also contribute to developing better training systems for medical professionals.
Limitations
The abstract does not state specific limitations or caveats. However, the reliance on a simulated patient and a simulated reporter, while useful for systematic evaluation, cannot fully replicate the complexity and nuance of real clinician-patient interactions.
Future Directions
The abstract does not describe future research. Its framing of insufficient information collection as the primary bottleneck suggests that follow-up work would refine elicitation strategies such as REFINE and integrate them into more robust diagnostic AI systems. Further work could also examine how results obtained with simulated patients transfer to real-world clinical data and interactions.
Abstract
Interactive medical consultation requires an agent to proactively elicit missing clinical evidence under uncertainty. Yet existing evaluations largely remain static or outcome-centric, neglecting the evidence-gathering process. In this work, we propose an interactive evaluation framework that explicitly models the consultation process using a simulated patient and a simulated reporter grounded in atomic evidences. Based on this representation, we introduce Information Coverage Rate (ICR) to quantify how completely an agent uncovers necessary evidence during interaction. To support systematic study, we build EviMed, an evidence-based benchmark spanning diverse conditions from common complaints to rare diseases, and evaluate 10 models with varying reasoning abilities. We find that strong diagnostic reasoning does not guarantee effective information collection, and this insufficiency acts as a primary bottleneck limiting performance in interactive settings. To address this, we propose REFINE, a strategy that leverages diagnostic verification to guide the agent in proactively resolving uncertainties. Extensive experiments demonstrate that REFINE consistently outperforms baselines across diverse datasets and facilitates effective model collaboration, enabling smaller agents to achieve superior performance under strong reasoning supervision. Our code can be found at https://github.com/NanshineLoong/EID-Benchmark.