Health-SCORE: Towards Scalable Rubrics for Improving Health-LLMs

Summary

Health-SCORE is a rubric-based framework that reduces the cost and effort of creating domain-specific rubrics for evaluating and training Large Language Models (LLMs) in healthcare, a safety-critical domain. According to the abstract, it achieves evaluation quality comparable to human-created rubrics. The same rubrics can also serve as a structured reward signal for safety-aware reinforcement learning and can be placed directly in prompts to improve response quality via in-context learning. The aim is to make rubric-based evaluation and training of Health-LLMs more scalable.

Medical Relevance

This work is critical for the safe and effective deployment of AI in healthcare, a safety-critical domain where LLM accuracy and reliability are paramount. By making the evaluation and training of Health-LLMs more scalable and cost-effective, it facilitates the development of AI tools that can be more rigorously vetted and improved before clinical application.

AI Health Application

The paper introduces Health-SCORE, a framework for evaluating, training, and improving Large Language Models (LLMs) used in healthcare. Its rubrics serve as structured reward signals for reinforcement learning and as in-context guidance in prompts, both intended to improve the quality and safety of LLM responses in medical contexts.

Key Points

  • Addresses the substantial human expertise, time, and cost required to create high-quality, domain-specific rubrics for evaluating open-ended LLM responses in healthcare.
  • Introduces Health-SCORE, a generalizable and scalable rubric-based training and evaluation framework specifically designed for Health-LLMs.
  • The framework substantially reduces rubric development costs without sacrificing evaluation performance.
  • Health-SCORE functions as a structured reward signal to guide reinforcement learning (RL) algorithms with safety-aware supervision, enabling more robust LLM training.
  • It can be directly incorporated into LLM prompts to improve response quality through in-context learning, enhancing model performance during inference.
  • Achieves evaluation quality comparable to traditional human-created rubrics across various open-ended healthcare tasks.
  • The primary implication is a significant increase in the scalability of rubric-based evaluation and training for healthcare-specific LLMs by reducing development effort.
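The in-context use mentioned in the key points can be illustrated with a minimal sketch. The abstract does not describe the actual prompt format, so the function name `build_prompt`, the instruction wording, and the two example criteria below are hypothetical and are not taken from the paper.

```python
def build_prompt(question: str, rubric: list[str]) -> str:
    """Prepend rubric criteria to a question as in-context guidance.

    Hypothetical prompt layout; the paper's real format is not given
    in the abstract.
    """
    lines = ["When answering, make sure your response meets these criteria:"]
    # Number the criteria so the model can address each one explicitly.
    for i, criterion in enumerate(rubric, start=1):
        lines.append(f"{i}. {criterion}")
    lines.append("")
    lines.append(f"Question: {question}")
    return "\n".join(lines)


# Illustrative criteria only (assumed, not from the paper).
rubric = [
    "Advises seeking emergency care for red-flag symptoms.",
    "Avoids giving a definitive diagnosis.",
]
prompt = build_prompt("I have chest pain. What should I do?", rubric)
print(prompt)
```

The resulting string would be sent to the model in place of the bare question, so the criteria act as guidance at inference time without any retraining.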

Methodology

Health-SCORE is presented as a generalizable and scalable rubric-based framework designed to reduce the cost and effort of creating high-quality, domain-specific rubrics. It leverages these generated rubrics in two primary ways: first, as a structured reward signal to guide reinforcement learning with safety-aware supervision for LLM training; and second, by directly incorporating them into prompts to enhance LLM response quality through in-context learning during inference.
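To make the reward-signal idea concrete, here is a minimal sketch of turning rubric judgments into a scalar reward. It is an assumption-laden illustration, not the paper's method: the `Criterion` structure, the weighting scheme (negative weights for unsafe behaviour), the clipping to [0, 1], and the keyword-based `toy_judge` (a stand-in for an LLM or human grader) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Criterion:
    """One rubric item (hypothetical structure)."""
    text: str
    weight: float  # positive = desired behaviour, negative = penalised behaviour


# A judge decides whether a response satisfies a criterion's text.
Judge = Callable[[str, str], bool]


def rubric_reward(response: str, rubric: list[Criterion], judge: Judge) -> float:
    """Return earned points divided by achievable points, clipped to [0, 1]."""
    max_points = sum(c.weight for c in rubric if c.weight > 0)
    if max_points == 0:
        return 0.0
    # Met negative-weight criteria subtract from the earned total.
    earned = sum(c.weight for c in rubric if judge(response, c.text))
    return min(1.0, max(0.0, earned / max_points))


# Toy judge: keyword lookup standing in for a real grader (assumption).
TOY_KEYWORDS = {
    "Advises emergency care for red-flag symptoms": "emergency",
    "Recommends a specific drug dose without clinician input": " mg ",
}


def toy_judge(response: str, criterion_text: str) -> bool:
    return TOY_KEYWORDS[criterion_text] in response.lower()


rubric = [
    Criterion("Advises emergency care for red-flag symptoms", 2.0),
    Criterion("Recommends a specific drug dose without clinician input", -2.0),
]
safe = "Chest pain can be serious. Please call emergency services now."
unsafe = "Call emergency services, and take 500 mg of aspirin every hour."

print(rubric_reward(safe, rubric, toy_judge))
print(rubric_reward(unsafe, rubric, toy_judge))
```

In an RL loop, a value like this would be computed per sampled response and passed to the policy-optimization step as the reward. Giving unsafe behaviours negative weight is one simple way to express safety-aware supervision, chosen here for illustration only.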

Key Findings

The main findings indicate that Health-SCORE successfully delivers evaluation quality comparable to that achieved with labor-intensive human-created rubrics across open-ended healthcare tasks. Concurrently, it significantly lowers the development effort required for these rubrics, thereby making rubric-based evaluation and training for Health-LLMs considerably more scalable.
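The abstract does not say how "comparable" evaluation quality is measured. One hypothetical way to check agreement is to score the same responses under both rubric sets and compare the scores, as sketched below. The helper names and the score lists are placeholders invented for illustration; they are not results from the paper.

```python
from math import sqrt


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


def mean_abs_diff(xs: list[float], ys: list[float]) -> float:
    """Average absolute gap between paired scores."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("need two non-empty lists of equal length")
    return sum(abs(x - y) for x, y in zip(xs, ys)) / len(xs)


# Placeholder per-response scores (made up, NOT from the paper).
human_rubric_scores = [0.2, 0.5, 0.9, 0.7]
generated_rubric_scores = [0.3, 0.5, 0.8, 0.7]

print(pearson(human_rubric_scores, generated_rubric_scores))
print(mean_abs_diff(human_rubric_scores, generated_rubric_scores))
```

A high correlation together with a small mean gap would suggest the two rubric sets rank and score responses similarly. The paper may use a different metric.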

Clinical Impact

Health-SCORE has the potential to accelerate the development and deployment of safer and more accurate AI tools in clinical settings. It could allow healthcare organizations and researchers to more efficiently evaluate and train LLMs for tasks like providing patient information, assisting with clinical summaries, or supporting diagnostic processes, reducing the time and expert resources traditionally required for robust validation. This could lead to more trustworthy AI applications that improve patient care and reduce the risk of AI-generated errors.

Limitations

Not explicitly mentioned in the abstract.

Future Directions

Not explicitly mentioned in the abstract.

Medical Domains

  • Clinical decision support
  • Patient education and communication
  • Medical information retrieval
  • Diagnostic assistance systems
  • Healthcare quality assurance
  • Medical education

Keywords

  • LLMs
  • Healthcare AI
  • Rubric-based evaluation
  • Reinforcement Learning
  • In-context learning
  • Scalability
  • Safety-critical domains
  • Medical language models

Abstract

Rubrics are essential for evaluating open-ended LLM responses, especially in safety-critical domains such as healthcare. However, creating high-quality and domain-specific rubrics typically requires significant human expertise, time, and development cost, making rubric-based evaluation and training difficult to scale. In this work, we introduce Health-SCORE, a generalizable and scalable rubric-based training and evaluation framework that substantially reduces rubric development costs without sacrificing performance. We show that Health-SCORE provides two practical benefits beyond standalone evaluation: it can be used as a structured reward signal to guide reinforcement learning with safety-aware supervision, and it can be incorporated directly into prompts to improve response quality through in-context learning. Across open-ended healthcare tasks, Health-SCORE achieves evaluation quality comparable to human-created rubrics while significantly lowering development effort, making rubric-based evaluation and training more scalable.