Metric-Fair Prompting: Treating Similar Samples Similarly

Summary

This paper introduces Metric-Fair Prompting, a framework that guides large language models (LLMs) to make decisions under individual metric-fairness constraints. By solving similar medical questions in joint pairs and imposing a Lipschitz-style constraint on confidence scores, the method aims to treat similar samples similarly. It is reported to improve LLM performance over standard single-item prompting on the MedQA (US) benchmark for multiple-choice medical question answering.

Medical Relevance

This research is highly relevant to medicine as it aims to improve the accuracy, reliability, and ethical fairness of AI systems used in critical clinical applications like medical education, diagnostic support, and clinical decision-making. By ensuring consistent responses to similar medical scenarios, it helps build trust and mitigate potential biases in AI-assisted healthcare.

AI Health Application

This AI framework is designed to improve the accuracy and fairness of LLMs when answering medical multiple-choice questions. Potential applications include tools for medical education, assessment of clinical knowledge, and components of AI-driven clinical decision support systems where robust and fair reasoning is critical.

Key Points

  • Metric-Fair Prompting is a fairness-aware prompting framework designed for LLMs, specifically targeting individual fairness by treating similar instances similarly.
  • The framework is applied to multiple-choice medical question answering, treating each (question, option) pair as a binary instance (correct/incorrect).
  • Question similarity is computed using NLP embeddings, enabling the system to solve items in joint pairs of similar questions rather than in isolation.
  • The prompt enforces a global decision protocol: extract decisive clinical features, map each (question, option) to a confidence score $f(x)$, and apply a Lipschitz-style constraint.
  • The Lipschitz-style constraint requires that similar inputs receive similar confidence scores, which in turn leads to consistent outputs for similar medical questions.
  • Evaluated on the MedQA (US) benchmark, Metric-Fair Prompting demonstrated improved performance over standard single-item prompting.
  • The research suggests that fairness-guided, confidence-oriented reasoning can enhance LLM accuracy on high-stakes clinical multiple-choice questions.
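
The binary-instance setup and the Lipschitz-style condition from the key points can be written compactly as follows. This is a sketch in the notation of the standard individual-fairness literature. The abstract does not give the exact form, so the distance $d$ and the constant $L$ are assumptions introduced here for illustration.

```latex
% Each instance is a (question, option) pair with a binary label.
x = (q, o), \qquad y \in \{+1, -1\}

% Lipschitz-style condition on the confidence score f:
% instances that are close under d must receive close scores.
|f(x) - f(x')| \le L \, d(x, x') \quad \text{for all instances } x, x'
```

Under this reading, two (question, option) pairs that are close under $d$ cannot receive very different confidence scores, which is one way to make "treating similar instances similarly" precise.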

Methodology

The authors developed Metric-Fair Prompting, a framework that leverages NLP embeddings to compute question similarity. This allows for the joint processing of similar medical multiple-choice questions. The method guides LLMs to extract decisive clinical features and assign a confidence score $f(x)$ to each (question, option) pair. A Lipschitz-style constraint is then imposed to ensure that similar inputs yield similar scores and consistent outputs, promoting individual fairness. Performance was evaluated on the MedQA (US) benchmark against standard single-item prompting.
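
The pairing and consistency steps described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the greedy pairing rule is an assumption (the abstract does not say how pairs are formed), and the embeddings are expected to come from whatever sentence-embedding model is in use.

```python
import numpy as np


def cosine_similarity_matrix(embeddings: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between the rows of `embeddings`."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)  # guard against zero vectors
    return unit @ unit.T


def greedy_similar_pairs(embeddings: np.ndarray) -> list[tuple[int, int]]:
    """Greedily pair questions, most similar pair first.

    Assumption: the greedy rule is illustrative only; the source does not
    specify the pairing procedure.
    """
    sim = cosine_similarity_matrix(embeddings)
    n = sim.shape[0]
    i_idx, j_idx = np.triu_indices(n, k=1)  # all candidate pairs with i < j
    order = np.argsort(-sim[i_idx, j_idx], kind="stable")
    used: set[int] = set()
    pairs: list[tuple[int, int]] = []
    for k in order:
        i, j = int(i_idx[k]), int(j_idx[k])
        if i not in used and j not in used:
            pairs.append((i, j))
            used.update((i, j))
    return pairs


def lipschitz_violation(
    score_a: float,
    score_b: float,
    distance: float,
    lipschitz_const: float = 1.0,  # assumed value, not taken from the paper
) -> float:
    """Amount by which |f(x) - f(x')| exceeds L * d(x, x'); 0.0 if satisfied."""
    return max(0.0, abs(score_a - score_b) - lipschitz_const * distance)
```

A possible use is to embed all questions, call `greedy_similar_pairs` to decide which two questions share a prompt, and then call `lipschitz_violation` on the confidence scores the model returns, with `distance` taken as, for example, one minus the cosine similarity. A positive return value flags a pair whose scores differ more than their similarity would allow.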

Key Findings

Metric-Fair Prompting improved the performance of LLMs on the MedQA (US) benchmark for multiple-choice medical question answering compared with standard single-item prompting. The authors take this as evidence that fairness-guided, confidence-oriented reasoning can enhance LLM accuracy in high-stakes clinical contexts. The abstract does not report the size of the improvement.

Clinical Impact

This method could lead to more accurate, reliable, and ethically fair AI systems in clinical practice. It has the potential to enhance tools for medical diagnosis, treatment planning, and medical education by ensuring consistency in responses to similar patient cases or clinical queries. This improved trustworthiness could facilitate greater adoption and impact of AI in critical healthcare decision-making.

Limitations

The abstract does not explicitly state any limitations or caveats of the research.

Future Directions

The abstract does not explicitly state future research directions.

Medical Domains

Clinical Medicine, Medical Education, Diagnostic Support, Medical AI Ethics, Healthcare Informatics

Keywords

Metric-Fair Prompting, LLMs, individual fairness, medical question answering, NLP embeddings, Lipschitz constraint, MedQA, clinical decision support

Abstract

We introduce Metric-Fair Prompting, a fairness-aware prompting framework that guides large language models (LLMs) to make decisions under metric-fairness constraints. In the application of multiple-choice medical question answering, each (question, option) pair is treated as a binary instance with label $+1$ (correct) or $-1$ (incorrect). To promote individual fairness, that is, treating similar instances similarly, we compute question similarity using NLP embeddings and solve items in joint pairs of similar questions rather than in isolation. The prompt enforces a global decision protocol: extract decisive clinical features, map each (question, option) pair to a score $f(x)$ that acts as confidence, and impose a Lipschitz-style constraint so that similar inputs receive similar scores and, hence, consistent outputs. Evaluated on the MedQA (US) benchmark, Metric-Fair Prompting is shown to improve performance over standard single-item prompting, demonstrating that fairness-guided, confidence-oriented reasoning can enhance LLM accuracy on high-stakes clinical multiple-choice questions.

Journal Reference

NeurIPS 2025 Workshop on Socially Responsible and Trustworthy Foundation Models