Now You Hear Me: Audio Narrative Attacks Against Large Audio-Language Models

Summary

This paper, "Now You Hear Me," introduces a text-to-audio jailbreak attack that targets large audio-language models (ALMs) by embedding disallowed directives within narrative-style audio streams. Using an advanced instruction-following text-to-speech (TTS) model to exploit structural and acoustic properties of speech, the attack bypasses safety mechanisms calibrated primarily for text and achieves a 98.26% success rate against state-of-the-art models, including Gemini 2.0 Flash. The findings expose a vulnerability in speech-based AI interfaces and point to the need for safety frameworks that jointly analyze linguistic and paralinguistic features.

Medical Relevance

As ALMs are increasingly integrated into healthcare for applications like voice assistants, patient education, and clinical triage, this research exposes a critical security flaw. The ability to inject malicious or restricted directives via audio could lead to the generation of harmful medical advice, misinformation, or breaches of patient confidentiality, directly impacting patient safety and clinical integrity.

AI Health Application

The health application discussed is the use of large audio-language models in healthcare settings, specifically for tasks such as clinical triage. In such a setting, a model would typically assist with initial patient assessment, guide patients, or provide information, so susceptibility to narrative audio attacks could compromise the integrity and safety of healthcare delivery.

Key Points

  • Identifies a new class of vulnerabilities in large audio-language models (ALMs) that operate on raw speech inputs.
  • Proposes a text-to-audio jailbreak attack that embeds disallowed directives into narrative-style audio streams.
  • Utilizes an advanced instruction-following text-to-speech (TTS) model to exploit the structural and acoustic properties of speech.
  • Designed to circumvent safety mechanisms primarily calibrated for text-based inputs.
  • Successfully demonstrated against state-of-the-art ALMs, including Gemini 2.0 Flash.
  • Achieves a high success rate of 98.26% in eliciting restricted outputs, significantly exceeding text-only attack baselines.
  • Stresses the imperative for developing safety frameworks that jointly reason over linguistic and paralinguistic representations.

Methodology

The authors designed a text-to-audio jailbreak that starts from disallowed textual directives. These directives were fed into an advanced instruction-following text-to-speech (TTS) model, which embedded them within a narrative-style audio stream and exploited structural (e.g., phrasing, cadence) and acoustic (e.g., intonation, emphasis) properties of synthetic speech. The resulting audio was then used as input to state-of-the-art large audio-language models, including Gemini 2.0 Flash, to assess whether they would resist generating restricted outputs.

Key Findings

The key finding is that the proposed audio narrative attack achieves a 98.26% success rate in eliciting restricted or disallowed outputs from state-of-the-art audio-language models (ALMs), including Gemini 2.0 Flash. This substantially exceeds text-only jailbreak baselines and indicates that safety mechanisms calibrated primarily for text are susceptible to audio-based manipulations that exploit paralinguistic cues and narrative structure.
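The abstract reports the success rate but does not spell out how it is computed. A minimal sketch, assuming the metric is the fraction of attack attempts for which a judge labeled the model's response as restricted output, is shown below. The `Trial` type, its field names, and the sample data are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    """One evaluation attempt (hypothetical structure, not from the paper)."""
    prompt_id: str
    restricted_output: bool  # True if a judge labeled the response as restricted


def attack_success_rate(trials: list[Trial]) -> float:
    """Return the percentage of trials that elicited restricted output."""
    if not trials:
        raise ValueError("at least one trial is required")
    successes = sum(1 for t in trials if t.restricted_output)
    return 100.0 * successes / len(trials)


# Illustrative data only: 3 of 4 attempts judged successful.
trials = [
    Trial("p1", True),
    Trial("p2", True),
    Trial("p3", False),
    Trial("p4", True),
]
print(f"{attack_success_rate(trials):.2f}%")  # prints "75.00%"
```

Under this definition, comparing an audio attack with a text-only baseline amounts to computing the same percentage over the same prompt set for each input modality.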

Clinical Impact

This research indicates that ALM-powered tools in healthcare may be vulnerable to audio-based attacks. For instance, a malicious actor could prompt an AI-driven clinical triage system to provide dangerously incorrect medical advice, a patient information bot to leak sensitive data, or a medical voice assistant to generate unethical content. Designers of medical AI should therefore validate inputs across modalities rather than rely on text-calibrated safeguards, to prevent life-threatening or privacy-compromising failures in clinical practice.

Limitations

The abstract does not explicitly state study limitations. However, an implied limitation is the focus on demonstrating the existence and efficacy of the attack rather than providing concrete defensive solutions. The generalizability across all possible TTS models or ALM architectures is also not fully explored within the scope of the abstract, nor are the computational costs of developing such an attack or robust defenses.

Future Directions

The paper calls for safety frameworks for large audio-language models that jointly reason over linguistic (what is said) and paralinguistic (how it is said) representations. This implies a need for security mechanisms that can detect malicious intent carried by acoustic features, intonation, and narrative structure in audio inputs, which matters most for speech-based AI in critical applications such as healthcare.

Medical Domains

clinical triage, telemedicine, digital health, medical voice assistants, patient education platforms, clinical decision support systems

Keywords

audio-language models, jailbreak attack, text-to-speech, vulnerability, safety mechanisms, clinical triage, paralinguistic, AI security

Abstract

Large audio-language models increasingly operate on raw speech inputs, enabling more seamless integration across domains such as voice assistants, education, and clinical triage. This transition, however, introduces a distinct class of vulnerabilities that remain largely uncharacterized. We examine the security implications of this modality shift by designing a text-to-audio jailbreak that embeds disallowed directives within a narrative-style audio stream. The attack leverages an advanced instruction-following text-to-speech (TTS) model to exploit structural and acoustic properties, thereby circumventing safety mechanisms primarily calibrated for text. When delivered through synthetic speech, the narrative format elicits restricted outputs from state-of-the-art models, including Gemini 2.0 Flash, achieving a 98.26% success rate that substantially exceeds text-only baselines. These results highlight the need for safety frameworks that jointly reason over linguistic and paralinguistic representations, particularly as speech-based interfaces become more prevalent.

Comments

To be published at the EACL 2026 main conference.