A multimodal Bayesian Network for symptom-level depression and anxiety prediction from voice and speech data

Summary

This paper introduces a multimodal Bayesian Network for predicting symptom-level depression and anxiety from voice and speech data, and evaluates it on large-scale datasets covering 30,135 unique speakers. The model reaches ROC-AUC above 0.8 for both conditions with low calibration error (ECE below 0.02). The authors also assess demographic fairness and present the model as a transparent, explainable tool to support clinical psychiatric assessment.

Medical Relevance

This research addresses a critical need in psychiatric assessment by providing an objective, AI-driven method to integrate nonverbal cues (voice, speech) for symptom-level depression and anxiety prediction, potentially enhancing diagnostic support, monitoring, and personalized care.

AI Health Application

The paper describes a medical AI application in the form of a multimodal Bayesian Network designed to act as an 'assessment support tool' for clinicians. This AI system predicts symptoms of depression and anxiety from patient voice and speech data, aiming to enhance and aid psychiatric assessment by integrating various nonverbal cues.

Key Points

  • Developed a multimodal Bayesian Network model for symptom-level prediction of depression and anxiety.
  • Utilizes voice and speech features as primary input data streams.
  • Evaluated on large-scale datasets comprising 30,135 unique speakers.
  • Achieved high predictive performance: Depression (ROC-AUC=0.842, ECE=0.018) and Anxiety (ROC-AUC=0.831, ECE=0.015), with core individual symptom ROC-AUCs > 0.74.
  • Assessed the demographic fairness of the model's predictions.
  • Investigated the integration across and redundancy between different input modality types.
  • Emphasizes clinical usefulness metrics and acceptability to mental health service users, aiming for transparent, explainable, and expert-supervisable outputs.
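
To make the idea of symptom-level prediction from several modalities concrete, the following is a minimal sketch of a toy Bayesian network with one binary symptom node and two binary feature nodes. It is not the paper's model: the network structure, the node names (`voice`, `speech`), and every probability are assumptions invented for illustration.

```python
# Toy network: Symptom -> VoiceFeature, Symptom -> SpeechFeature.
# Structure and all numbers are hypothetical, not taken from the paper.

PRIOR = 0.2  # assumed P(symptom present)

# Assumed P(feature is "high" | symptom present / absent)
CPT = {
    "voice": {True: 0.7, False: 0.2},
    "speech": {True: 0.6, False: 0.3},
}


def posterior(evidence):
    """Return P(symptom present | evidence).

    `evidence` maps a modality name to True ("high") or False ("low").
    A modality that was not observed is simply left out of the dict,
    which marginalises it away in this structure.
    """
    weight_present = PRIOR
    weight_absent = 1.0 - PRIOR
    for modality, is_high in evidence.items():
        p_present = CPT[modality][True]
        p_absent = CPT[modality][False]
        weight_present *= p_present if is_high else 1.0 - p_present
        weight_absent *= p_absent if is_high else 1.0 - p_absent
    return weight_present / (weight_present + weight_absent)


if __name__ == "__main__":
    print(posterior({}))                               # prior only
    print(posterior({"voice": True}))                  # one modality
    print(posterior({"voice": True, "speech": True}))  # both modalities
```

Because every factor in the product is a readable conditional probability, a clinician can inspect which observation moved the estimate and by how much. That is the kind of transparency the summary attributes to Bayesian network outputs.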

Methodology

The study employs a multimodal Bayesian Network model designed to predict depression and anxiety at the symptom level. Input data consists of voice and speech features drawn from large-scale datasets covering 30,135 unique speakers. Performance was evaluated using the area under the receiver operating characteristic curve (ROC-AUC) and Expected Calibration Error (ECE), both for the overall conditions and for individual core symptoms. Further assessments covered demographic fairness, integration across and redundancy between input modality types, clinical usefulness metrics, and acceptability to mental health service users.
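
The two headline metrics can be sketched in a few lines of plain Python. This is an illustration of the standard definitions only; the abstract does not state the binning scheme or implementation used, so the equal-width bins and the default of `n_bins=10` are assumptions.

```python
def roc_auc(labels, scores):
    """Chance that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def expected_calibration_error(labels, probs, n_bins=10):
    """Weighted mean gap between predicted probability and observed rate per bin."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(labels, probs):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[index].append((y, p))
    total = len(labels)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_prob = sum(p for _, p in members) / len(members)
        observed_rate = sum(y for y, _ in members) / len(members)
        ece += (len(members) / total) * abs(observed_rate - mean_prob)
    return ece


if __name__ == "__main__":
    y = [0, 0, 1, 1]
    p = [0.1, 0.4, 0.35, 0.8]
    print(roc_auc(y, p))
    print(expected_calibration_error(y, p))
```

ROC-AUC measures only how well the scores rank cases, whereas ECE measures whether a predicted probability of, say, 0.3 corresponds to roughly 30% observed prevalence. A model can do well on one and poorly on the other, which is why the two are reported together.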

Key Findings

The multimodal Bayesian Network achieved ROC-AUCs of 0.842 for depression and 0.831 for anxiety, with low ECE values (0.018 and 0.015 respectively), indicating good discrimination and calibration. Core individual symptoms were also predicted well (ROC-AUC > 0.74). The study assessed demographic fairness and examined how the different voice and speech modalities integrate and where they are redundant. The outputs are designed to be clinically relevant at the symptom level, transparent, explainable, and directly amenable to expert clinical supervision, which addresses key barriers to clinical adoption.
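
The abstract does not say how demographic fairness was measured. One common approach, shown here purely as an assumed illustration, is to compute ROC-AUC separately for each demographic group and inspect the largest gap. The function names and the toy data are hypothetical.

```python
from collections import defaultdict


def roc_auc(labels, scores):
    """Pairwise ROC-AUC: fraction of positive/negative pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("each group needs both positive and negative examples")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def auc_by_group(labels, scores, groups):
    """Return {group: ROC-AUC} computed on each group's examples only."""
    split = defaultdict(lambda: ([], []))
    for y, s, g in zip(labels, scores, groups):
        split[g][0].append(y)
        split[g][1].append(s)
    return {g: roc_auc(ys, ss) for g, (ys, ss) in split.items()}


def max_auc_gap(per_group):
    """Largest difference in ROC-AUC between any two groups."""
    values = list(per_group.values())
    return max(values) - min(values)


if __name__ == "__main__":
    labels = [1, 0, 1, 0, 0]
    scores = [0.9, 0.1, 0.6, 0.7, 0.2]
    groups = ["A", "A", "B", "B", "B"]
    per_group = auc_by_group(labels, scores, groups)
    print(per_group, max_auc_gap(per_group))
```

A small gap on a discrimination metric does not by itself establish fairness. Calibration can also be compared per group in the same way.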

Clinical Impact

This work offers a significant step towards developing robust, intelligence-driven assessment support tools for mental health. By providing transparent, explainable, and symptom-level predictions from readily available voice and speech data, the model can assist clinicians in objectively integrating nonverbal cues, potentially leading to earlier detection, more consistent diagnosis, and improved monitoring of depression and anxiety, thereby augmenting current psychiatric assessment practices.

Limitations

The abstract does not explicitly detail specific limitations of the developed model itself. It notes that intelligence-driven tools are 'yet to be realized in the clinic,' implying a current gap in practical adoption that this research aims to overcome rather than a limitation of the proposed solution.

Future Directions

While explicit future research directions are not detailed in the abstract, the paper advocates for such models as a 'principled approach for building robust assessment support tools,' suggesting ongoing work towards broader clinical implementation, refinement, and integration into routine psychiatric care to realize intelligence-driven tools in clinical practice.

Medical Domains

Psychiatry, Mental Health, Clinical Psychology, Neuropsychiatry

Keywords

Bayesian Network, depression, anxiety, voice analysis, speech processing, symptom prediction, psychiatric assessment, explainable AI

Abstract

During psychiatric assessment, clinicians observe not only what patients report, but also important nonverbal signs such as tone, speech rate, fluency, responsiveness, and body language. Weighing and integrating these different information sources is a challenging task and a good candidate for support by intelligence-driven tools; however, this is yet to be realized in the clinic. Here, we argue that several important barriers to adoption can be addressed using Bayesian network modelling. To demonstrate this, we evaluate a model for depression and anxiety symptom prediction from voice and speech features in large-scale datasets (30,135 unique speakers). Alongside performance for conditions and symptoms (for depression and anxiety respectively, ROC-AUC=0.842, 0.831 and ECE=0.018, 0.015; core individual symptom ROC-AUC>0.74), we assess demographic fairness and investigate integration across and redundancy between different input modality types. Clinical usefulness metrics and acceptability to mental health service users are explored. When provided with sufficiently rich and large-scale multimodal data streams and specified to represent common mental conditions at the symptom rather than disorder level, such models are a principled approach for building robust assessment support tools: providing clinically-relevant outputs in a transparent and explainable format that is directly amenable to expert clinical supervision.