Multi-modal Models for Depression Screening

Depression is a common mental health disorder, yet because trained clinicians are in short supply, mental health screening is costly. With advances in audio-visual speech recognition, a virtual interviewer may offer an affordable alternative for depression screening. Prior research has classified depression with high performance using audio, text, or images as input, but these modalities introduce new challenges for training a virtual interviewer with deep learning, including small sample sizes, multiple modalities, and long audio-video recordings. First, I study voice recordings to quantify the effect of including follow-up questions in clinical interviews for depression screening, training transfer learning models on the popular Distress Analysis Interview Corpus - Wizard-of-Oz (DAIC-WOZ). Second, I propose AudiFace, a multi-modal deep learning model that consumes temporal facial features, recorded audio, and transcripts to screen for depression. AudiFace combines pre-trained transfer learning models with bidirectional LSTMs and self-attention layers to capture long-range dependencies within sequences; however, it fuses the three uni-modal embeddings into a single representation by simple concatenation. Hence, I also propose WavFace, a multi-modal transformer-based model that takes audio and temporal facial features as input, explicitly aligns the two modalities, and applies sequential and spatial self-attention over the aligned features for depression screening. Finally, to tackle the small-dataset challenge common in the mental health community, I leverage multi-task learning with an auxiliary task: the two main tasks are depression and post-traumatic stress disorder (PTSD) screening, while the auxiliary task is missing-value imputation. The results across all 15 datasets from DAIC-WOZ suggest that these multi-modal and multi-task strategies improve the uni-modal representations and, consequently, the depression-screening metrics. I believe these models provide valuable findings for the future of mental health screening applications as well as clinical screening interviews.
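
To make the sequence-encoding idea behind AudiFace concrete, the following PyTorch sketch shows a bidirectional LSTM followed by a self-attention layer and mean pooling to a fixed-size embedding. It is a minimal illustration under assumed layer sizes; the class name, dimensions, and pooling choice are hypothetical, not the thesis implementation.

    import torch
    import torch.nn as nn

    class BiLSTMSelfAttention(nn.Module):
        # Sequence encoder in the spirit of the abstract's description:
        # a bidirectional LSTM, self-attention over its outputs, then
        # mean pooling to a fixed-size embedding.
        def __init__(self, feat_dim=64, hidden=128, heads=4):
            super().__init__()
            self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True,
                                bidirectional=True)
            self.attn = nn.MultiheadAttention(2 * hidden, heads,
                                              batch_first=True)

        def forward(self, x):              # x: (batch, time, feat_dim)
            h, _ = self.lstm(x)            # (batch, time, 2 * hidden)
            a, _ = self.attn(h, h, h)      # self-attention across time steps
            return a.mean(dim=1)           # (batch, 2 * hidden)

    encoder = BiLSTMSelfAttention()
    embedding = encoder(torch.randn(2, 500, 64))  # e.g. 500 facial-feature frames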
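
The simple concatenation fusion that AudiFace uses, and that WavFace is designed to improve on, can likewise be sketched. The encoder output dimensions below are illustrative assumptions only.

    import torch
    import torch.nn as nn

    class ConcatFusionClassifier(nn.Module):
        # Joint representation by simple concatenation of the three
        # uni-modal embeddings, followed by a small classification head.
        def __init__(self, audio_dim=256, text_dim=768, face_dim=256):
            super().__init__()
            self.head = nn.Sequential(
                nn.Linear(audio_dim + text_dim + face_dim, 128),
                nn.ReLU(),
                nn.Linear(128, 1),         # one logit for binary screening
            )

        def forward(self, audio_emb, text_emb, face_emb):
            fused = torch.cat([audio_emb, text_emb, face_emb], dim=-1)
            return self.head(fused)

    model = ConcatFusionClassifier()
    logit = model(torch.randn(4, 256), torch.randn(4, 768), torch.randn(4, 256))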
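
Finally, a hedged sketch of the multi-task setup: a shared encoder feeding depression and PTSD screening heads plus an auxiliary head that reconstructs the input for missing-value imputation. The loss weight w_aux and all dimensions are arbitrary illustrative choices, not values from the thesis.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MultiTaskScreener(nn.Module):
        def __init__(self, in_dim=256, hidden=128):
            super().__init__()
            # Representation shared by both screening tasks and the
            # auxiliary imputation task.
            self.encoder = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
            self.depression_head = nn.Linear(hidden, 1)
            self.ptsd_head = nn.Linear(hidden, 1)
            self.imputation_head = nn.Linear(hidden, in_dim)  # reconstructs input

        def forward(self, x):
            h = self.encoder(x)
            return (self.depression_head(h), self.ptsd_head(h),
                    self.imputation_head(h))

    def multitask_loss(dep_logit, ptsd_logit, recon, dep_y, ptsd_y, x, w_aux=0.3):
        # Two screening losses plus a down-weighted reconstruction loss;
        # w_aux = 0.3 is an arbitrary illustrative weight.
        return (F.binary_cross_entropy_with_logits(dep_logit, dep_y)
                + F.binary_cross_entropy_with_logits(ptsd_logit, ptsd_y)
                + w_aux * F.mse_loss(recon, x))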

Identifier
  • etd-113995
Year
  • 2023
Date created
  • 2023-10-11
Source
  • etd-113995
Last modified
  • 2024-01-25

Permanent link to this page: https://digital.wpi.edu/show/z603r2841