Source
IEEE Access
Publication date
24.03.2025
Authors
Bashar M. Deeb, Andrey Savchenko, Ilya Makarov

Enhancing Emotion Recognition in Speech based on Self-Supervised Learning: Cross-Attention Fusion of Acoustic and Semantic Features

Abstract

Speech Emotion Recognition has gained considerable attention in speech processing and machine learning due to its potential applications in human-computer interaction, mental health monitoring, and customer service. However, state-of-the-art models for speech emotion recognition rely on large numbers of parameters, which makes them computationally expensive. In this paper, we introduce a novel deep-learning model that enhances the accuracy of emotional content detection in speech signals while maintaining a lightweight architecture compared to state-of-the-art models. The proposed model incorporates a feature encoder that significantly improves the emotional representation of acoustic features and a cross-attention mechanism that fuses acoustic features, such as spectrograms, with semantic features extracted from a pre-trained self-supervised learning framework, enriching the emotional context. An extensive experimental study demonstrates that the proposed model achieves a weighted accuracy of 74.6% on the IEMOCAP dataset, competitive with state-of-the-art baselines, with a latency of 24 milliseconds on moderate devices while containing up to 3 times fewer parameters.
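
For illustration only, below is a minimal PyTorch sketch of how a cross-attention fusion of acoustic features (e.g., spectrogram frames) and semantic features from a pre-trained self-supervised speech model might look. It is not the authors' implementation: the module names, dimensions, pooling, and classification head are all assumptions.

```python
# Hedged sketch (not the paper's architecture): acoustic frames act as queries
# and attend to semantic (SSL) frames, producing fused emotion representations.
import torch
import torch.nn as nn


class CrossAttentionFusion(nn.Module):
    def __init__(self, acoustic_dim=128, semantic_dim=768, d_model=256,
                 n_heads=4, n_classes=4):
        super().__init__()
        # Project both feature streams into a shared model dimension.
        self.acoustic_proj = nn.Linear(acoustic_dim, d_model)
        self.semantic_proj = nn.Linear(semantic_dim, d_model)
        # Cross-attention: acoustic queries, semantic keys/values.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads,
                                                batch_first=True)
        self.norm = nn.LayerNorm(d_model)
        self.classifier = nn.Linear(d_model, n_classes)

    def forward(self, acoustic, semantic):
        # acoustic: (batch, T_a, acoustic_dim), e.g., spectrogram frames
        # semantic: (batch, T_s, semantic_dim), e.g., SSL model outputs
        q = self.acoustic_proj(acoustic)
        kv = self.semantic_proj(semantic)
        fused, _ = self.cross_attn(q, kv, kv)   # acoustic attends to semantic
        fused = self.norm(fused + q)            # residual connection
        pooled = fused.mean(dim=1)              # temporal mean pooling
        return self.classifier(pooled)          # emotion logits


if __name__ == "__main__":
    model = CrossAttentionFusion()
    spec = torch.randn(2, 300, 128)   # dummy spectrogram frames
    ssl = torch.randn(2, 150, 768)    # dummy SSL features (e.g., wav2vec 2.0)
    print(model(spec, ssl).shape)     # torch.Size([2, 4])
```

The key design choice illustrated here is that the two streams may have different frame rates and dimensionalities; cross-attention aligns them without explicit resampling, since queries and keys/values are allowed to have different sequence lengths.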
