Source: ECAI
Date of publication: 10/19/2024
Authors: Ilya Makarov, Andrey Savchenko, Bashar M. Deeb
CA-SER: Cross-Attention Feature Fusion for Speech Emotion Recognition
Abstract
In this paper, we introduce CA-SER, a novel approach to speech emotion recognition that leverages self-supervised learning: semantic speech representations are extracted from a pre-trained wav2vec 2.0 model and combined with spectral audio features to improve recognition accuracy. Our approach applies a self-attention encoder to MFCC features to capture meaningful patterns in audio sequences. These MFCC features are then combined with the high-level wav2vec 2.0 representations using a multi-head cross-attention mechanism. Evaluation of speech emotion recognition on the IEMOCAP dataset shows that our system achieves a weighted accuracy of 74.6%, outperforming most existing techniques.
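The fusion step described in the abstract can be sketched as follows. This is a minimal illustration only: the module name, dimensions, number of heads, and the residual/normalization layout are assumptions, not the authors' exact CA-SER configuration.

```python
# Hedged sketch of multi-head cross-attention fusion between
# self-attended MFCC features and wav2vec 2.0 representations.
# All hyperparameters here are illustrative assumptions.
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Fuse MFCC encoder outputs (queries) with wav2vec 2.0
    representations (keys/values) via cross-attention."""
    def __init__(self, dim=256, num_heads=4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, mfcc_feats, w2v_feats):
        # Query: self-attended MFCC features; key/value: wav2vec 2.0 features.
        fused, _ = self.cross_attn(mfcc_feats, w2v_feats, w2v_feats)
        # Residual connection and layer norm (assumed layout).
        return self.norm(fused + mfcc_feats)

# Toy shapes: batch of 2, 50 MFCC frames, 100 wav2vec frames, width 256.
fusion = CrossAttentionFusion()
mfcc = torch.randn(2, 50, 256)
w2v = torch.randn(2, 100, 256)
out = fusion(mfcc, w2v)
print(out.shape)  # torch.Size([2, 50, 256])
```

Note that the fused output keeps the MFCC-side sequence length: each MFCC frame attends over the full wav2vec 2.0 sequence, which is how cross-attention lets the two feature streams have different frame rates.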