MEGFormer: enhancing speech decoding from brain activity through extended semantic representations
Abstract
Although multiple studies in recent years have examined non-invasive decoding of speech from brain activity, the task remains challenging because decoding quality is still insufficient for practical applications. An effective solution could advance brain-computer interfaces (BCIs), potentially enabling communication restoration for individuals with speech impairments. At the same time, such studies can provide fundamental insights into how the brain processes speech and sound. One approach to decoding perceived speech uses a self-supervised model trained with contrastive learning, which matches segments of equal length from magnetoencephalography (MEG) recordings to audio in a zero-shot manner. We improve this method for decoding perceived speech by introducing a new architecture based on a CNN-Transformer. As a result of the proposed modifications, the accuracy of perceived speech decoding increases substantially, from 69% to 83% and from 67% to 70%, on publicly available datasets. Notably, the greatest improvement in accuracy is observed for longer speech fragments that carry semantic meaning, rather than for shorter fragments containing sounds and phonemes. Our code is available at https://github.com/maryjis/MEGformer/.
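To illustrate the contrastive matching idea the abstract refers to, the following is a minimal sketch (not the authors' implementation) of a CLIP-style objective that aligns MEG-segment embeddings with audio-segment embeddings. The encoder structure, module names, dimensions, and hyperparameters below are illustrative assumptions, loosely mirroring the CNN-Transformer design mentioned above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MEGEncoder(nn.Module):
    """Toy MEG encoder: 1D convolutions followed by a Transformer encoder
    (an assumed, simplified stand-in for the CNN-Transformer architecture)."""
    def __init__(self, n_channels=208, d_model=256, n_layers=2):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_channels, d_model, kernel_size=3, padding=1),
            nn.GELU(),
            nn.Conv1d(d_model, d_model, kernel_size=3, padding=1),
        )
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, x):                      # x: (batch, channels, time)
        h = self.conv(x).transpose(1, 2)       # (batch, time, d_model)
        h = self.transformer(h)
        return h.mean(dim=1)                   # one embedding per segment


def contrastive_loss(meg_emb, audio_emb, temperature=0.1):
    """Symmetric InfoNCE: the i-th MEG segment should match the i-th audio
    segment; all other segments in the batch serve as negatives."""
    meg_emb = F.normalize(meg_emb, dim=-1)
    audio_emb = F.normalize(audio_emb, dim=-1)
    logits = meg_emb @ audio_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


if __name__ == "__main__":
    meg = torch.randn(8, 208, 360)             # 8 synthetic MEG segments
    audio_emb = torch.randn(8, 256)            # matching audio embeddings
    loss = contrastive_loss(MEGEncoder()(meg), audio_emb)
    print(loss.item())
```

In this setup, zero-shot decoding amounts to embedding an unseen MEG segment and selecting the candidate audio segment with the highest cosine similarity, which corresponds to the segment-matching evaluation the abstract describes.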