Paper Title
Decoding speech perception from non-invasive brain recordings
Paper Authors
Paper Abstract
Decoding speech from brain activity is a long-awaited goal in both healthcare and neuroscience. Invasive devices have recently led to major milestones in that regard: deep learning algorithms trained on intracranial recordings now begin to decode elementary linguistic features (e.g. letters, words, spectrograms). However, extending this approach to natural speech and non-invasive brain recordings remains a major challenge. Here, we introduce a model trained with contrastive learning to decode self-supervised representations of perceived speech from the non-invasive recordings of a large cohort of healthy individuals. To evaluate this approach, we curate and integrate four public datasets, encompassing 175 volunteers recorded with magneto- or electro-encephalography (M/EEG) while they listened to short stories and isolated sentences. The results show that our model can identify, from 3 seconds of MEG signals, the corresponding speech segment with up to 41% accuracy out of more than 1,000 distinct possibilities on average across participants, and more than 80% in the very best participants, a performance that allows the decoding of words and phrases absent from the training set. The comparison of our model to a variety of baselines highlights the importance of (i) a contrastive objective, (ii) pretrained representations of speech and (iii) a common convolutional architecture trained simultaneously across multiple participants. Finally, the analysis of the decoder's predictions suggests that they primarily depend on lexical and contextual semantic representations. Overall, this effective decoding of perceived speech from non-invasive recordings delineates a promising path to decoding language from brain activity, without putting patients at risk of brain surgery.
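To make the contrastive objective concrete, below is a minimal NumPy sketch of a CLIP-style (InfoNCE) loss that aligns brain-encoder outputs with pretrained speech embeddings and reports top-1 retrieval accuracy within a candidate pool. The function names, latent dimension, batch size and temperature are illustrative assumptions for exposition, not the authors' exact implementation or hyperparameters.

```python
import numpy as np


def logsumexp(x, axis=None, keepdims=False):
    """Numerically stable log-sum-exp."""
    m = x.max(axis=axis, keepdims=True)
    out = m + np.log(np.exp(x - m).sum(axis=axis, keepdims=True))
    return out if keepdims else np.squeeze(out, axis=axis)


def clip_style_contrastive_loss(brain_latents, speech_latents, temperature=0.1):
    """CLIP-style (InfoNCE) objective: each brain window should be most
    similar to its own speech segment among all segments in the batch.

    brain_latents, speech_latents: (batch, dim) arrays of matched pairs
    (hypothetical shapes, not the paper's exact dimensions).
    Returns the contrastive loss and the top-1 retrieval accuracy.
    """
    # L2-normalise so the dot product is a cosine similarity.
    b = brain_latents / np.linalg.norm(brain_latents, axis=1, keepdims=True)
    s = speech_latents / np.linalg.norm(speech_latents, axis=1, keepdims=True)

    logits = (b @ s.T) / temperature          # (batch, batch) similarity matrix
    targets = np.arange(logits.shape[0])      # matching pairs lie on the diagonal

    # Softmax cross-entropy of each brain window against all speech candidates.
    log_probs = logits - logsumexp(logits, axis=1, keepdims=True)
    loss = -log_probs[targets, targets].mean()

    # How often the true segment is ranked first among the candidates.
    accuracy = float((logits.argmax(axis=1) == targets).mean())
    return loss, accuracy


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    brain = rng.normal(size=(8, 256))    # hypothetical brain-encoder outputs for 8 windows
    speech = rng.normal(size=(8, 256))   # hypothetical pooled speech embeddings (e.g. wav2vec-style)
    loss, acc = clip_style_contrastive_loss(brain, speech)
    print(f"loss={loss:.3f}  top-1 retrieval accuracy={acc:.2f}")
```

At evaluation time, the same similarity matrix can be scored against a much larger candidate pool (on the order of the 1,000+ segments mentioned in the abstract), which is how segment-identification accuracy is naturally phrased for this kind of retrieval objective.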