Paper Title

Enhanced Direct Speech-to-Speech Translation Using Self-supervised Pre-training and Data Augmentation

Paper Authors

Sravya Popuri, Peng-Jen Chen, Changhan Wang, Juan Pino, Yossi Adi, Jiatao Gu, Wei-Ning Hsu, Ann Lee

Paper Abstract

Direct speech-to-speech translation (S2ST) models suffer from data scarcity issues as there exists little parallel S2ST data, compared to the amount of data available for conventional cascaded systems that consist of automatic speech recognition (ASR), machine translation (MT), and text-to-speech (TTS) synthesis. In this work, we explore self-supervised pre-training with unlabeled speech data and data augmentation to tackle this issue. We take advantage of a recently proposed speech-to-unit translation (S2UT) framework that encodes target speech into discrete representations, and transfer pre-training and efficient partial finetuning techniques that work well for speech-to-text translation (S2T) to the S2UT domain by studying both speech encoder and discrete unit decoder pre-training. Our experiments on Spanish-English translation show that self-supervised pre-training consistently improves model performance compared with multitask learning with an average 6.6-12.1 BLEU gain, and it can be further combined with data augmentation techniques that apply MT to create weakly supervised training data. Audio samples are available at: https://facebookresearch.github.io/speech_translation/enhanced_direct_s2st_units/index.html .
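The abstract describes a speech-to-unit translation (S2UT) setup: a pre-trained speech encoder feeds an autoregressive decoder that predicts discrete target-speech units, and only part of the network is fine-tuned. The sketch below illustrates that overall shape in PyTorch. The class name `S2UTSketch`, the 80-dim filterbank input, the layer sizes, and the `freeze_lower_encoder_layers` policy are illustrative assumptions for exposition; they are not the paper's actual wav2vec 2.0 encoder / unit-mBART decoder implementation in fairseq.

```python
# Minimal sketch of a speech-to-unit translation (S2UT) skeleton in PyTorch.
# Module names, sizes, and the partial-freezing policy are illustrative assumptions,
# not the authors' implementation.

import torch
import torch.nn as nn


class S2UTSketch(nn.Module):
    """Speech encoder + discrete-unit decoder, with optional partial freezing."""

    def __init__(self, num_units=1000, d_model=512, nhead=8,
                 num_encoder_layers=6, num_decoder_layers=6):
        super().__init__()
        # Stand-in for a self-supervised speech encoder: here we simply project
        # 80-dim filterbank frames; a real system would load pre-trained weights.
        self.feature_proj = nn.Linear(80, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_encoder_layers)

        # Discrete-unit decoder: an autoregressive Transformer over a unit
        # vocabulary (e.g., cluster IDs of self-supervised target-speech features).
        self.unit_embed = nn.Embedding(num_units, d_model)
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_decoder_layers)
        self.output_proj = nn.Linear(d_model, num_units)

    def freeze_lower_encoder_layers(self, num_frozen):
        # Partial fine-tuning: keep the lowest encoder layers fixed and update
        # only the upper layers and the decoder (one possible policy).
        for layer in self.encoder.layers[:num_frozen]:
            for p in layer.parameters():
                p.requires_grad = False

    def forward(self, src_feats, tgt_units):
        # src_feats: (batch, src_len, 80) source-speech features
        # tgt_units: (batch, tgt_len) previously generated target unit IDs
        memory = self.encoder(self.feature_proj(src_feats))
        tgt = self.unit_embed(tgt_units)
        tgt_len = tgt.size(1)
        # Causal mask so each position only attends to earlier unit predictions.
        causal_mask = torch.triu(
            torch.full((tgt_len, tgt_len), float("-inf")), diagonal=1)
        out = self.decoder(tgt, memory, tgt_mask=causal_mask)
        return self.output_proj(out)  # logits over the discrete unit vocabulary


if __name__ == "__main__":
    model = S2UTSketch()
    model.freeze_lower_encoder_layers(num_frozen=3)
    feats = torch.randn(2, 100, 80)          # dummy source-speech features
    units = torch.randint(0, 1000, (2, 50))  # dummy target unit IDs
    logits = model(feats, units)
    print(logits.shape)  # torch.Size([2, 50, 1000])
```

In the full system described by the abstract, the predicted unit sequence would then be passed to a separate unit-to-waveform vocoder to produce target speech.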
