Paper Title
Adaptive Feature Selection for End-to-End Speech Translation
Authors
Abstract
Information in speech signals is not evenly distributed, which makes it an additional challenge for end-to-end (E2E) speech translation (ST) to learn to focus on informative features. In this paper, we propose adaptive feature selection (AFS) for encoder-decoder based E2E ST. We first pre-train an ASR encoder and apply AFS to dynamically estimate the importance of each encoded speech feature for ASR. An ST encoder, stacked on top of the ASR encoder, then receives the filtered features from the (frozen) ASR encoder. We take L0DROP (Zhang et al., 2020) as the backbone for AFS, and adapt it to sparsify speech features with respect to both the temporal and feature dimensions. Results on the LibriSpeech En-Fr and MuST-C benchmarks show that AFS facilitates learning of ST by pruning out ~84% of temporal features, yielding an average translation gain of ~1.3-1.6 BLEU and a decoding speedup of ~1.4x. In particular, AFS reduces the performance gap compared to the cascade baseline, and outperforms it on LibriSpeech En-Fr with a BLEU score of 18.56 (without data augmentation).
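The temporal filtering step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a hard-concrete style gate of the kind commonly used for L0 regularization, and the names `gate`, `select_temporal`, `GAMMA`, `ZETA`, and the toy scores are all hypothetical. Each encoded frame receives a gate in [0, 1]; frames whose gate is exactly 0 are dropped, and the remaining frames are scaled by their gate before being passed on to the ST encoder.

```python
import math

# Stretch limits for the gate. These values are an assumption (common
# defaults in L0-regularization work), not taken from the abstract.
GAMMA, ZETA = -0.1, 1.1


def gate(log_alpha):
    """Deterministic gate in [0, 1] from a learned score (hypothetical name)."""
    s = 1.0 / (1.0 + math.exp(-log_alpha))
    # Stretch the sigmoid to (GAMMA, ZETA), then clip to [0, 1] so that
    # sufficiently low scores give a gate of exactly 0.
    return min(1.0, max(0.0, s * (ZETA - GAMMA) + GAMMA))


def select_temporal(features, log_alphas):
    """Drop frames whose gate is exactly 0 and scale the rest by their gate."""
    kept = []
    for frame, score in zip(features, log_alphas):
        g = gate(score)
        if g > 0.0:
            kept.append([g * x for x in frame])
    return kept


# Toy input: 4 encoded frames of dimension 2 with made-up gate scores.
features = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
log_alphas = [-5.0, 6.0, -4.0, 0.0]
filtered = select_temporal(features, log_alphas)
print(len(filtered), "of", len(features), "frames kept")
```

In this toy run the first and third frames are pruned, so the downstream encoder sees a shorter sequence. A shorter sequence is the kind of effect that could account for the decoding speedup the abstract reports, although the abstract itself does not spell out the mechanism.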