Paper Title
On the Impact of Noises in Crowd-Sourced Data for Speech Translation
Paper Authors
Abstract
Training speech translation (ST) models requires large, high-quality datasets. MuST-C is one of the most widely used ST benchmark datasets. It contains around 400 hours of speech-transcript-translation data for each of its eight translation directions. The dataset passed several quality-control filters during its creation. However, we find that MuST-C still suffers from three major quality issues: audio-text misalignment, inaccurate translations, and unnecessary speaker names. What impact do these data quality issues have on model development and evaluation? In this paper, we propose an automatic method to fix or filter out the above quality issues, using English-German (En-De) translation as an example. Our experiments show that ST models perform better on clean test sets, and that the ranking of the proposed models remains consistent across different test sets. Moreover, simply removing misaligned data points from the training set does not lead to a better ST model.