Paper Title


Using Rater and System Metadata to Explain Variance in the VoiceMOS Challenge 2022 Dataset

Paper Authors

Michael Chinen, Jan Skoglund, Chandan K. A. Reddy, Alessandro Ragano, Andrew Hines

Paper Abstract


Non-reference speech quality models are important for a growing number of applications. The VoiceMOS 2022 challenge provided a dataset of synthetic voice conversion and text-to-speech samples with subjective labels. This study looks at the amount of variance that can be explained in subjective ratings of speech quality from metadata and the distribution imbalances of the dataset. Speech quality models were constructed using wav2vec 2.0 with additional metadata features that included rater groups and system identifiers and obtained competitive metrics including a Spearman rank correlation coefficient (SRCC) of 0.934 and MSE of 0.088 at the system-level, and 0.877 and 0.198 at the utterance-level. Using data and metadata that the test restricted or blinded further improved the metrics. A metadata analysis showed that the system-level metrics do not represent the model's system-level prediction as a result of the wide variation in the number of utterances used for each system on the validation and test datasets. We conclude that, in general, conditions should have enough utterances in the test set to bound the sample mean error, and be relatively balanced in utterance count between systems, otherwise the utterance-level metrics may be more reliable and interpretable.
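
The abstract does not spell out how the metadata features enter the model. Below is a minimal sketch, assuming learned embeddings for the rater-group and system identifiers are concatenated with a time-pooled wav2vec 2.0 representation before a regression head; the checkpoint name, embedding sizes, and head layout here are illustrative assumptions, not the authors' architecture.

```python
# A minimal sketch (not the authors' implementation) of a wav2vec 2.0
# MOS predictor conditioned on rater-group and system-ID metadata.
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

class MetadataMOSPredictor(nn.Module):
    def __init__(self, n_rater_groups, n_systems, meta_dim=32):
        super().__init__()
        # Checkpoint choice is an assumption for illustration only.
        self.encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
        self.rater_emb = nn.Embedding(n_rater_groups, meta_dim)
        self.system_emb = nn.Embedding(n_systems, meta_dim)
        hidden = self.encoder.config.hidden_size
        self.head = nn.Linear(hidden + 2 * meta_dim, 1)

    def forward(self, waveform, rater_group, system_id):
        # waveform: (batch, samples) raw 16 kHz audio
        feats = self.encoder(waveform).last_hidden_state  # (batch, frames, hidden)
        pooled = feats.mean(dim=1)                        # average-pool over time
        meta = torch.cat([self.rater_emb(rater_group),
                          self.system_emb(system_id)], dim=-1)
        return self.head(torch.cat([pooled, meta], dim=-1)).squeeze(-1)
```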
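The distinction between the reported system-level and utterance-level metrics can also be made concrete. The sketch below assumes the usual VoiceMOS convention that system-level scores are per-system means of utterance-level scores; the function name and return layout are illustrative.

```python
# A minimal sketch of utterance-level vs. system-level SRCC/MSE,
# assuming system-level scores are per-system means of utterance scores.
import numpy as np
from scipy.stats import spearmanr

def mos_metrics(pred, true, system_ids):
    """pred, true: per-utterance MOS arrays; system_ids: parallel system labels."""
    pred = np.asarray(pred, dtype=float)
    true = np.asarray(true, dtype=float)
    system_ids = np.asarray(system_ids)

    # Utterance-level metrics: computed directly over all utterances.
    utt_srcc, _ = spearmanr(pred, true)
    utt_mse = float(np.mean((pred - true) ** 2))

    # System-level metrics: average within each system, then compare means.
    systems = np.unique(system_ids)
    sys_pred = np.array([pred[system_ids == s].mean() for s in systems])
    sys_true = np.array([true[system_ids == s].mean() for s in systems])
    sys_srcc, _ = spearmanr(sys_pred, sys_true)
    sys_mse = float(np.mean((sys_pred - sys_true) ** 2))

    # Per-system utterance counts: each per-system mean has standard error
    # roughly sigma/sqrt(n), so small-n systems contribute noisy means.
    counts = {s: int(np.sum(system_ids == s)) for s in systems}
    return utt_srcc, utt_mse, sys_srcc, sys_mse, counts
```

Because each per-system mean has standard error on the order of sigma/sqrt(n), systems represented by only a handful of test utterances contribute noisy means, which is why the abstract argues that utterance counts must be large enough to bound the sample mean error and reasonably balanced across systems for the system-level metrics to be interpretable.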
