Toward Zero Oracle Word Error Rate on the Switchboard Benchmark

Paper title

Toward Zero Oracle Word Error Rate on the Switchboard Benchmark

Authors

Arlo Faria, Adam Janin, Korbinian Riedhammer, Sidhi Adkoli

Abstract

The "Switchboard benchmark" is a very well-known test set in automatic speech recognition (ASR) research, establishing record-setting performance for systems that claim human-level transcription accuracy. This work highlights lesser-known practical considerations of this evaluation, demonstrating major improvements in word error rate (WER) by correcting the reference transcriptions and deviating from the official scoring methodology. In this more detailed and reproducible scheme, even commercial ASR systems can score below 5% WER and the established record for a research system is lowered to 2.3%. An alternative metric of transcript precision is proposed, which does not penalize deletions and appears to be more discriminating for human vs. machine performance. While commercial ASR systems are still below this threshold, a research system is shown to clearly surpass the accuracy of commercial human speech recognition. This work also explores using standardized scoring tools to compute oracle WER by selecting the best among a list of alternatives. A phrase alternatives representation is compared to utterance-level N-best lists and word-level data structures; using dense lattices and adding out-of-vocabulary words, this achieves an oracle WER of 0.18%.
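To make the two headline quantities concrete, here is a minimal sketch of how WER and oracle WER over utterance-level N-best lists are commonly computed. It is an illustration under simplifying assumptions, not the paper's scoring pipeline: the function names (`edit_distance`, `wer`, `oracle_wer`) and the toy data are hypothetical, words are split on whitespace, and none of the text normalization or alternation handling performed by standardized scoring tools is reproduced.

```python
from typing import List, Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Word-level Levenshtein distance; each substitution, deletion,
    and insertion costs 1."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion: reference word has no match
                curr[j - 1] + 1,         # insertion: extra hypothesis word
                prev[j - 1] + (r != h),  # match (cost 0) or substitution (cost 1)
            ))
        prev = curr
    return prev[-1]


def wer(refs: List[str], hyps: List[str]) -> float:
    """Total word errors divided by total reference words."""
    errors = sum(edit_distance(r.split(), h.split()) for r, h in zip(refs, hyps))
    words = sum(len(r.split()) for r in refs)
    return errors / words


def oracle_wer(refs: List[str], nbest_lists: List[List[str]]) -> float:
    """WER obtained when, for every utterance, the alternative closest to
    the reference is selected from that utterance's N-best list."""
    errors = sum(
        min(edit_distance(r.split(), h.split()) for h in nbest)
        for r, nbest in zip(refs, nbest_lists)
    )
    words = sum(len(r.split()) for r in refs)
    return errors / words


# Toy data, invented purely for illustration.
refs = ["the cat sat", "hello world"]
nbest_lists = [
    ["the cat sat down", "a cat sat"],
    ["hello word", "hello world"],
]
one_best = [nbest[0] for nbest in nbest_lists]

print(f"1-best WER: {wer(refs, one_best):.2f}")
print(f"oracle WER: {oracle_wer(refs, nbest_lists):.2f}")
```

Because the oracle selects the best alternative per utterance, it can never exceed the 1-best WER. A richer representation such as a dense lattice offers more alternatives to choose from than a short N-best list, which pushes the oracle figure lower still.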

Paper title