Paper Title


Punctuation Restoration for Singaporean Spoken Languages: English, Malay, and Mandarin

Authors

Abhinav Rao, Thi-Nga Ho, Eng-Siong Chng

Abstract


This paper presents work on restoring punctuation in transcripts generated by multilingual ASR systems. The focus languages are English, Mandarin, and Malay, three of the most popular languages in Singapore. To the best of our knowledge, this is the first system that can tackle punctuation restoration for these three languages simultaneously. Traditional approaches usually treat the task as a sequence-labeling task; this work instead adopts a slot-filling approach that predicts the presence and type of punctuation mark at each word boundary. The approach is similar to the masked-language-model objective employed during BERT pre-training, but instead of predicting a masked word, our model predicts masked punctuation. Additionally, we find that using Jieba instead of only the built-in SentencePiece tokenizer of XLM-R significantly improves performance when punctuating Mandarin transcripts. Experimental results on the English and Mandarin IWSLT2022 datasets and Malay News show that the proposed approach achieves state-of-the-art results for Mandarin with a 73.8% F1-score while maintaining reasonable F1-scores for English and Malay, i.e., 74.7% and 78% respectively. Our source code, which allows reproducing the results and building a simple web-based application for demonstration purposes, is available on GitHub.
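The slot-filling formulation described above can be illustrated with a minimal sketch: insert a mask slot after every word, then classify each slot as a punctuation mark or "none". The model itself is not part of this page, so the predictor below is a hypothetical stand-in (`toy_predictor`); the label set and `<mask>` token are illustrative assumptions, not the paper's exact implementation.

```python
# Sketch of the slot-filling approach: one mask slot per word boundary,
# each slot classified as a punctuation mark or NONE.
MASK = "<mask>"
PUNCT = {"COMMA": ",", "PERIOD": ".", "QUESTION": "?"}  # assumed label set


def build_slotted_input(words):
    """Interleave a mask slot after every word (the MLM-style input)."""
    tokens = []
    for w in words:
        tokens.append(w)
        tokens.append(MASK)
    return tokens


def restore_punctuation(words, predict_slot):
    """predict_slot(words, i) returns the label for the slot after words[i].

    In the paper this would be a fine-tuned XLM-R classifier; here it is
    any callable, so the sketch stays self-contained.
    """
    pieces = []
    for i, w in enumerate(words):
        pieces.append(w)
        label = predict_slot(words, i)
        if label in PUNCT:
            pieces[-1] = w + PUNCT[label]  # attach mark to preceding word
    return " ".join(pieces)


def toy_predictor(words, i):
    # Trivial stand-in: end the utterance with a period, nothing elsewhere.
    return "PERIOD" if i == len(words) - 1 else "NONE"


print(build_slotted_input(["hello", "world"]))
print(restore_punctuation(["hello", "world"], toy_predictor))
```

For Mandarin, the paper's finding is that pre-segmenting the character stream with Jieba (so that slots fall at word rather than subword boundaries) before XLM-R's SentencePiece tokenization improves results; the same `restore_punctuation` loop would then run over Jieba's word list.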
