Paper Title
PheMT: A Phenomenon-wise Dataset for Machine Translation Robustness on User-Generated Contents
Paper Authors
论文摘要
Neural Machine Translation (NMT) has shown drastic improvement in its quality when translating clean input, such as text from the news domain. However, existing studies suggest that NMT still struggles with certain kinds of input with considerable noise, such as User-Generated Contents (UGC) on the Internet. To make better use of NMT for cross-cultural communication, one of the most promising directions is to develop a model that correctly handles these expressions. Though its importance has been recognized, it is still not clear as to what creates the great gap in performance between the translation of clean input and that of UGC. To answer the question, we present a new dataset, PheMT, for evaluating the robustness of MT systems against specific linguistic phenomena in Japanese-English translation. Our experiments with the created dataset revealed that not only our in-house models but even widely used off-the-shelf systems are greatly disturbed by the presence of certain phenomena.
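The abstract describes measuring robustness by how much a system is disturbed when a specific phenomenon is present in the input. Below is a minimal sketch of one way such a phenomenon-wise comparison could be run: translate sentences containing a phenomenon and their normalized counterparts, then compare corpus-level BLEU with sacrebleu. The file name, column layout, and the `translate` placeholder are assumptions for illustration, not the dataset's actual format or the authors' evaluation code.

```python
# Sketch of a phenomenon-wise robustness check in the spirit of PheMT:
# compare translation quality on inputs with a phenomenon vs. normalized inputs.
# File name, column layout, and `translate` are hypothetical placeholders.
import csv

import sacrebleu  # pip install sacrebleu


def translate(sentences):
    """Placeholder for any Japanese-to-English MT system under evaluation."""
    raise NotImplementedError("plug in an in-house NMT model or an off-the-shelf system here")


def load_tsv(path):
    # Assumed columns: source with the phenomenon, normalized source, English reference.
    with open(path, encoding="utf-8") as f:
        rows = list(csv.reader(f, delimiter="\t"))
    orig_src, norm_src, refs = zip(*rows)
    return list(orig_src), list(norm_src), list(refs)


def robustness_gap(path):
    orig_src, norm_src, refs = load_tsv(path)
    bleu_orig = sacrebleu.corpus_bleu(translate(orig_src), [refs]).score
    bleu_norm = sacrebleu.corpus_bleu(translate(norm_src), [refs]).score
    # A large drop from normalized to original input suggests the system is
    # disturbed by this particular phenomenon.
    return bleu_norm - bleu_orig


if __name__ == "__main__":
    print(f"BLEU drop caused by the phenomenon: {robustness_gap('emoji.tsv'):.2f}")
```

The gap between the two scores serves as a per-phenomenon robustness indicator; any correlation with the paper's reported results would need to be verified against the released dataset and its documented format.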