hinglishNorm -- A Corpus of Hindi-English Code Mixed Sentences for Text Normalization

Paper Title

hinglishNorm -- A Corpus of Hindi-English Code Mixed Sentences for Text Normalization

Authors

Piyush Makhija, Ankit Kumar, Anuj Gupta

Abstract

We present hinglishNorm, a human-annotated corpus of Hindi-English code-mixed sentences for the text normalization task. Each sentence in the corpus is aligned to its corresponding human-annotated normalized form. To the best of our knowledge, no corpus of Hindi-English code-mixed sentences for text normalization is publicly available, and our work is the first attempt in this direction. The corpus contains 13494 parallel segments. We also present baseline normalization results on this corpus: a Word Error Rate (WER) of 15.55, a BiLingual Evaluation Understudy (BLEU) score of 71.2, and a Metric for Evaluation of Translation with Explicit ORdering (METEOR) score of 0.50.
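
For readers unfamiliar with the first metric, WER is conventionally the word-level edit distance between a system output and its reference, divided by the number of reference words. The sketch below is a minimal illustration of that standard definition, not the authors' scoring script. The function name `word_error_rate`, the whitespace tokenization, and the scaling to a percentage are all assumptions made here; the abstract does not say how the reported 15.55 was aggregated over the corpus.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance over reference length, as a percentage.

    Hypothetical helper for illustration only. It assumes whitespace
    tokenization and per-sentence scoring.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference prefix processed
    # so far (before the current word) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # distance from i reference words to an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from hypothesis
                curr[j - 1] + 1,     # extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return 100.0 * prev[-1] / len(ref)
```

Under these assumptions, a four-word reference whose normalized output gets one word wrong scores 25.0, and an exact match scores 0.0. Lower is better for WER, whereas higher is better for BLEU and METEOR.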
