MACRONYM: A Large-Scale Dataset for Multilingual and Multi-Domain Acronym Extraction

Paper Title

MACRONYM: A Large-Scale Dataset for Multilingual and Multi-Domain Acronym Extraction

Authors

Amir Pouran Ben Veyseh, Nicole Meister, Seunghyun Yoon, Rajiv Jain, Franck Dernoncourt, Thien Huu Nguyen

Abstract

Acronym extraction (AE) is the task of identifying acronyms and their expanded forms in text, a step required by various NLP applications. Despite major progress on this task in recent years, existing AE research is limited to English and to certain domains (i.e., scientific and biomedical). As such, the challenges of AE in other languages and domains remain largely unexplored. The lack of annotated datasets in multiple languages and domains has been a major obstacle to research in this area. To address this limitation, we propose a new dataset for multilingual, multi-domain AE. Specifically, 27,200 sentences in 6 typologically different languages and 2 domains, i.e., Legal and Scientific, are manually annotated for AE. Our extensive experiments on the proposed dataset show that AE in different languages and different learning settings has unique challenges, emphasizing the necessity of further research on multilingual and multi-domain AE.
