Paper Title

Ensembling Transformers for Cross-domain Automatic Term Extraction

Paper Authors

Hanh Thi Hong Tran, Matej Martinc, Andraž Pelicon, Antoine Doucet, Senja Pollak

Paper Abstract

Automatic term extraction plays an essential role in domain language understanding and in several downstream natural language processing tasks. In this paper, we present a comparative study of the predictive power of Transformer-based pretrained language models for term extraction in a multilingual, cross-domain setting. Besides evaluating the ability of monolingual models to extract single- and multi-word terms, we also experiment with ensembles of mono- and multilingual models, formed by taking the intersection or union of the term output sets of different language models. Our experiments were conducted on the ACTER corpus, covering four specialized domains (Corruption, Wind energy, Equitation, and Heart failure) and three languages (English, French, and Dutch), and on the RSDO5 Slovenian corpus, covering four additional domains (Biomechanics, Chemistry, Veterinary, and Linguistics). The results show that the strategy of employing monolingual models outperforms the state-of-the-art approaches from related work that leverage multilingual models, for all languages except Dutch and French when the term extraction task excludes the extraction of named-entity terms. Furthermore, by combining the outputs of the two best-performing models, we achieve significant improvements.
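The ensembling strategy described above reduces to simple set operations on each model's extracted terms. A minimal sketch, assuming each model's output is already available as a set of candidate term strings (the function name and example term sets below are illustrative, not from the paper):

```python
def ensemble_terms(model_outputs, strategy="union"):
    """Combine candidate-term sets from multiple models.

    strategy="union" favors recall (a term kept if any model extracts it);
    strategy="intersection" favors precision (all models must agree).
    """
    sets = [set(terms) for terms in model_outputs]
    if not sets:
        return set()
    if strategy == "union":
        return set.union(*sets)
    if strategy == "intersection":
        return set.intersection(*sets)
    raise ValueError(f"unknown strategy: {strategy}")


# Hypothetical outputs from two term-extraction models on a wind-energy text:
model_a = {"wind turbine", "rotor blade", "nacelle"}
model_b = {"wind turbine", "nacelle", "yaw system"}

print(sorted(ensemble_terms([model_a, model_b], "union")))
# → ['nacelle', 'rotor blade', 'wind turbine', 'yaw system']
print(sorted(ensemble_terms([model_a, model_b], "intersection")))
# → ['nacelle', 'wind turbine']
```

The union/intersection trade-off mirrors the usual recall/precision trade-off in term extraction evaluation.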
