Paper Title

T-Projection: High Quality Annotation Projection for Sequence Labeling Tasks

Authors

Iker García-Ferrero, Rodrigo Agerri, German Rigau

Abstract

In the absence of readily available labeled data for a given sequence labeling task and language, annotation projection has been proposed as one of the possible strategies to automatically generate annotated data. Annotation projection has often been formulated as the task of transporting, on parallel corpora, the labels pertaining to a given span in the source language into its corresponding span in the target language. In this paper we present T-Projection, a novel approach for annotation projection that leverages large pretrained text-to-text language models and state-of-the-art machine translation technology. T-Projection decomposes the label projection task into two subtasks: (i) a candidate generation step, in which a set of projection candidates is generated using a multilingual T5 model, and (ii) a candidate selection step, in which the generated candidates are ranked based on translation probabilities. We conducted experiments on intrinsic and extrinsic tasks in 5 Indo-European and 8 low-resource African languages. We demonstrate that T-Projection outperforms previous annotation projection methods by a wide margin. We believe that T-Projection can help to automatically alleviate the lack of high-quality training data for sequence labeling tasks. Code and data are publicly available.
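
The candidate selection step can be pictured with a small sketch. The abstract does not say how translation probabilities are computed, so the scorer below is a hypothetical stand-in: `rank_candidates`, `toy_logprob`, and `TOY_PROBS` are illustrative names and made-up values, not part of the released code. In a real pipeline the candidates would presumably come from the multilingual T5 model and the scores from a machine translation model.

```python
import math
from typing import Callable, List, Sequence


def rank_candidates(
    source_span: str,
    candidates: Sequence[str],
    score: Callable[[str, str], float],
) -> List[str]:
    """Return the candidates sorted from most to least likely translation."""
    if not candidates:
        raise ValueError("at least one projection candidate is required")
    return sorted(candidates, key=lambda c: score(source_span, c), reverse=True)


# Hypothetical probabilities for illustration only (assumption, not real data).
TOY_PROBS = {
    ("New York", "Nueva York"): 0.80,
    ("New York", "York"): 0.15,
    ("New York", "la ciudad"): 0.01,
}


def toy_logprob(source_span: str, candidate: str) -> float:
    """Stand-in for a translation log-probability; unseen pairs get a small floor."""
    return math.log(TOY_PROBS.get((source_span, candidate), 1e-6))


ranked = rank_candidates(
    "New York", ["York", "la ciudad", "Nueva York"], toy_logprob
)
best = ranked[0]  # the span that would receive the source label
```

Passing the scorer in as a function keeps the ranking logic independent of whichever translation model supplies the probabilities.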
