Exploiting Cross-Dialectal Gold Syntax for Low-Resource Historical Languages: Towards a Generic Parser for Pre-Modern Slavic

Paper title

Exploiting Cross-Dialectal Gold Syntax for Low-Resource Historical Languages: Towards a Generic Parser for Pre-Modern Slavic

Author

Pedrazzini, Nilo

Abstract

This paper explores the possibility of improving the performance of specialized parsers for pre-modern Slavic by training them on data from different related varieties. Because of their linguistic heterogeneity, pre-modern Slavic varieties are treated as low-resource historical languages, for which cross-dialectal treebank data may be exploited to overcome data scarcity and to attempt the training of a variety-agnostic parser. Previous experiments on early Slavic dependency parsing are discussed, particularly with regard to their ability to handle different orthographic, regional and stylistic features. A generic pre-modern Slavic parser and two specialized parsers (one for East Slavic and one for South Slavic) are trained using jPTDP (Nguyen & Verspoor 2018), a neural network model for joint part-of-speech (POS) tagging and dependency parsing which has shown promising results on a number of Universal Dependencies (UD) treebanks, including Old Church Slavonic (OCS). With these experiments, a new state of the art is obtained for both OCS (83.79% unlabelled attachment score (UAS) and 78.43% labelled attachment score (LAS)) and Old East Slavic (OES) (85.7% UAS and 80.16% LAS).
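
For readers unfamiliar with the two metrics reported above, the following is a minimal sketch of how UAS and LAS are conventionally computed from gold and predicted dependency trees. It is not the paper's evaluation script: the function name `attachment_scores`, the token representation, and the toy sentence are assumptions made purely for illustration, and real evaluations differ in details such as punctuation handling.

```python
from typing import List, Tuple

# Hypothetical token representation: (head index, dependency label).
# Head index 0 stands for the artificial root node.
Token = Tuple[int, str]


def attachment_scores(gold: List[Token], pred: List[Token]) -> Tuple[float, float]:
    """Return (UAS, LAS) as percentages over aligned gold/predicted tokens."""
    if len(gold) != len(pred):
        raise ValueError("gold and predicted sequences must have the same length")
    if not gold:
        raise ValueError("cannot score an empty sequence")

    # UAS: the predicted head matches the gold head.
    head_hits = sum(1 for (g_head, _), (p_head, _) in zip(gold, pred) if g_head == p_head)
    # LAS: both the head and the dependency label match.
    labelled_hits = sum(1 for g, p in zip(gold, pred) if g == p)

    total = len(gold)
    return 100.0 * head_hits / total, 100.0 * labelled_hits / total


# Toy four-token sentence (invented data, not taken from any treebank).
gold = [(2, "nsubj"), (0, "root"), (2, "obj"), (3, "amod")]
pred = [(2, "nsubj"), (0, "root"), (2, "obl"), (2, "amod")]

uas, las = attachment_scores(gold, pred)
print(f"UAS = {uas:.2f}%, LAS = {las:.2f}%")
```

In the toy example three of four heads are correct and two of four head-label pairs are correct, so LAS can never exceed UAS, which matches the pattern in the scores reported in the abstract.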
