Paper Title

Can Multilingual Language Models Transfer to an Unseen Dialect? A Case Study on North African Arabizi

Paper Authors

Benjamin Muller, Benoit Sagot, Djamé Seddah

Paper Abstract

Building natural language processing systems for non-standardized and low-resource languages is a difficult challenge. The recent success of large-scale multilingual pretrained language models provides new modeling tools to tackle this problem. In this work, we study the ability of multilingual language models to process an unseen dialect. We take user-generated North-African Arabic as our case study, a resource-poor dialectal variety of Arabic with frequent code-mixing with French, written in Arabizi, a non-standardized transliteration of Arabic to Latin script. Focusing on two tasks, part-of-speech tagging and dependency parsing, we show in zero-shot and unsupervised adaptation scenarios that multilingual language models are able to transfer to such an unseen dialect, specifically in two extreme cases: (i) across scripts, using Modern Standard Arabic as a source language, and (ii) from a distantly related language, unseen during pretraining, namely Maltese. Our results constitute the first successful transfer experiments on this dialect, thus paving the way for the development of an NLP ecosystem for resource-scarce, non-standardized and highly variable vernacular languages.
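
To make the transfer setup concrete, here is a minimal sketch (not the authors' implementation) of the kind of zero-shot cross-lingual transfer the abstract describes, using HuggingFace transformers and mBERT for POS tagging. The example sentences and the truncated tag set are hypothetical placeholders.

```python
# Minimal sketch of zero-shot transfer: fine-tune a multilingual
# encoder (mBERT) for POS tagging on a source language seen during
# pretraining (e.g. Modern Standard Arabic), then apply it unchanged
# to the unseen target dialect (North African Arabizi, Latin script).
# All sentences and the truncated tag set are hypothetical placeholders.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

UPOS = ["NOUN", "VERB", "PRON", "ADP", "PUNCT"]  # truncated for illustration
tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=len(UPOS)
)

def encode(words, tags):
    """Tokenize pre-split words and align word-level tags to subwords,
    scoring only the first subword of each word (-100 is ignored)."""
    enc = tok(words, is_split_into_words=True,
              return_tensors="pt", truncation=True)
    labels, prev = [], None
    for wid in enc.word_ids():
        labels.append(-100 if wid is None or wid == prev
                      else UPOS.index(tags[wid]))
        prev = wid
    enc["labels"] = torch.tensor([labels])
    return enc

# 1) One fine-tuning step on source-language data (placeholder MSA).
src = encode(["هذا", "مثال"], ["PRON", "NOUN"])
opt = torch.optim.AdamW(model.parameters(), lr=5e-5)
model.train()
opt.zero_grad()
model(**src).loss.backward()
opt.step()

# 2) Zero-shot prediction on the unseen dialect: no Arabizi labels used.
words = ["hadi", "triciti"]  # placeholder Arabizi tokens
tgt = tok(words, is_split_into_words=True, return_tensors="pt")
model.eval()
with torch.no_grad():
    logits = model(**tgt).logits[0]
ids = tgt.word_ids()
for i, wid in enumerate(ids):
    if wid is not None and wid != ids[i - 1]:
        print(words[wid], UPOS[logits[i].argmax().item()])
```

The unsupervised adaptation scenario mentioned in the abstract would additionally adapt the encoder on raw target-dialect text (e.g. continued masked-language-model pretraining) before this fine-tuning step.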
