Title

Neural Polysynthetic Language Modelling

Authors

Schwartz, Lane, Tyers, Francis, Levin, Lori, Kirov, Christo, Littell, Patrick, Lo, Chi-kiu, Prud'hommeaux, Emily, Park, Hyunji Hayley, Steimel, Kenneth, Knowles, Rebecca, Micher, Jeffrey, Strunk, Lonny, Liu, Han, Haley, Coleman, Zhang, Katherine J., Jimmerson, Robbie, Andriyanets, Vasilisa, Muis, Aldrian Obaja, Otani, Naoki, Park, Jong Hyuk, Zhang, Zhisong

Abstract

Research in natural language processing commonly assumes that approaches that work well for English and other widely-used languages are "language agnostic". In high-resource languages, especially those that are analytic, a common approach is to treat morphologically-distinct variants of a common root as completely independent word types. This assumes that each root has a limited number of morphological inflections, and that the majority will appear in a large enough corpus, so that the model can adequately learn statistics about each form. Approaches like stemming, lemmatization, or subword segmentation are often used when either of those assumptions does not hold, particularly in the case of synthetic languages like Spanish or Russian that have more inflection than English. In the literature, languages like Finnish or Turkish are held up as extreme examples of complexity that challenge common modelling assumptions. Yet, when considering all of the world's languages, Finnish and Turkish are closer to the average case. When we consider polysynthetic languages (those at the extreme of morphological complexity), approaches like stemming, lemmatization, or subword modelling may not suffice. These languages have very high numbers of hapax legomena, showing the need for appropriate morphological handling of words, without which it is not possible for a model to capture enough word statistics. We examine the current state-of-the-art in language modelling, machine translation, and text prediction for four polysynthetic languages: Guaraní, St. Lawrence Island Yupik, Central Alaskan Yupik, and Inuktitut. We then propose a novel framework for language modelling that combines knowledge representations from finite-state morphological analyzers with Tensor Product Representations in order to enable neural language models capable of handling the full range of typologically variant languages.
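To make the Tensor Product Representation idea concrete, the following is a minimal sketch (not the paper's actual model) of TPR binding and unbinding: a structured object is encoded as a sum of outer products of filler vectors and role vectors, and with orthonormal roles each filler can be recovered exactly. The slot names, dimensions, and random filler embeddings below are illustrative assumptions only.

```python
import numpy as np

# Hedged TPR sketch: encode a word's morphological analysis as
#   T = sum_i f_i (outer) r_i
# where f_i is a filler vector (a morpheme embedding) and r_i is
# the role vector for its morphological slot. Slot names, the
# dimension, and the random fillers are illustrative assumptions,
# not values from the paper.

rng = np.random.default_rng(0)
dim = 8

# Orthonormal role vectors for three hypothetical slots:
# root, derivational suffix, inflectional suffix.
roles = np.eye(dim)[:3]

# Random vectors standing in for learned morpheme embeddings.
fillers = rng.standard_normal((3, dim))

# Bind each filler to its role and superimpose into one tensor.
T = sum(np.outer(f, r) for f, r in zip(fillers, roles))

# With orthonormal roles, unbinding is a matrix-vector product:
# T @ r_i recovers f_i exactly.
for f, r in zip(fillers, roles):
    assert np.allclose(T @ r, f)
```

Because the roles are orthonormal, the superimposed bindings do not interfere with one another, which is what lets a single fixed-size tensor carry a variable number of morphological slots.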
