Context based lemmatizer for Polish language

Title

Context based lemmatizer for Polish language

Authors

Karwatowski, Michal; Pietron, Marcin

Abstract

Lemmatisation is the process of grouping together the inflected forms of a word so that they can be analysed as a single item, identified by the word's lemma, or dictionary form. In computational linguistics, lemmatisation is the algorithmic process of determining the lemma of a word based on its intended meaning. Unlike stemming, lemmatisation depends on correctly identifying the intended part of speech and meaning of a word in a sentence, as well as within the larger context surrounding that sentence. As a result, developing an efficient lemmatisation algorithm is a complex task. In recent years, deep learning models used for this task have been observed to outperform other methods, including machine learning algorithms. In this paper, a Polish lemmatizer based on the Google T5 model is presented. Training was run with different context lengths. The model achieves the best results for Polish language lemmatisation.
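The abstract says training was run with different context lengths but does not give the input format. The sketch below is only an illustration of how a word and a variable amount of surrounding context might be packed into one text-to-text input for a T5-style model. The function name `build_input`, the `lemmatize: ... context: ...` layout, and the symmetric token window are all assumptions made for this example, not the authors' method.

```python
from typing import List


def build_input(tokens: List[str], index: int, context: int) -> str:
    """Build a text-to-text prompt for the token at position `index`.

    `context` is the number of neighbouring tokens kept on each side.
    The prompt layout is hypothetical; the paper's abstract does not
    specify how the word and its context are encoded.
    """
    if not 0 <= index < len(tokens):
        raise IndexError("index is outside the token list")
    if context < 0:
        raise ValueError("context must be non-negative")

    # Clip the window at the sentence boundaries.
    start = max(0, index - context)
    end = min(len(tokens), index + context + 1)
    window = " ".join(tokens[start:end])

    return f"lemmatize: {tokens[index]} context: {window}"


# "Ala ma dwa koty" ("Ala has two cats"); the target word is "koty".
tokens = "Ala ma dwa koty".split()
prompts = [build_input(tokens, 3, c) for c in (0, 1, 3)]
for prompt in prompts:
    print(prompt)
```

For the inflected form "koty", the training target paired with each of these inputs would be the lemma "kot". Varying `context` changes only how much of the sentence the model sees, which is one simple way to compare context lengths under these assumptions.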
