Paper Title

The SIGMORPHON 2022 Shared Task on Morpheme Segmentation

Authors

Khuyagbaatar Batsuren, Gábor Bella, Aryaman Arora, Viktor Martinović, Kyle Gorman, Zdeněk Žabokrtský, Amarsanaa Ganbold, Šárka Dohnalová, Magda Ševčíková, Kateřina Pelegrinová, Fausto Giunchiglia, Ryan Cotterell, Ekaterina Vylomova

Abstract

The SIGMORPHON 2022 shared task on morpheme segmentation challenged systems to decompose a word into a sequence of morphemes and covered most types of morphology: compounds, derivations, and inflections. Subtask 1, word-level morpheme segmentation, covered 5 million words in 9 languages (Czech, English, Spanish, Hungarian, French, Italian, Russian, Latin, Mongolian) and received 13 system submissions from 7 teams. The best system averaged a 97.29% F1 score across all languages, ranging from English (93.84%) to Latin (99.38%). Subtask 2, sentence-level morpheme segmentation, covered 18,735 sentences in 3 languages (Czech, English, Mongolian) and received 10 system submissions from 3 teams. The best systems outperformed all three state-of-the-art subword tokenization methods (BPE, ULM, Morfessor2) by 30.71% absolute. To facilitate error analysis and support future studies, we released all system predictions, the evaluation script, and all gold standard datasets.
