Paper title
The SIGMORPHON 2022 Shared Task on Morpheme Segmentation
Paper authors
Abstract
The SIGMORPHON 2022 shared task on morpheme segmentation challenged systems to decompose a word into a sequence of morphemes and covered most types of morphology: compounds, derivations, and inflections. Subtask 1, word-level morpheme segmentation, covered 5 million words in 9 languages (Czech, English, Spanish, Hungarian, French, Italian, Russian, Latin, Mongolian) and received 13 system submissions from 7 teams. The best system averaged a 97.29% F1 score across all languages, ranging from English (93.84%) to Latin (99.38%). Subtask 2, sentence-level morpheme segmentation, covered 18,735 sentences in 3 languages (Czech, English, Mongolian) and received 10 system submissions from 3 teams; the best systems outperformed all three state-of-the-art subword tokenization methods (BPE, ULM, Morfessor2) by 30.71% absolute. To facilitate error analysis and support future studies of any kind, we released all system predictions, the evaluation script, and all gold standard datasets.
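To make the reported F1 scores more concrete, the following is a minimal sketch of one common way to score a predicted segmentation against a gold one: micro-averaged precision, recall, and F1 over morphemes, with overlap counted as a multiset intersection per word. This is an illustrative assumption, not the task's released evaluation script, which may define the metric differently; the function name `morpheme_f1` and the example words are hypothetical.

```python
from collections import Counter


def morpheme_f1(gold, pred):
    """Micro-averaged precision, recall and F1 over morphemes.

    gold, pred: parallel lists of segmentations, each a list of morpheme
    strings. Hypothetical helper; not the shared task's official scorer.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same number of items")

    true_pos = n_pred = n_gold = 0
    for gold_morphs, pred_morphs in zip(gold, pred):
        # Multiset intersection: a morpheme counts once per matching occurrence.
        overlap = Counter(gold_morphs) & Counter(pred_morphs)
        true_pos += sum(overlap.values())
        n_pred += len(pred_morphs)
        n_gold += len(gold_morphs)

    precision = true_pos / n_pred if n_pred else 0.0
    recall = true_pos / n_gold if n_gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Made-up example: the first word is under-segmented, the second is correct.
gold = [["un", "break", "able"], ["cat", "s"]]
pred = [["un", "breakable"], ["cat", "s"]]
precision, recall, f1 = morpheme_f1(gold, pred)
print(f"P={precision:.4f} R={recall:.4f} F1={f1:.4f}")
```

Under this definition, under-segmenting a word lowers recall more than precision, because the merged morpheme matches nothing in the gold segmentation while the gold side still contributes all of its morphemes to the denominator.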