Paper Title

Learning to pronounce as measuring cross-lingual joint orthography-phonology complexity

Author

Rosati, Domenic

Abstract

Machine learning models allow us to compare languages by showing how hard a task in each language might be to learn and perform well on. Following this line of investigation, we explore what makes a language "hard to pronounce" by modelling the task of grapheme-to-phoneme (g2p) transliteration. By training a character-level transformer model on this task across 22 languages and measuring the model's proficiency against its grapheme and phoneme inventories, we show that certain characteristics emerge that separate easier and harder languages with respect to learning to pronounce. Namely, the complexity of deriving a language's pronunciation from its orthography is due to the expressiveness or simplicity of its grapheme-to-phoneme mapping. Further discussion illustrates how future studies should consider relative data sparsity per language to design fairer cross-lingual comparison tasks.
