Paper Title
Latin BERT: A Contextual Language Model for Classical Philology
Paper Authors
Paper Abstract
We present Latin BERT, a contextual language model for the Latin language, trained on 642.7 million words from a variety of sources spanning the Classical era to the 21st century. In a series of case studies, we illustrate the affordances of this language-specific model both for work in natural language processing for Latin and in using computational methods for traditional scholarship: we show that Latin BERT achieves a new state of the art for part-of-speech tagging on all three Universal Dependency datasets for Latin and can be used for predicting missing text (including critical emendations); we create a new dataset for assessing word sense disambiguation for Latin and demonstrate that Latin BERT outperforms static word embeddings; and we show that it can be used for semantically-informed search by querying contextual nearest neighbors. We publicly release trained models to help drive future work in this space.
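The abstract mentions two uses of the model that lend themselves to a short illustration: predicting missing text with the masked-language-modeling head, and semantically-informed search by comparing contextual vectors of words in context. The sketch below shows both, assuming a Hugging Face-compatible BERT checkpoint; the checkpoint path, the example sentence, and the helper function are illustrative assumptions, not the authors' released interface or code.

```python
# Minimal sketch (not the authors' code): masked-token prediction and a
# contextual word vector, using a BERT-style Latin checkpoint via the
# Hugging Face `transformers` API.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

MODEL_NAME = "path/to/latin-bert-checkpoint"  # hypothetical identifier

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForMaskedLM.from_pretrained(MODEL_NAME)
model.eval()

# 1) Predicting missing text: mask one token and rank candidate fills.
text = f"arma virumque cano, Troiae qui primus ab {tokenizer.mask_token}"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
top_ids = logits[0, mask_pos].topk(5, dim=-1).indices[0]
print(tokenizer.convert_ids_to_tokens(top_ids.tolist()))

# 2) Contextual vectors for nearest-neighbour search: embed a target word
#    in its sentence; cosine similarity against vectors from a corpus
#    (not shown here) yields contextual nearest neighbours.
def contextual_vector(sentence: str, target: str) -> torch.Tensor:
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc, output_hidden_states=True).hidden_states[-1][0]
    # Average the sub-word vectors that make up the target word.
    target_ids = tokenizer(target, add_special_tokens=False)["input_ids"]
    ids = enc["input_ids"][0].tolist()
    for i in range(len(ids) - len(target_ids) + 1):
        if ids[i:i + len(target_ids)] == target_ids:
            return hidden[i:i + len(target_ids)].mean(dim=0)
    raise ValueError("target word not found in sentence")
```

A word-sense-disambiguation setup along the lines the abstract describes could then compare such contextual vectors for the same lemma across sentences, whereas a static embedding assigns every occurrence a single vector.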