Paper Title
Lessons learned from the evaluation of Spanish Language Models
Paper Authors
Paper Abstract
Given the impact of language models on the field of Natural Language Processing, a number of Spanish encoder-only masked language models (aka BERTs) have been trained and released. These models were developed either within large projects using very large private corpora or by means of smaller-scale academic efforts leveraging freely available data. In this paper we present a comprehensive head-to-head comparison of language models for Spanish with the following results: (i) previously ignored multilingual models from large companies fare better than monolingual models, substantially changing the evaluation landscape of language models in Spanish; (ii) results across the monolingual models are not conclusive, with supposedly smaller and inferior models performing competitively. Based on these empirical results, we argue for the need for more research to understand the factors underlying them. In this sense, the effects of corpus size, corpus quality, and pre-training techniques need to be further investigated in order to obtain Spanish monolingual models significantly better than the multilingual ones released by large private companies, especially in the face of rapid ongoing progress in the field. The recent activity in the development of language technology for Spanish is to be welcomed, but our results show that building language models remains an open, resource-heavy problem which requires marrying resources (monetary and/or computational) with the best research expertise and practice.
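To make the kind of head-to-head comparison the abstract describes concrete, the following is a minimal sketch of how a monolingual Spanish BERT and a multilingual BERT can be queried side by side on the same masked-LM input using the HuggingFace Transformers library. The model identifiers are illustrative, publicly available Hub checkpoints (BETO and mBERT), not necessarily the exact models or evaluation protocol used in the paper, which relied on fine-tuning over downstream tasks rather than fill-mask probing.

```python
# Hedged sketch: side-by-side masked-LM probe of a monolingual Spanish
# model vs. a multilingual one. Model names are illustrative HuggingFace
# Hub identifiers, not the paper's exact experimental setup.
from transformers import pipeline

MODELS = {
    "monolingual (BETO)": "dccuchile/bert-base-spanish-wwm-cased",
    "multilingual (mBERT)": "bert-base-multilingual-cased",
}

for label, name in MODELS.items():
    fill = pipeline("fill-mask", model=name)
    # Use each tokenizer's own mask token so the prompt is valid per model.
    masked = f"La capital de España es {fill.tokenizer.mask_token}."
    top = fill(masked)[0]  # highest-scoring completion
    print(f"{label}: {top['token_str']} (score={top['score']:.3f})")
```

A proper comparison, as in the paper, would instead fine-tune each checkpoint on a suite of Spanish downstream tasks and compare held-out metrics; the probe above only illustrates loading and querying the two families of models under identical input.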