Paper Title
Playing with Words at the National Library of Sweden -- Making a Swedish BERT
Paper Authors
Abstract
This paper introduces the Swedish BERT ("KB-BERT") developed by KBLab for data-driven research at the National Library of Sweden (KB). Building on recent efforts to create transformer-based BERT models for languages other than English, we explain how we used KB's collections to create and train a new language-specific BERT model for Swedish. We also present the results of our model in comparison with existing models, chiefly that produced by the Swedish Public Employment Service, Arbetsförmedlingen, and Google's multilingual M-BERT, and we demonstrate that KB-BERT outperforms these in a range of NLP tasks from named entity recognition (NER) to part-of-speech tagging (POS). Our discussion highlights the difficulties that continue to exist given the lack of training data and testbeds for smaller languages like Swedish. We release our model for further exploration and research here: https://github.com/Kungbib/swedish-bert-models.