Paper Title
Language-Independent Tokenisation Rivals Language-Specific Tokenisation for Word Similarity Prediction
Paper Authors
Paper Abstract
Language-independent tokenisation (LIT) methods that do not require labelled language resources or lexicons have recently gained popularity because of their applicability in resource-poor languages. Moreover, they compactly represent a language using a fixed-size vocabulary and can efficiently handle unseen or rare words. On the other hand, language-specific tokenisation (LST) methods have a long and established history, and are developed using carefully created lexicons and training resources. Unlike the subtokens produced by LIT methods, LST methods produce valid morphological subwords. Despite the contrasting trade-offs between LIT and LST methods, their performance on downstream NLP tasks remains unclear. In this paper, we empirically compare the two approaches using semantic similarity measurement as an evaluation task across a diverse set of languages. Our experimental results covering eight languages show that LST consistently outperforms LIT when the vocabulary size is large, but LIT can produce comparable or better results than LST in many languages with comparatively smaller (i.e. less than 100K words) vocabulary sizes, encouraging the use of LIT when language-specific resources are unavailable or incomplete, or when a smaller model is required. Moreover, we find smoothed inverse frequency (SIF) to be an accurate method for creating word embeddings from subword embeddings for multilingual semantic similarity prediction tasks. Further analysis of the nearest neighbours of tokens shows that semantically and syntactically related tokens are closely embedded in subword embedding spaces.
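Since the abstract highlights SIF as the method for composing word embeddings from subword embeddings, the following minimal sketch illustrates one way such SIF weighting could be implemented. The function names, the smoothing constant a = 1e-3, and the way subword frequencies are supplied are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def sif_word_embedding(subword_vectors, subword_freqs, a=1e-3):
    """Combine subword embeddings into one word embedding using smoothed
    inverse frequency (SIF) weighting: each subword s is weighted by
    a / (a + p(s)), where p(s) is its relative corpus frequency.
    (Inputs and the constant a are assumptions for illustration.)"""
    weights = np.array([a / (a + p) for p in subword_freqs])
    vectors = np.stack(subword_vectors)
    # Weighted average of the subword vectors.
    return (weights[:, None] * vectors).sum(axis=0) / weights.sum()

def remove_first_principal_component(word_matrix):
    """Common SIF post-processing step: subtract each embedding's projection
    onto the first singular vector of the word-embedding matrix (rows = words)."""
    _, _, vt = np.linalg.svd(word_matrix, full_matrices=False)
    u = vt[0]  # first principal direction of the embeddings
    return word_matrix - np.outer(word_matrix @ u, u)

# Hypothetical usage: a word split into three subwords, e.g. "un", "happi", "ness".
rng = np.random.default_rng(0)
subword_vectors = [rng.normal(size=300) for _ in range(3)]  # stand-in subword embeddings
subword_freqs = [0.02, 0.001, 0.015]                        # illustrative relative frequencies
word_vector = sif_word_embedding(subword_vectors, subword_freqs)
```

Under this weighting, rarer subwords dominate the average, matching the intuition that frequent pieces such as common affixes carry less word-specific meaning.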