Paper Title
Analysis and Evaluation of Language Models for Word Sense Disambiguation
Paper Authors
Paper Abstract
Transformer-based language models have taken many fields in NLP by storm. BERT and its derivatives dominate most of the existing evaluation benchmarks, including those for Word Sense Disambiguation (WSD), thanks to their ability to capture context-sensitive semantic nuances. However, little is still known about their capabilities and potential limitations in encoding and recovering word senses. In this article, we provide an in-depth quantitative and qualitative analysis of the celebrated BERT model with respect to lexical ambiguity. One of the main conclusions of our analysis is that BERT can accurately capture high-level sense distinctions, even when a limited number of examples is available for each word sense. Our analysis also reveals that in some cases language models come close to solving coarse-grained noun disambiguation under ideal conditions in terms of availability of training data and computing resources. However, this scenario rarely occurs in real-world settings and, hence, many practical challenges remain even in the coarse-grained setting. We also perform an in-depth comparison of the two main language-model-based WSD strategies, i.e., fine-tuning and feature extraction, finding that the latter approach is more robust with respect to sense bias and can better exploit limited available training data. In fact, the simple feature-extraction strategy of averaging contextualized embeddings proves robust even when using only three training sentences per word sense, with minimal improvements obtained by increasing the size of this training data.
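To make the feature-extraction strategy concrete, below is a minimal sketch of the approach the abstract describes: averaging contextualized embeddings of a target word over a handful of training sentences per sense, then disambiguating new occurrences by nearest-neighbor similarity to those sense embeddings. This is an illustrative approximation, not the paper's exact implementation; the choice of bert-base-uncased, the sense labels, and the example sentences are all assumptions made for the sketch.

```python
# Sketch of feature-extraction WSD: build one embedding per sense by averaging
# BERT contextualized embeddings of the target word, then classify by 1-NN.
# Model choice, sense labels, and sentences are hypothetical examples.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def target_embedding(sentence: str, target: str) -> torch.Tensor:
    """Mean of the final-layer hidden states over the target word's subtokens."""
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]  # (seq_len, dim)
    target_ids = tokenizer(target, add_special_tokens=False)["input_ids"]
    tokens = enc["input_ids"][0].tolist()
    # Locate the target's subtoken span in the sentence (first occurrence).
    for i in range(len(tokens) - len(target_ids) + 1):
        if tokens[i : i + len(target_ids)] == target_ids:
            return hidden[i : i + len(target_ids)].mean(dim=0)
    raise ValueError(f"'{target}' not found in: {sentence}")

# Hypothetical training data: three sentences per sense, mirroring the
# few-shot setting evaluated in the paper.
train = {
    "bank%finance": [
        "She deposited the check at the bank.",
        "The bank approved his loan application.",
        "Interest rates at the bank went up.",
    ],
    "bank%river": [
        "They had a picnic on the bank of the river.",
        "The boat drifted toward the muddy bank.",
        "Reeds grew thickly along the bank.",
    ],
}

# One sense embedding per sense: the average of target-word embeddings.
sense_vecs = {
    sense: torch.stack([target_embedding(s, "bank") for s in sents]).mean(dim=0)
    for sense, sents in train.items()
}

def disambiguate(sentence: str, target: str) -> str:
    """Pick the sense whose embedding is most cosine-similar to the context."""
    v = target_embedding(sentence, target)
    return max(
        sense_vecs,
        key=lambda s: torch.cosine_similarity(v, sense_vecs[s], dim=0).item(),
    )

print(disambiguate("He opened a savings account at the bank.", "bank"))
# expected: bank%finance
```

Because the sense embeddings are fixed averages rather than learned parameters, this setup requires no gradient updates, which is one reason the abstract finds it more robust than fine-tuning when training data is scarce or sense-skewed.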