Paper Title
Probing for Constituency Structure in Neural Language Models
Paper Authors
Paper Abstract
In this paper, we investigate to what extent contextual neural language models (LMs) implicitly learn syntactic structure. More concretely, we focus on constituent structure as represented in the Penn Treebank (PTB). Using standard probing techniques based on diagnostic classifiers, we assess the accuracy of representing constituents of different categories within the neuron activations of an LM such as RoBERTa. In order to make sure that our probe focuses on syntactic knowledge and not on implicit semantic generalizations, we also experiment on a PTB version that is obtained by randomly replacing constituents with each other while keeping the syntactic structure, i.e., a semantically ill-formed but syntactically well-formed version of the PTB. We find that four pretrained transformer LMs obtain high performance on our probing tasks even on the manipulated data, suggesting that semantic and syntactic knowledge in their representations can be separated and that constituency information is in fact learned by the LM. Moreover, we show that a complete constituency tree can be linearly separated from LM representations.
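For illustration, the sketch below shows the general shape of a diagnostic-classifier probe as described in the abstract: contextual word vectors are extracted from a pretrained RoBERTa model and a linear classifier is trained to predict a constituent-category label per word. This is a minimal sketch assuming HuggingFace transformers and scikit-learn; the toy sentences, labels, and the roberta-base checkpoint are placeholders, not the paper's actual data or experimental setup.

```python
# Minimal diagnostic-classifier (linear probe) sketch, not the authors' exact setup.
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression

# add_prefix_space=True is required for RoBERTa tokenizers with pre-tokenized input.
tokenizer = AutoTokenizer.from_pretrained("roberta-base", add_prefix_space=True)
model = AutoModel.from_pretrained("roberta-base")
model.eval()

def word_vectors(words):
    """Embed a pre-tokenized sentence; average subword hidden states per word."""
    enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state.squeeze(0)
    vecs = []
    for i in range(len(words)):
        idx = [j for j, w in enumerate(enc.word_ids()) if w == i]
        vecs.append(hidden[idx].mean(dim=0))
    return torch.stack(vecs)

# Toy supervision; in the paper, labels would be derived from PTB constituency trees.
sents = [["The", "cat", "sat", "on", "the", "mat"],
         ["She", "reads", "the", "newspaper"]]
labels = [["NP", "NP", "VP", "PP", "NP", "NP"],
          ["NP", "VP", "NP", "NP"]]

X = torch.cat([word_vectors(s) for s in sents]).numpy()
y = [lab for labs in labels for lab in labs]

# The linear "diagnostic classifier": its accuracy on held-out data is the probe score.
probe = LogisticRegression(max_iter=1000).fit(X, y)
print(probe.predict(word_vectors(["A", "dog", "barked"]).numpy()))
```

In an actual probing study, the probe would be trained on representations from each LM layer and evaluated on held-out PTB (and manipulated-PTB) sentences; the linear classifier is kept deliberately simple so that high accuracy can be attributed to information already present in the LM representations rather than to the probe itself.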