Paper Title

Syntactic Structure Distillation Pretraining For Bidirectional Encoders

Paper Authors

Adhiguna Kuncoro, Lingpeng Kong, Daniel Fried, Dani Yogatama, Laura Rimell, Chris Dyer, Phil Blunsom

Paper Abstract

Textual representation learners trained on large amounts of data have achieved notable success on downstream tasks; intriguingly, they have also performed well on challenging tests of syntactic competence. Given this success, it remains an open question whether scalable learners like BERT can become fully proficient in the syntax of natural language by virtue of data scale alone, or whether they still benefit from more explicit syntactic biases. To answer this question, we introduce a knowledge distillation strategy for injecting syntactic biases into BERT pretraining, by distilling the syntactically informative predictions of a hierarchical---albeit harder to scale---syntactic language model. Since BERT models masked words in bidirectional context, we propose to distill the approximate marginal distribution over words in context from the syntactic LM. Our approach reduces relative error by 2-21% on a diverse set of structured prediction tasks, although we obtain mixed results on the GLUE benchmark. Our findings demonstrate the benefits of syntactic biases, even in representation learners that exploit large amounts of data, and contribute to a better understanding of where syntactic biases are most helpful in benchmarks of natural language understanding.
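As a rough illustration of the kind of objective the abstract describes, the sketch below interpolates a standard masked-LM cross-entropy with a distillation term against a teacher distribution over each masked word. The function name, the `alpha` weight, and the assumption that approximate teacher marginals are precomputed are illustrative assumptions, not the authors' implementation; the paper's actual approximation of the syntactic LM's marginals over words in bidirectional context is more involved.

```python
# Minimal sketch (not the authors' code): interpolating the usual masked-LM
# loss with a distillation term against a teacher distribution over the
# vocabulary at each masked position. `alpha` and all names are illustrative.
import torch
import torch.nn.functional as F

def distillation_mlm_loss(student_logits, gold_ids, teacher_probs, alpha=0.5):
    """
    student_logits: [num_masked, vocab_size] -- student (BERT-style) predictions
                    at the masked positions.
    gold_ids:       [num_masked]             -- original (masked-out) token ids.
    teacher_probs:  [num_masked, vocab_size] -- approximate distribution over
                    each masked word in its context, taken from the teacher LM.
    alpha:          interpolation weight between the distillation term and the
                    one-hot MLM cross-entropy.
    """
    log_probs = F.log_softmax(student_logits, dim=-1)

    # Distillation term: cross-entropy between the teacher's soft distribution
    # and the student's prediction (KL divergence up to the teacher's entropy).
    kd_loss = -(teacher_probs * log_probs).sum(dim=-1).mean()

    # Standard masked-LM term against the gold (one-hot) targets.
    mlm_loss = F.nll_loss(log_probs, gold_ids)

    return alpha * kd_loss + (1.0 - alpha) * mlm_loss
```

In use, the teacher distributions would have to be produced (or approximated) ahead of time from the hierarchical syntactic LM, since such a left-to-right model does not natively condition on right context the way the bidirectional student does.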
