Paper Title

Learning Contextual Representations for Semantic Parsing with Generation-Augmented Pre-Training

Paper Authors

Peng Shi, Patrick Ng, Zhiguo Wang, Henghui Zhu, Alexander Hanbo Li, Jun Wang, Cicero Nogueira dos Santos, Bing Xiang

Paper Abstract

Most recently, there has been significant interest in learning contextual representations for various NLP tasks, by leveraging large scale text corpora to train large neural language models with self-supervised learning objectives, such as Masked Language Model (MLM). However, based on a pilot study, we observe three issues of existing general-purpose language models when they are applied to text-to-SQL semantic parsers: fail to detect column mentions in the utterances, fail to infer column mentions from cell values, and fail to compose complex SQL queries. To mitigate these issues, we present a model pre-training framework, Generation-Augmented Pre-training (GAP), that jointly learns representations of natural language utterances and table schemas by leveraging generation models to generate pre-train data. GAP MODEL is trained on 2M utterance-schema pairs and 30K utterance-schema-SQL triples, whose utterances are produced by generative models. Based on experimental results, neural semantic parsers that leverage GAP MODEL as a representation encoder obtain new state-of-the-art results on both SPIDER and CRITERIA-TO-SQL benchmarks.
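
The abstract describes jointly encoding a natural-language utterance and a table schema so that a downstream text-to-SQL parser can align question tokens with column names. The snippet below is a minimal sketch of what such joint encoding might look like, assuming a RoBERTa-style encoder loaded through the HuggingFace transformers library and an illustrative column-name serialization; it is not the authors' implementation of GAP.

```python
# Minimal sketch (not the authors' code): jointly encode an utterance and a
# table schema with a pre-trained transformer encoder, as a text-to-SQL
# parser built on a GAP-style representation encoder might do.
# The model name and serialization format are illustrative assumptions.
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("roberta-base")  # assumed stand-in encoder
encoder = AutoModel.from_pretrained("roberta-base")

utterance = "How many singers are older than 30?"
schema_columns = ["singer.name", "singer.age", "singer.country"]

# Serialize the schema and pair it with the utterance in one input sequence,
# so the encoder produces contextual representations over both.
schema_text = " ".join(col.replace(".", " ") for col in schema_columns)
inputs = tokenizer(utterance, schema_text, return_tensors="pt", truncation=True)

outputs = encoder(**inputs)
# Contextual states for every utterance and schema token; a downstream
# semantic parser (e.g., a grammar-based SQL decoder) would consume these.
token_states = outputs.last_hidden_state
print(token_states.shape)  # (1, sequence_length, hidden_size)
```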
