Paper Title

CorpusBrain: Pre-train a Generative Retrieval Model for Knowledge-Intensive Language Tasks

Paper Authors

Jiangui Chen, Ruqing Zhang, Jiafeng Guo, Yiqun Liu, Yixing Fan, Xueqi Cheng

Paper Abstract

Knowledge-intensive language tasks (KILT) usually require a large body of information to provide correct answers. A popular paradigm for solving this problem is to combine a search system with a machine reader, where the former retrieves supporting evidence and the latter examines it to produce answers. Recently, the reader component has witnessed significant advances with the help of large-scale pre-trained generative models. Meanwhile, most existing solutions for the search component rely on the traditional "index-retrieve-then-rank" pipeline, which suffers from a large memory footprint and difficulty in end-to-end optimization. Inspired by recent efforts in constructing model-based IR models, we propose to replace the traditional multi-step search pipeline with a novel single-step generative model, which can dramatically simplify the search process and be optimized in an end-to-end manner. We show that a strong generative retrieval model can be learned with a set of adequately designed pre-training tasks, and be adopted to improve a variety of downstream KILT tasks with further fine-tuning. We name the pre-trained generative retrieval model CorpusBrain, as all information about the corpus is encoded in its parameters without the need to construct an additional index. Empirical results show that CorpusBrain can significantly outperform strong baselines on the retrieval task of the KILT benchmark and establish new state-of-the-art downstream performance. We also show that CorpusBrain works well under zero- and low-resource settings.
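To make the single-step generative retrieval idea concrete, the following is a minimal sketch, not the authors' released code: a seq2seq model maps a query directly to a document identifier (e.g. a Wikipedia page title), so no separate index is consulted at retrieval time. The checkpoint name (facebook/bart-base), the example query, and the decoding settings are illustrative assumptions; an untuned checkpoint will not emit meaningful titles until it is pre-trained/fine-tuned on query-to-title pairs as the paper describes.

# Sketch of single-step generative retrieval with a Hugging Face seq2seq model.
# Assumption: facebook/bart-base stands in for the actual CorpusBrain weights.
from transformers import BartForConditionalGeneration, BartTokenizer

model_name = "facebook/bart-base"  # illustrative checkpoint, not CorpusBrain
tokenizer = BartTokenizer.from_pretrained(model_name)
model = BartForConditionalGeneration.from_pretrained(model_name)

query = "Who wrote the novel Dune?"  # hypothetical example query
inputs = tokenizer(query, return_tensors="pt")

# Beam search decodes a candidate document identifier (e.g. a page title).
# CorpusBrain-style systems additionally constrain decoding to valid titles
# in the corpus, which is omitted here for brevity.
outputs = model.generate(**inputs, num_beams=5, max_length=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))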
