Paper Title
Autoregressive Search Engines: Generating Substrings as Document Identifiers
Paper Authors
Abstract
Knowledge-intensive language tasks require NLP systems to both provide the correct answer and retrieve supporting evidence for it in a given corpus. Autoregressive language models are emerging as the de-facto standard for generating answers, with newer and more powerful systems emerging at an astonishing pace. In this paper we argue that all this (and future) progress can be directly applied to the retrieval problem with minimal intervention to the models' architecture. Previous work has explored ways to partition the search space into hierarchical structures and retrieve documents by autoregressively generating their unique identifier. In this work we propose an alternative that doesn't force any structure in the search space: using all ngrams in a passage as its possible identifiers. This setup allows us to use an autoregressive model to generate and score distinctive ngrams, that are then mapped to full passages through an efficient data structure. Empirically, we show this not only outperforms prior autoregressive approaches but also leads to an average improvement of at least 10 points over more established retrieval solutions for passage-level retrieval on the KILT benchmark, establishing new state-of-the-art downstream performance on some datasets, while using a considerably lighter memory footprint than competing systems. Code and pre-trained models at https://github.com/facebookresearch/SEAL.
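To illustrate the core idea of using ngrams as identifiers, the following is a minimal sketch, not the paper's actual implementation: SEAL relies on an FM-index over the corpus and constrained decoding, whereas here a plain Python dictionary stands in for the "efficient data structure," and the function names (`build_ngram_index`, `retrieve`) are hypothetical. It shows how ngrams scored by an autoregressive model could be mapped back to the passages that contain them.

```python
from collections import defaultdict

def build_ngram_index(passages, n=3):
    """Index every word-level n-gram of each passage (toy stand-in
    for the FM-index used in the paper)."""
    index = defaultdict(set)
    for pid, text in enumerate(passages):
        words = text.split()
        for i in range(len(words) - n + 1):
            index[" ".join(words[i:i + n])].add(pid)
    return index

def retrieve(index, scored_ngrams):
    """Aggregate the model's n-gram scores onto the passages
    containing those n-grams, and rank passages by total score."""
    scores = defaultdict(float)
    for ngram, score in scored_ngrams:
        for pid in index.get(ngram, ()):
            scores[pid] += score
    return sorted(scores.items(), key=lambda kv: -kv[1])

# Toy usage: two passages, two n-grams scored by a (hypothetical) LM.
passages = ["the quick brown fox jumps", "a quick brown dog runs"]
idx = build_ngram_index(passages)
ranked = retrieve(idx, [("quick brown fox", 2.0), ("brown dog runs", 1.0)])
```

In the real system the substring index supports scoring any corpus substring and constrains generation to valid n-grams; this sketch only captures the final mapping-and-aggregation step.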