Paper Title


Progressively Pretrained Dense Corpus Index for Open-Domain Question Answering

Authors

Wenhan Xiong, Hong Wang, William Yang Wang

Abstract


To extract answers from a large corpus, open-domain question answering (QA) systems usually rely on information retrieval (IR) techniques to narrow the search space. Standard inverted index methods such as TF-IDF are commonly used thanks to their efficiency. However, their retrieval performance is limited as they simply use shallow and sparse lexical features. To break the IR bottleneck, recent studies show that stronger retrieval performance can be achieved by pretraining an effective paragraph encoder that indexes paragraphs into dense vectors. Once trained, the corpus can be pre-encoded into low-dimensional vectors and stored within an index structure where the retrieval can be efficiently implemented as maximum inner product search. Despite the promising results, pretraining such a dense index is expensive and often requires a very large batch size. In this work, we propose a simple and resource-efficient method to pretrain the paragraph encoder. First, instead of using heuristically created pseudo question-paragraph pairs for pretraining, we utilize an existing pretrained sequence-to-sequence model to build a strong question generator that creates high-quality pretraining data. Second, we propose a progressive pretraining algorithm to ensure the existence of effective negative samples in each batch. Across three datasets, our method outperforms an existing dense retrieval method that uses 7 times more computational resources for pretraining.
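The abstract describes the core retrieval mechanism: paragraphs are pre-encoded into low-dimensional vectors once, and answering a new question reduces to maximum inner product search (MIPS) over that dense index. The following minimal sketch (not the authors' code) illustrates this pipeline; the `encode_paragraph` and `encode_question` functions are hypothetical placeholders standing in for the trained dense encoders, and the random projections they produce are used only to keep the example self-contained and runnable.

```python
import numpy as np

# Hypothetical placeholder encoders: in the paper these would be the
# pretrained paragraph/question encoders producing dense vectors.
def encode_paragraph(text: str, dim: int = 128) -> np.ndarray:
    rng = np.random.default_rng(abs(hash(text)) % (2**32))  # deterministic toy "encoding"
    return rng.standard_normal(dim).astype(np.float32)

def encode_question(text: str, dim: int = 128) -> np.ndarray:
    return encode_paragraph(text, dim)  # placeholder: shared encoder for questions

corpus = [
    "Paris is the capital and most populous city of France.",
    "The mitochondrion is the powerhouse of the cell.",
    "TF-IDF is a sparse lexical weighting scheme used in information retrieval.",
]

# Pre-encode the corpus once; the stacked matrix acts as the dense index.
index = np.stack([encode_paragraph(p) for p in corpus])  # shape: (num_paragraphs, dim)

def retrieve(question: str, k: int = 2):
    q = encode_question(question)
    scores = index @ q                 # inner product between the question and every paragraph
    top = np.argsort(-scores)[:k]      # MIPS: keep the k paragraphs with the largest inner products
    return [(corpus[i], float(scores[i])) for i in top]

print(retrieve("What is the capital of France?"))
```

In a realistic setting the placeholder encoders would be replaced by the trained dense encoders, and the exhaustive matrix product would typically be swapped for an approximate MIPS index once the corpus grows to millions of paragraphs.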
