Paper Title

Pre-training Tasks for Embedding-based Large-scale Retrieval

Paper Authors

Wei-Cheng Chang, Felix X. Yu, Yin-Wen Chang, Yiming Yang, Sanjiv Kumar

Paper Abstract

We consider the large-scale query-document retrieval problem: given a query (e.g., a question), return the set of relevant documents (e.g., paragraphs containing the answer) from a large document corpus. This problem is often solved in two steps. The retrieval phase first reduces the solution space, returning a subset of candidate documents. The scoring phase then re-ranks the documents. Critically, the retrieval algorithm not only needs high recall but also has to be highly efficient, returning candidates in time sublinear in the number of documents. Unlike the scoring phase, which has recently seen significant advances thanks to BERT-style pre-training tasks on cross-attention models, the retrieval phase remains less well studied. Most previous works rely on classic Information Retrieval (IR) methods such as BM-25 (token matching + TF-IDF weights). These models accept only sparse handcrafted features and cannot be optimized for the different downstream tasks of interest. In this paper, we conduct a comprehensive study of embedding-based retrieval models. We show that the key ingredient for learning a strong embedding-based Transformer model is the set of pre-training tasks. With properly designed paragraph-level pre-training tasks, Transformer models can remarkably improve over the widely used BM-25 as well as over embedding models without Transformers. The paragraph-level pre-training tasks we study are the Inverse Cloze Task (ICT), Body First Selection (BFS), Wiki Link Prediction (WLP), and the combination of all three.
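Below is a minimal Python sketch (not the authors' released code) of how the three paragraph-level pre-training tasks can be instantiated as (query, document) pair generators over a Wikipedia-like corpus. The Article structure, its field names, and the corpus dictionary are hypothetical stand-ins for a parsed Wikipedia dump.

```python
# Hypothetical data model: an article split into passages (lists of sentences),
# plus the titles of other articles connected to it by hyperlinks.
import random
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class Article:
    title: str
    passages: List[List[str]]          # each passage = a list of sentences
    linked_titles: List[str] = field(default_factory=list)

def ict_pair(article: Article) -> Tuple[str, str]:
    """Inverse Cloze Task (ICT): a random sentence of a passage is the
    pseudo-query; the remaining sentences of that passage are the document."""
    passage = random.choice(article.passages)
    i = random.randrange(len(passage))
    query = passage[i]
    document = " ".join(passage[:i] + passage[i + 1:])
    return query, document

def bfs_pair(article: Article) -> Tuple[str, str]:
    """Body First Selection (BFS): a sentence from the first passage is the
    query; a random other passage of the same article is the document.
    Assumes the article has at least two passages."""
    query = random.choice(article.passages[0])
    document = " ".join(random.choice(article.passages[1:]))
    return query, document

def wlp_pair(article: Article, corpus: Dict[str, Article]) -> Tuple[str, str]:
    """Wiki Link Prediction (WLP): a sentence from the first passage is the
    query; the document is a passage from an article connected by a hyperlink.
    Assumes `linked_titles` is non-empty and all titles appear in `corpus`."""
    query = random.choice(article.passages[0])
    linked = corpus[random.choice(article.linked_titles)]
    document = " ".join(random.choice(linked.passages))
    return query, document
```

In the paper, pairs of this kind are used to pre-train a two-tower Transformer encoder (one tower for queries, one for documents) with a sampled-softmax loss over in-batch negatives; document embeddings are then pre-computed offline, so serving reduces to (approximate) maximum inner product search, which is what makes sublinear-time retrieval possible.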
