Paper title
DynamicRetriever: A Pre-training Model-based IR System with Neither Sparse nor Dense Index
Paper authors
Abstract
Web search provides a promising way for people to obtain information and has been extensively studied. With the surge of deep learning and large-scale pre-training techniques, various neural information retrieval models have been proposed, and they have demonstrated the ability to improve search quality, especially ranking quality. All these existing search methods follow a common paradigm, index-retrieve-rerank: they first build an index of all documents based on document terms (a sparse inverted index) or representation vectors (a dense vector index), then retrieve candidate documents and rerank them with ranking models according to the similarity between the query and each document. In this paper, we explore a new paradigm of information retrieval that uses neither a sparse nor a dense index, only a model. Specifically, we propose a pre-training model-based IR system called DynamicRetriever. In this system, the training stage embeds the token-level and document-level information of the corpus (especially document identifiers) into the model parameters, and the inference stage directly generates document identifiers for a given query. Compared with existing search methods, the model-based IR system has two advantages: i) it parameterizes the traditional static index with a pre-training model, which converts the document semantic mapping into a dynamic and updatable process; ii) with separate document identifiers, it captures both the term-level and document-level information for each document. Extensive experiments conducted on the public search benchmark MS MARCO verify the effectiveness and potential of our proposed new paradigm for information retrieval.
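To make the contrast between the two paradigms concrete, the following is a minimal sketch and not the DynamicRetriever implementation. It assumes a hypothetical three-document corpus and swaps the pre-trained language model for a tiny softmax classifier over document identifiers; all names (CORPUS, build_inverted_index, train_docid_model, predict_docid) are illustrative. The first half retrieves through an explicit sparse inverted index, while the second half keeps no index at all: document information lives only in trained weights, and inference maps a query directly to a document identifier.

```python
import math
from collections import defaultdict

# Hypothetical toy corpus (docid -> text); not taken from the paper.
CORPUS = {
    "d0": "neural ranking models for web search",
    "d1": "inverted index compression techniques",
    "d2": "dense vector retrieval with transformers",
}


def tokenize(text):
    return text.lower().split()


# --- Conventional paradigm: an explicit sparse inverted index ---
def build_inverted_index(corpus):
    index = defaultdict(set)
    for docid, text in corpus.items():
        for term in tokenize(text):
            index[term].add(docid)
    return index


def retrieve_sparse(index, query):
    # Score each document by the number of query terms it contains.
    scores = defaultdict(int)
    for term in tokenize(query):
        for docid in index.get(term, ()):
            scores[docid] += 1
    return sorted(scores, key=lambda d: (-scores[d], d))


# --- Model-only paradigm (toy stand-in for a pre-trained model) ---
def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def logits(weights, terms, n_docs):
    # Sum the per-term contributions to each docid logit.
    out = [0.0] * n_docs
    for t in terms:
        for j, w in enumerate(weights[t]):
            out[j] += w
    return out


def train_docid_model(corpus, epochs=200, lr=0.5):
    # Assumption: the training signal is simply "document terms -> own docid".
    docids = sorted(corpus)
    vocab = sorted({t for text in corpus.values() for t in tokenize(text)})
    weights = {t: [0.0] * len(docids) for t in vocab}
    for _ in range(epochs):
        for target, docid in enumerate(docids):
            terms = tokenize(corpus[docid])
            probs = softmax(logits(weights, terms, len(docids)))
            for j in range(len(docids)):
                # Gradient of cross-entropy with respect to logit j.
                grad = probs[j] - (1.0 if j == target else 0.0)
                for t in terms:
                    weights[t][j] -= lr * grad
    return weights, docids


def predict_docid(weights, docids, query):
    # Inference consults only the model parameters, never an index.
    terms = [t for t in tokenize(query) if t in weights]
    scores = logits(weights, terms, len(docids))
    best = max(range(len(docids)), key=lambda j: scores[j])
    return docids[best]


if __name__ == "__main__":
    index = build_inverted_index(CORPUS)
    print(retrieve_sparse(index, "dense retrieval"))
    weights, docids = train_docid_model(CORPUS)
    print(predict_docid(weights, docids, "inverted index"))
```

In this sketch, updating the corpus means further training of the weights instead of rebuilding an index, which loosely mirrors the "dynamic and updatable" property claimed in the abstract. How the actual system encodes and generates identifiers is not specified here.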