SpaDE: Improving Sparse Representations using a Dual Document Encoder for First-stage Retrieval

Paper Title

SpaDE: Improving Sparse Representations using a Dual Document Encoder for First-stage Retrieval

Authors

Choi, Eunseong; Lee, Sunkyung; Choi, Minjin; Ko, Hyeseon; Song, Young-In; Lee, Jongwuk

Abstract

Sparse document representations have been widely used to retrieve relevant documents via exact lexical matching. Owing to the pre-computed inverted index, it supports fast ad-hoc search but incurs the vocabulary mismatch problem. Although recent neural ranking models using pre-trained language models can address this problem, they usually require expensive query inference costs, implying the trade-off between effectiveness and efficiency. Tackling the trade-off, we propose a novel uni-encoder ranking model, Sparse retriever using a Dual document Encoder (SpaDE), learning document representation via the dual encoder. Each encoder plays a central role in (i) adjusting the importance of terms to improve lexical matching and (ii) expanding additional terms to support semantic matching. Furthermore, our co-training strategy trains the dual encoder effectively and avoids unnecessary intervention in training each other. Experimental results on several benchmarks show that SpaDE outperforms existing uni-encoder ranking models.
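To make the idea in the abstract concrete, here is a minimal toy sketch of a sparse document vector built from two sources (re-weighted terms already in the document, plus expanded terms that are not) and scored through an inverted index with no neural query inference. It is not the authors' implementation: the function names, the `alpha` mixing weight, the convex-combination merge rule, and all term weights are assumptions made up for illustration.

```python
from collections import defaultdict


def combine(weighted, expanded, alpha=0.5):
    """Merge two sparse term->weight maps (hypothetical rule: convex combination)."""
    merged = defaultdict(float)
    for term, w in weighted.items():
        merged[term] += alpha * w
    for term, w in expanded.items():
        merged[term] += (1.0 - alpha) * w
    return dict(merged)


def build_index(doc_vectors):
    """Build an inverted index: term -> list of (doc_id, weight) postings."""
    index = defaultdict(list)
    for doc_id, vec in doc_vectors.items():
        for term, w in vec.items():
            index[term].append((doc_id, w))
    return index


def search(index, query_terms):
    """Score documents by summing the stored weights of matching query terms."""
    scores = defaultdict(float)
    for term in set(query_terms):
        for doc_id, w in index.get(term, []):
            scores[doc_id] += w
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Made-up weights standing in for the outputs of the two document encoders.
doc_vectors = {
    "d1": combine({"cheap": 1.0, "flights": 2.0}, {"airfare": 1.0, "flights": 1.0}),
    "d2": combine({"cheap": 2.0, "hotels": 2.0}, {"lodging": 1.0}),
}
index = build_index(doc_vectors)

# "airfare" never occurs in d1's text, yet d1 matches through its expanded term.
print(search(index, ["airfare"]))
print(search(index, ["cheap", "flights"]))
```

In this sketch the query is treated as a plain bag of words, which is where the efficiency argument comes from: all model cost is paid at indexing time, and the expanded terms are what allow a match when the query and the document text share no words.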
