Paper Title
What Are You Token About? Dense Retrieval as Distributions Over the Vocabulary
Paper Authors
Paper Abstract
Dual encoders are now the dominant architecture for dense retrieval. Yet, we have little understanding of how they represent text, and why this leads to good performance. In this work, we shed light on this question via distributions over the vocabulary. We propose to interpret the vector representations produced by dual encoders by projecting them into the model's vocabulary space. We show that the resulting projections contain rich semantic information, and draw a connection between them and sparse retrieval. We find that this view can offer an explanation for some of the failure cases of dense retrievers. For example, we observe that the inability of models to handle tail entities is correlated with a tendency of the token distributions to forget some of the tokens of those entities. We leverage this insight and propose a simple way to enrich query and passage representations with lexical information at inference time, and show that this significantly improves performance compared to the original model in zero-shot settings, particularly on the BEIR benchmark.
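The projection idea described in the abstract can be illustrated with a short sketch. The snippet below is a minimal illustration, not the authors' released code: it assumes a BERT-style dual encoder where one natural choice of projection is the model's own masked-language-modeling (MLM) head, and it maps the [CLS] vector of an encoded query to a distribution over the vocabulary. The checkpoint `bert-base-uncased` and the example query are stand-ins; in practice the trained retriever's encoder would be used.

```python
# Minimal sketch: project a dense query representation into the
# vocabulary space via a BERT-style model's MLM head.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_name = "bert-base-uncased"  # stand-in; use the retriever's own checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)
model.eval()

query = "who wrote the declaration of independence"  # illustrative query
inputs = tokenizer(query, return_tensors="pt")

with torch.no_grad():
    # Hidden state of the [CLS] token, a common choice of dense representation.
    hidden = model.bert(**inputs).last_hidden_state[:, 0]  # (1, hidden_dim)
    # Project the vector through the MLM head to get vocabulary logits,
    # then normalize into a distribution over the vocabulary.
    logits = model.cls(hidden)  # (1, vocab_size)
    probs = torch.softmax(logits, dim=-1)

# Inspect the highest-probability tokens; per the paper, such
# distributions tend to surface semantically related vocabulary items.
top = torch.topk(probs[0], k=10)
for score, idx in zip(top.values, top.indices):
    print(f"{tokenizer.convert_ids_to_tokens(idx.item())}: {score.item():.4f}")
```

Inspecting which input tokens are missing from the top of this distribution is the kind of diagnostic the abstract alludes to for tail entities, where the projection tends to drop some of the entity's tokens.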