Paper Title

Improving Biomedical Information Retrieval with Neural Retrievers

Authors

Man Luo, Arindam Mitra, Tejas Gokhale, Chitta Baral

Abstract

Information retrieval (IR) is essential in search engines and dialogue systems, as well as in natural language processing tasks such as open-domain question answering. IR serves an important function in the biomedical domain, where the content and sources of scientific knowledge may evolve rapidly. Although neural retrievers have surpassed traditional IR approaches such as TF-IDF and BM25 in standard open-domain question answering tasks, they are still found lacking in the biomedical domain. In this paper, we seek to improve IR using neural retrievers (NR) in the biomedical domain, and pursue this goal with a three-pronged approach. First, to tackle the relative lack of data in the biomedical domain, we propose a template-based question generation method that can be leveraged to train neural retriever models. Second, we develop two novel pre-training tasks that are closely aligned with the downstream task of information retrieval. Third, we introduce the "Poly-DPR" model, which encodes each context into multiple context vectors. Extensive experiments and analysis on the BioASQ challenge suggest that our proposed method leads to large gains over existing neural approaches and beats BM25 in the small-corpus setting. We show that BM25 and our method can complement each other, and that a simple hybrid model leads to further gains in the large-corpus setting.
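The abstract does not say how a question is scored against the multiple context vectors of Poly-DPR, nor how the hybrid model combines BM25 with the neural retriever. The sketch below is only an illustration under two stated assumptions: a context is scored by the maximum inner product between the question vector and any of its context vectors, and the hybrid score is a weighted sum of min-max normalized BM25 and neural scores. All function names and the `weight` parameter are hypothetical and not taken from the paper.

```python
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def multi_vector_score(question_vec: Sequence[float],
                       context_vecs: Sequence[Sequence[float]]) -> float:
    """Score one context that is represented by several vectors.

    Assumption (not stated in the abstract): the relevance of the context
    is the best inner product over all of its context vectors.
    """
    return max(dot(question_vec, c) for c in context_vecs)


def min_max(scores: Sequence[float]) -> List[float]:
    """Rescale scores to [0, 1]; constant input maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def hybrid_scores(sparse_scores: Sequence[float],
                  dense_scores: Sequence[float],
                  weight: float = 0.5) -> List[float]:
    """Combine per-document sparse (e.g. BM25) and dense retriever scores.

    Assumption: a weighted sum of min-max normalized scores; `weight`
    is the share given to the sparse scores.
    """
    sparse_norm = min_max(sparse_scores)
    dense_norm = min_max(dense_scores)
    return [weight * s + (1.0 - weight) * d
            for s, d in zip(sparse_norm, dense_norm)]
```

In this sketch, `sparse_scores` and `dense_scores` are expected to be aligned lists over the same candidate documents; normalizing before mixing keeps the two score scales comparable, which is one common design choice among several.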
