Title
Unsupervised Pre-training for Biomedical Question Answering
Authors
Abstract
We explore the suitability of unsupervised representation learning methods on biomedical text -- BioBERT, SciBERT, and BioSentVec -- for biomedical question answering. To further improve unsupervised representations for biomedical QA, we introduce a new pre-training task, built from unlabeled data, that is designed to reason about biomedical entities in context. Our pre-training method consists of corrupting a given context by replacing a randomly chosen mention of a biomedical entity with a random entity mention, and then querying the model with the correct entity mention in order to locate the corrupted part of the context. This de-noising task enables the model to learn good representations from abundant, unlabeled biomedical text that help QA tasks, and, because it requires the model to predict spans, it minimizes the train-test mismatch between the pre-training task and the downstream extractive QA tasks. Our experiments show that pre-training BioBERT on the proposed task significantly boosts performance and outperforms the previous best model from the 7th BioASQ Task 7b-Phase B challenge.
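To make the corruption-and-query procedure concrete, below is a minimal sketch of how one such pre-training example could be constructed. This is an illustration, not the authors' implementation: the function name make_denoising_example, the assumption that entity mention spans are already available (e.g., from a biomedical NER tagger), and the toy entity vocabulary are all hypothetical.

```python
import random

def make_denoising_example(context, mention_spans, entity_vocab, rng=random):
    """Construct one span-prediction pre-training example (illustrative sketch).

    context: original passage (str).
    mention_spans: list of (start, end) character offsets of entity mentions.
    entity_vocab: pool of entity surface forms to sample replacements from.

    Returns (corrupted_context, query_entity, answer_span), where answer_span
    marks the corrupted region the model must locate given query_entity.
    """
    # Pick one entity mention to corrupt; its surface form becomes the query.
    start, end = rng.choice(mention_spans)
    query_entity = context[start:end]

    # Replace the chosen mention with a different, randomly drawn entity.
    replacement = rng.choice([e for e in entity_vocab if e != query_entity])
    corrupted_context = context[:start] + replacement + context[end:]

    # The target span covers the replacement, mirroring extractive QA,
    # where the model predicts start/end positions of an answer span.
    answer_span = (start, start + len(replacement))
    return corrupted_context, query_entity, answer_span


context = "Metformin is a first-line treatment for type 2 diabetes."
mention_spans = [(0, 9), (40, 55)]   # "Metformin", "type 2 diabetes"
entity_vocab = ["aspirin", "insulin", "type 1 diabetes"]

corrupted, query, span = make_denoising_example(
    context, mention_spans, entity_vocab, rng=random.Random(0)
)
print(corrupted)                      # context with one mention swapped out
print(query, "->", corrupted[span[0]:span[1]])
```

In this framing, the model is given the query entity together with the corrupted context and is trained to predict answer_span, so the output format already matches extractive QA and fine-tuning requires no change to the prediction head.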