Query-focused Extractive Summarisation for Biomedical and COVID-19 Complex Question Answering

Paper title

Query-focused Extractive Summarisation for Biomedical and COVID-19 Complex Question Answering

Paper author

Mollá, Diego

Abstract

This paper presents Macquarie University's participation in the two most recent BioASQ Synergy Tasks (as of June 2022), and in BioASQ10 Task B (BioASQ10b), Phase B. In these tasks, participating systems are expected to generate complex answers to biomedical questions, where an answer may contain more than one sentence. We apply query-focused extractive summarisation techniques. In particular, we follow a sentence classification-based approach that scores each candidate sentence associated with a question and returns the $n$ highest-scoring sentences as the answer. The Synergy Task calls for an end-to-end system that performs document selection, snippet selection, and final answer finding, but it has very limited training data. For the Synergy Task, we selected the candidate sentences in two phases, document retrieval and snippet retrieval, and found the final answer using a DistilBERT/ALBERT classifier trained on the BioASQ9b training data. Document retrieval was a standard search over the CORD-19 data using the search API provided by the BioASQ organisers. Snippet retrieval re-ranked the sentences of the top retrieved documents by the cosine similarity between the question and each candidate sentence. We observed that vectors represented via sBERT have an edge over tf.idf. BioASQ10b Phase B focuses on finding the specific answers to biomedical questions. For this task, we followed a data-centric approach. We hypothesised that the training data of the first BioASQ years might be biased, and we experimented with different subsets of the training data. We observed an improvement in results when the system was trained on the second half of the BioASQ10b training data.
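The snippet retrieval step described in the abstract (re-ranking candidate sentences by cosine similarity to the question and keeping the $n$ best) can be illustrated with a minimal sketch. This is not the author's implementation: it uses a small hand-rolled tf.idf weighting rather than sBERT vectors, which the abstract reports as performing better, and the function names `tokenize`, `tfidf_vectors`, `cosine`, and `top_n_sentences` are hypothetical, as is the smoothed idf formula.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(texts):
    """Build sparse tf.idf vectors (term -> weight dicts) for a list of texts."""
    tokenized = [tokenize(t) for t in texts]
    n_docs = len(tokenized)
    # Document frequency: number of texts in which each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Assumption: a smoothed idf, so no term receives a zero weight.
    idf = {
        term: math.log((1 + n_docs) / (1 + df)) + 1
        for term, df in doc_freq.items()
    }
    return [
        {term: count * idf[term] for term, count in Counter(tokens).items()}
        for tokens in tokenized
    ]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_n_sentences(question, sentences, n=3):
    """Return the n (score, sentence) pairs most similar to the question."""
    # The question and the candidates share one vocabulary and one idf table.
    vectors = tfidf_vectors([question] + sentences)
    question_vec, sentence_vecs = vectors[0], vectors[1:]
    scored = [
        (cosine(question_vec, vec), sentence)
        for vec, sentence in zip(sentence_vecs, sentences)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:n]


if __name__ == "__main__":
    question = "What are the symptoms of COVID-19?"
    candidates = [
        "Common symptoms of COVID-19 include fever and cough.",
        "The study was funded by a national grant.",
        "Vaccines reduce severe disease.",
    ]
    for score, sentence in top_n_sentences(question, candidates, n=2):
        print(f"{score:.3f}  {sentence}")
```

Swapping tf.idf for sentence embeddings would only change how the vectors are produced; the ranking by cosine similarity and the top-$n$ cut-off would stay the same.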
