Paper Title

Development of an Extractive Clinical Question Answering Dataset with Multi-Answer and Multi-Focus Questions

Paper Authors

Moon, Sungrim, He, Huan, Liu, Hongfang, Fan, Jungwei W.

Paper Abstract

Background: Extractive question-answering (EQA) is a useful natural language processing (NLP) application for answering patient-specific questions by locating answers in their clinical notes. Realistic clinical EQA can have multiple answers to a single question and multiple focus points in one question, which are lacking in the existing datasets for development of artificial intelligence solutions. Objective: Create a dataset for developing and evaluating clinical EQA systems that can handle natural multi-answer and multi-focus questions. Methods: We leveraged the annotated relations from the 2018 National NLP Clinical Challenges (n2c2) corpus to generate an EQA dataset. Specifically, the 1-to-N, M-to-1, and M-to-N drug-reason relations were included to form the multi-answer and multi-focus QA entries, which represent more complex and natural challenges in addition to the basic one-drug-one-reason cases. A baseline solution was developed and tested on the dataset. Results: The derived RxWhyQA dataset contains 96,939 QA entries. Among the answerable questions, 25% require multiple answers, and 2% ask about multiple drugs within one question. There are frequent cues observed around the answers in the text, and 90% of the drug and reason terms occur within the same or an adjacent sentence. The baseline EQA solution achieved a best f1-measure of 0.72 on the entire dataset, and on specific subsets, it was: 0.93 on the unanswerable questions, 0.48 on single-drug questions versus 0.60 on multi-drug questions, 0.54 on the single-answer questions versus 0.43 on multi-answer questions. Discussion: The RxWhyQA dataset can be used to train and evaluate systems that need to handle multi-answer and multi-focus questions. Specifically, multi-answer EQA appears to be challenging and therefore warrants more investment in research.
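
For readers unfamiliar with the extractive QA format, the sketch below illustrates how a single 1-to-N drug-reason relation of the kind described in the Methods could be serialized as a SQuAD-2.0-style entry with multiple gold answers and an unanswerable-question flag. The field names ("qas", "answers", "is_impossible"), the question template, and the helper function are illustrative assumptions modeled on the SQuAD 2.0 convention, not the authors' actual schema.

```python
# A minimal sketch (assumed schema, not the RxWhyQA release format) of turning
# one drug-reason relation into a SQuAD-2.0-style multi-answer EQA entry.

import json

def make_qa_entry(note_text, drug, reason_texts, qa_id):
    """Build one QA entry asking why a drug was prescribed.

    reason_texts: list of reason strings found in note_text;
    an empty list yields an unanswerable question.
    """
    return {
        "context": note_text,
        "qas": [
            {
                "id": qa_id,
                "question": f"Why was the patient prescribed {drug}?",
                "answers": [
                    {"text": t, "answer_start": note_text.index(t)}
                    for t in reason_texts
                ],
                "is_impossible": len(reason_texts) == 0,
            }
        ],
    }

if __name__ == "__main__":
    note = "Started lisinopril for hypertension. Continued metformin."
    entry = make_qa_entry(
        note,
        drug="lisinopril",
        reason_texts=["hypertension"],
        qa_id="example-0001",
    )
    print(json.dumps(entry, indent=2))
```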

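The F1-measures in the Results are presumably token-overlap scores of the kind used to evaluate SQuAD-style extractive QA. The sketch below shows a simplified version of that metric, taking the best F1 over all acceptable gold answers for a question; it is an assumption for illustration, not the paper's exact scoring script.

```python
# A simplified, assumed variant of SQuAD-style token-overlap F1 for extractive QA.

from collections import Counter

def token_f1(prediction, gold):
    """Token-level F1 between one predicted span and one gold span."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

def best_f1(prediction, gold_answers):
    """Score a prediction against multiple acceptable gold answers."""
    if not gold_answers:  # unanswerable question: credit only empty predictions
        return 1.0 if prediction == "" else 0.0
    return max(token_f1(prediction, g) for g in gold_answers)

print(best_f1("for hypertension", ["hypertension"]))  # ~0.67
```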