Paper Title
Surgical-VQA: Visual Question Answering in Surgical Scenes using Transformer
Paper Authors
论文 Paper Abstract
Visual question answering (VQA) in surgery is largely unexplored. Expert surgeons are scarce and often overloaded with clinical and academic workloads. This overload limits the time they can spend answering questions from patients, medical students, or junior residents about surgical procedures. At times, students and junior residents also refrain from asking too many questions during classes to reduce disruption. While computer-aided simulators and recordings of past surgical procedures are available for them to observe and improve their skills, they still rely heavily on medical experts to answer their questions. A Surgical-VQA system serving as a reliable 'second opinion' could act as a backup and ease the load on medical experts in answering these questions. The lack of annotated medical data and the presence of domain-specific terms have limited the exploration of VQA for surgical procedures. In this work, we design a Surgical-VQA task that answers questionnaires on surgical procedures based on the surgical scene. Extending the MICCAI endoscopic vision challenge 2018 dataset and workflow recognition dataset, we introduce two Surgical-VQA datasets with classification-based and sentence-based answers. To perform Surgical-VQA, we employ vision-text transformer models. We further introduce a residual-MLP-based VisualBert encoder model that enforces interaction between visual and text tokens, improving performance on classification-based answering. Furthermore, we study the influence of the number of input image patches and of temporal visual features on model performance in both classification-based and sentence-based answering.
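To make the token-interaction idea concrete, below is a minimal PyTorch sketch of a residual MLP block that mixes a joint visual+text token sequence along the token axis before pooling for answer classification. This is an illustration of the general technique, not the authors' implementation: all class names, layer sizes, the pooling strategy, and the assumption that contextualized joint embeddings come from a VisualBert-style encoder are hypothetical.

```python
# Hypothetical sketch: residual MLP block that enforces interaction across
# the joint visual+text token sequence, followed by a classification head.
# Sizes and pooling are assumptions, not the paper's exact architecture.
import torch
import torch.nn as nn


class ResMLPBlock(nn.Module):
    """Residual MLP block mixing tokens (cross-token) and channels."""

    def __init__(self, num_tokens: int, dim: int, hidden: int = 512):
        super().__init__()
        self.token_norm = nn.LayerNorm(dim)
        # Linear layer applied along the token axis: every output token is a
        # learned combination of all tokens, so visual patch tokens and text
        # tokens directly interact here.
        self.token_mix = nn.Linear(num_tokens, num_tokens)
        self.channel_norm = nn.LayerNorm(dim)
        self.channel_mlp = nn.Sequential(
            nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_tokens, dim); tokens = text tokens + visual patches.
        y = self.token_norm(x).transpose(1, 2)        # (batch, dim, tokens)
        x = x + self.token_mix(y).transpose(1, 2)     # residual token mixing
        x = x + self.channel_mlp(self.channel_norm(x))  # residual channel MLP
        return x


class SurgicalVQAClassifier(nn.Module):
    """Residual-MLP head over joint embeddings for classification answers."""

    def __init__(self, num_tokens: int, num_answers: int, dim: int = 768):
        super().__init__()
        self.res_mlp = ResMLPBlock(num_tokens, dim)
        self.classifier = nn.Linear(dim, num_answers)

    def forward(self, joint_tokens: torch.Tensor) -> torch.Tensor:
        # joint_tokens: contextualized text+visual embeddings from a
        # VisualBert-style encoder, shape (batch, num_tokens, dim).
        h = self.res_mlp(joint_tokens)
        return self.classifier(h.mean(dim=1))  # mean-pool tokens, predict


# Usage with dummy inputs (25 text tokens + 4 image-patch tokens assumed):
model = SurgicalVQAClassifier(num_tokens=29, num_answers=18)
logits = model(torch.randn(2, 29, 768))  # (batch=2, num_answers=18)
```

The token-mixing linear layer is the key piece: unlike a per-token MLP, it forces information exchange between the visual and text positions, which is the stated motivation for the residual-MLP encoder variant in the abstract.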