Paper Title


Rethinking Surgical Captioning: End-to-End Window-Based MLP Transformer Using Patches

Authors

Mengya Xu, Mobarakol Islam, Hongliang Ren

Abstract


Surgical captioning plays an important role in surgical instruction prediction and report generation. However, most captioning models still rely on a computationally heavy object detector or feature extractor to extract regional features. In addition, the detection model requires extra bounding-box annotations, which are costly and need skilled annotators. These factors cause inference delay and limit the deployment of captioning models in real-time robotic surgery. To this end, we design an end-to-end, detector- and feature-extractor-free captioning model by exploiting the patch-based shifted-window technique. We propose the Shifted Window-Based Multi-Layer Perceptrons Transformer Captioning model (SwinMLP-TranCAP), which offers faster inference and less computation. SwinMLP-TranCAP replaces the multi-head attention module with a window-based multi-head MLP. Such designs have primarily been applied to image understanding tasks; very few works investigate caption generation. SwinMLP-TranCAP is also extended into a video version for video captioning tasks using 3D patches and windows. Compared with previous detector-based or feature-extractor-based models, our models greatly simplify the architecture design while maintaining performance on two surgical datasets. The code is publicly available at https://github.com/XuMengyaAmy/SwinMLP_TranCAP.
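The core idea in the abstract — partitioning a feature map into non-overlapping windows and mixing tokens within each window by an MLP instead of self-attention — can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation; the function names, the single mixing head, and the toy sizes are all assumptions for clarity.

```python
import numpy as np

def window_partition(x, w):
    """Split an (H, W, C) feature map into non-overlapping w x w windows.
    Returns an array of shape (num_windows, w*w, C)."""
    H, W, C = x.shape
    x = x.reshape(H // w, w, W // w, w, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, w * w, C)

def window_mlp_mixing(windows, weight, bias):
    """Token-mixing MLP applied within each window: the w*w spatial
    positions are mixed by a shared (w*w, w*w) weight matrix, which is
    the role self-attention would otherwise play inside a window."""
    # windows: (num_windows, w*w, C); mix along the token axis only
    return np.einsum('nwc,vw->nvc', windows, weight) + bias[None, :, None]

# Toy example: a 4x4 map with C=3 channels, window size 2 -> four 2x2 windows
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 4, 3))
win = window_partition(x, 2)           # shape (4, 4, 3): 4 windows of 4 tokens
W_mix = rng.standard_normal((4, 4))    # (w*w, w*w) token-mixing weight
b = np.zeros(4)
out = window_mlp_mixing(win, W_mix, b)
print(win.shape, out.shape)            # (4, 4, 3) (4, 4, 3)
```

In the full model each window would use several such mixing heads (hence "multi-head MLP"), and successive layers would shift the window grid so information flows across window boundaries; the video variant applies the same partitioning with 3D patches and windows.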
