Paper Title

Self-supervised Contrastive Video-Speech Representation Learning for Ultrasound

Authors

Jianbo Jiao, Yifan Cai, Mohammad Alsharid, Lior Drukker, Aris T. Papageorghiou, J. Alison Noble

Abstract

In medical imaging, manual annotations can be expensive to acquire and sometimes infeasible to access, making conventional deep learning-based models difficult to scale. As a result, it would be beneficial if useful representations could be derived from raw data without the need for manual annotations. In this paper, we propose to address the problem of self-supervised representation learning with multi-modal ultrasound video-speech raw data. For this case, we assume that there is a high correlation between the ultrasound video and the corresponding narrative speech audio of the sonographer. In order to learn meaningful representations, the model needs to identify such correlation and at the same time understand the underlying anatomical features. We design a framework to model the correspondence between video and audio without any kind of human annotations. Within this framework, we introduce cross-modal contrastive learning and an affinity-aware self-paced learning scheme to enhance correlation modelling. Experimental evaluations on multi-modal fetal ultrasound video and audio show that the proposed approach is able to learn strong representations and transfers well to downstream tasks of standard plane detection and eye-gaze prediction.
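The abstract does not give implementation details, so as an illustrative aid only, below is a minimal sketch of what a cross-modal contrastive (InfoNCE-style) objective between video and audio embeddings might look like in PyTorch. All names, shapes, and the temperature value are hypothetical assumptions, not the authors' implementation; it simply treats matched video/audio clips in a batch as positives and all other pairings as negatives.

```python
import torch
import torch.nn.functional as F

def cross_modal_contrastive_loss(video_emb, audio_emb, temperature=0.1):
    """InfoNCE-style cross-modal loss (illustrative sketch).

    video_emb, audio_emb: (batch, dim) embeddings from a video encoder
    and an audio encoder. The pair at the same batch index is assumed
    to be a matched (positive) video-speech clip.
    """
    v = F.normalize(video_emb, dim=1)  # unit-norm so dot product = cosine similarity
    a = F.normalize(audio_emb, dim=1)
    logits = v @ a.t() / temperature   # (batch, batch) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)
    # Symmetric objective: video-to-audio and audio-to-video matching.
    loss_v2a = F.cross_entropy(logits, targets)
    loss_a2v = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_v2a + loss_a2v)

if __name__ == "__main__":
    torch.manual_seed(0)
    video = torch.randn(8, 128)  # e.g. pooled output of a video encoder
    audio = torch.randn(8, 128)  # e.g. pooled output of a speech encoder
    print(cross_modal_contrastive_loss(video, audio).item())
```

Note that the paper's full method also includes an affinity-aware self-paced learning scheme, which is not reflected in this sketch.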
