Paper Title
A Comparison of Pre-trained Vision-and-Language Models for Multimodal Representation Learning across Medical Images and Reports
Paper Authors
Paper Abstract
Joint image-text embeddings extracted from medical images and their associated contextual reports are the bedrock of most biomedical vision-and-language (V+L) tasks, including medical visual question answering, clinical image-text retrieval, and clinical report auto-generation. In this study, we adopt four pre-trained V+L models (LXMERT, VisualBERT, UNITER, and PixelBERT) to learn multimodal representations from MIMIC-CXR radiographs and associated reports. The extrinsic evaluation on the OpenI dataset shows that, in comparison to the pioneering CNN-RNN model, the joint embeddings learned by the pre-trained V+L models yield performance improvements on the thoracic findings classification task. We conduct an ablation study to analyze the contribution of certain model components and to validate the advantage of joint embeddings over text-only embeddings. We also visualize attention maps to illustrate the attention mechanism of the V+L models.
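To make the described pipeline concrete, below is a minimal sketch (not the authors' code) of how a joint image-text embedding from a pre-trained V+L model can feed a multi-label classifier for thoracic findings. The VisualBERT checkpoint name, the 36 region features of dimension 2048 (stand-ins for detector outputs), and the 14-label head are illustrative assumptions, not details taken from the paper.

```python
# Sketch: joint image-text embedding from a pre-trained V+L model
# plus a multi-label head for thoracic findings classification.
import torch
from transformers import BertTokenizer, VisualBertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoder = VisualBertModel.from_pretrained("uclanlp/visualbert-vqa-coco-pre")  # assumed checkpoint

report = "Heart size is normal. No focal consolidation or pleural effusion."
text = tokenizer(report, return_tensors="pt", truncation=True)

# Placeholder for region features from an object detector (e.g. Faster R-CNN);
# shape: (batch, num_regions, visual_embedding_dim).
visual_embeds = torch.randn(1, 36, 2048)
visual_mask = torch.ones(visual_embeds.shape[:-1], dtype=torch.long)

outputs = encoder(
    input_ids=text["input_ids"],
    attention_mask=text["attention_mask"],
    visual_embeds=visual_embeds,
    visual_attention_mask=visual_mask,
)
joint_embedding = outputs.pooler_output  # (batch, hidden_size) joint representation

# Multi-label classification head over thoracic findings (14 labels assumed).
classifier = torch.nn.Linear(encoder.config.hidden_size, 14)
logits = classifier(joint_embedding)
loss = torch.nn.BCEWithLogitsLoss()(logits, torch.zeros(1, 14))  # dummy targets
```

In such a setup, the pooled joint embedding can either be frozen (extrinsic probing, as in the classification evaluation described above) or fine-tuned end-to-end together with the classification head.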