Paper Title

TransFuse: A Unified Transformer-based Image Fusion Framework using Self-supervised Learning

Paper Authors

Linhao Qu, Shaolei Liu, Manning Wang, Shiman Li, Siqi Yin, Qin Qiao, Zhijian Song

Paper Abstract

Image fusion is a technique that integrates complementary information from multiple source images to improve the richness of a single image. Due to insufficient task-specific training data and corresponding ground truth, most existing end-to-end image fusion methods easily fall into overfitting or tedious parameter-optimization processes. Two-stage methods avoid the need for large amounts of task-specific training data by training an encoder-decoder network on large natural-image datasets and using the extracted features for fusion, but the domain gap between natural images and the different fusion tasks results in limited performance. In this study, we design a novel encoder-decoder based image fusion framework and propose a destruction-reconstruction based self-supervised training scheme to encourage the network to learn task-specific features. Specifically, we propose three destruction-reconstruction self-supervised auxiliary tasks for multi-modal image fusion, multi-exposure image fusion, and multi-focus image fusion, based on pixel-intensity non-linear transformation, brightness transformation, and noise transformation, respectively. To encourage the different fusion tasks to promote each other and to increase the generalizability of the trained network, we integrate the three self-supervised auxiliary tasks by randomly choosing one of them to destroy a natural image during model training. In addition, we design a new encoder that combines a CNN and a Transformer for feature extraction, so that the trained model can exploit both local and global information. Extensive experiments on multi-modal, multi-exposure, and multi-focus image fusion tasks demonstrate that our proposed method achieves state-of-the-art performance in both subjective and objective evaluations. The code will be publicly available soon.
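To make the destruction-reconstruction scheme concrete, the minimal Python sketch below shows how the training pairs described in the abstract can be built: one of the three corruptions is sampled at random, applied to a natural image, and the corrupted image together with the original forms an input/target pair for reconstruction. This is an illustrative sketch, not the authors' released code; the specific transforms (a random gamma curve, a brightness scale, additive Gaussian noise) and all parameter ranges are assumptions.

```python
# Minimal sketch of destruction-reconstruction pair construction.
# Transform choices and parameter ranges are illustrative assumptions,
# not the paper's exact settings. Images are float arrays in [0, 1].
import random
import numpy as np

def nonlinear_intensity(img: np.ndarray) -> np.ndarray:
    """Pixel-intensity non-linear transform (auxiliary task for multi-modal fusion).
    A random gamma curve is one plausible instantiation."""
    gamma = random.uniform(0.4, 2.5)  # assumed range
    return np.clip(img, 0.0, 1.0) ** gamma

def brightness_transform(img: np.ndarray) -> np.ndarray:
    """Brightness transform (auxiliary task for multi-exposure fusion)."""
    scale = random.uniform(0.3, 1.7)  # assumed range
    return np.clip(img * scale, 0.0, 1.0)

def noise_transform(img: np.ndarray) -> np.ndarray:
    """Noise transform (auxiliary task for multi-focus fusion).
    Additive Gaussian noise stands in here for the paper's corruption."""
    sigma = random.uniform(0.01, 0.1)  # assumed range
    return np.clip(img + np.random.normal(0.0, sigma, img.shape), 0.0, 1.0)

DESTRUCTIONS = [nonlinear_intensity, brightness_transform, noise_transform]

def training_pair(natural_img: np.ndarray):
    """Randomly pick one destruction; the network reconstructs the original."""
    destroyed = random.choice(DESTRUCTIONS)(natural_img)
    return destroyed, natural_img  # (network input, reconstruction target)
```

Because the corruption is sampled per image, a single encoder-decoder sees all three degradation types during training, which is what allows the three auxiliary tasks to promote each other as the abstract describes.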
