Paper Title
An Empirical Study of Remote Sensing Pretraining
Authors
Abstract
Deep learning has largely reshaped remote sensing (RS) research for aerial image understanding and achieved great success. Nevertheless, most existing deep models are initialized with ImageNet pretrained weights. Since natural images inevitably present a large domain gap relative to aerial images, ImageNet pretraining probably limits the finetuning performance on downstream aerial scene tasks. This issue motivates us to conduct an empirical study of remote sensing pretraining (RSP) on aerial images. To this end, we train different networks from scratch on the largest RS scene recognition dataset to date, MillionAID, to obtain a series of RS pretrained backbones, including both convolutional neural networks (CNNs) and vision transformers such as Swin and ViTAE, which have shown promising performance on computer vision tasks. We then investigate the impact of RSP on representative downstream tasks, including scene recognition, semantic segmentation, object detection, and change detection, using these CNN and vision transformer backbones. The empirical study shows that RSP helps deliver distinctive performance on scene recognition tasks and in perceiving RS-related semantics such as "Bridge" and "Airplane". We also find that, although RSP mitigates the data discrepancies of traditional ImageNet pretraining on RS images, it may still suffer from task discrepancies, where downstream tasks require representations different from those learned for scene recognition. These findings call for further research efforts on both large-scale pretraining datasets and effective pretraining methods. The code and pretrained models will be released at https://github.com/ViTAE-Transformer/ViTAE-Transformer-Remote-Sensing.