Self-supervised monocular depth estimation from oblique UAV videos

Paper title

Self-supervised monocular depth estimation from oblique UAV videos

Authors

Logambal Madhuanand, Francesco Nex, Michael Ying Yang

Abstract

UAVs have become an essential tool for photogrammetric measurement because they are affordable, easily accessible, and versatile. Aerial images captured from UAVs are used in small- and large-scale texture mapping, 3D modelling, object detection, and DTM and DSM generation, among other tasks. Photogrammetric techniques are routinely used for 3D reconstruction from UAV images, which requires acquiring multiple images of the same scene. Developments in computer vision and deep learning have made Single Image Depth Estimation (SIDE) a field of intense research. Applying SIDE techniques to UAV images can remove the need for multiple images in 3D reconstruction. This paper aims to estimate depth from a single UAV aerial image using deep learning. We follow a self-supervised learning approach, Self-Supervised Monocular Depth Estimation (SMDE), which needs neither ground-truth depth nor any information other than images to learn to estimate depth. Monocular video frames are used to train the deep learning model, which learns depth and pose jointly through two separate networks, one for depth and one for pose. The predicted depth and pose are used to reconstruct one image from the viewpoint of another, exploiting the temporal information in the videos. We propose a novel architecture with two 2D CNN encoders and a 3D CNN decoder for extracting information from consecutive temporal frames. A contrastive loss term is introduced to improve the quality of image generation. Our experiments are carried out on the public UAVid video dataset. The experimental results demonstrate that our model outperforms state-of-the-art methods in depth estimation.
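The view-reconstruction step mentioned in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it only shows the standard pinhole reprojection that self-supervised depth methods commonly rely on, and the function name `reproject_pixels`, the intrinsics matrix `K`, and the 4x4 transform `T_target_to_source` are hypothetical names chosen for illustration. In a real pipeline the depth map and the pose would come from the two networks, and the returned coordinates would be used to sample the source frame and compare the result with the target frame.

```python
import numpy as np

def reproject_pixels(depth, K, T_target_to_source):
    """Map every target-frame pixel to its coordinates in the source frame.

    depth: (h, w) depth map for the target frame (assumed positive).
    K: (3, 3) pinhole camera intrinsics (assumed shared by both frames).
    T_target_to_source: (4, 4) rigid transform from target to source camera.
    Returns an (h, w, 2) array of (u, v) source-image coordinates.
    """
    h, w = depth.shape
    # Homogeneous pixel grid of shape (3, h*w): rows are u, v, 1.
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pixels = np.stack([u.ravel(), v.ravel(), np.ones(h * w)])
    # Back-project each pixel to a 3D point in the target camera frame.
    points = (np.linalg.inv(K) @ pixels) * depth.ravel()
    # Move the points into the source camera frame.
    points_h = np.vstack([points, np.ones(h * w)])
    points_src = (T_target_to_source @ points_h)[:3]
    # Project onto the source image plane and dehomogenise.
    projected = K @ points_src
    uv = projected[:2] / projected[2]
    return uv.T.reshape(h, w, 2)

# Usage: a small sideways camera shift at constant depth.
K = np.array([[100.0, 0.0, 2.0],
              [0.0, 100.0, 1.5],
              [0.0, 0.0, 1.0]])
depth = np.full((3, 4), 10.0)
T = np.eye(4)
T[0, 3] = 0.5  # translate points by 0.5 units along the x axis
coords = reproject_pixels(depth, K, T)
```

With the identity transform every pixel maps to itself, which is a quick sanity check; with the translation above, each pixel shifts horizontally by focal length times translation divided by depth.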
