Paper Title

Visual-Assisted Sound Source Depth Estimation in the Wild

Authors

Wei Sun, Lili Qiu

Abstract

Depth estimation enables a wide variety of 3D applications, such as robotics, autonomous driving, and virtual reality. Despite significant work in this area, it remains an open problem how to achieve accurate, low-cost, high-resolution, and long-range depth estimation. Inspired by the flash-to-bang phenomenon (i.e., hearing thunder after seeing lightning), this paper develops FBDepth, the first audio-visual depth estimation framework. It uses the difference between the time-of-flight (ToF) of light and that of sound to infer the sound source's depth. FBDepth is the first to combine video and audio with both semantic features and spatial hints for range estimation. It first aligns the video track with the audio track to locate the target object and target sound at a coarse granularity. Based on observations of moving objects' trajectories, FBDepth estimates the intersection of the optical flow before and after sound production to localize the video event in time. It then feeds the estimated timestamp of the video event together with the audio clip into the final depth estimation. We use a mobile phone to collect 3000+ video clips of 20 different objects at distances up to 50 m. FBDepth decreases the Absolute Relative error (AbsRel) by 55% compared to RGB-based methods.
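
For intuition, the flash-to-bang geometry reduces to a one-line formula: if the camera sees the sound-producing event at time t_video and the microphone hears it at t_audio, the gap dt = t_audio - t_video equals d/v_sound - d/c, so the depth is d = dt / (1/v_sound - 1/c), which is approximately dt * v_sound since light's travel time is negligible at these ranges. The sketch below illustrates this computation together with the AbsRel metric quoted above; the function names, the fixed speed of sound (343 m/s at roughly 20 °C), and the example timestamps are illustrative assumptions, not FBDepth's actual pipeline.

    # Minimal sketch of flash-to-bang depth estimation (illustrative, not
    # FBDepth's actual pipeline). Assumes the video-event timestamp and the
    # sound's arrival timestamp have already been extracted upstream.

    SPEED_OF_SOUND = 343.0   # m/s in air at ~20 C (assumed constant)
    SPEED_OF_LIGHT = 3.0e8   # m/s

    def flash_to_bang_depth(t_video: float, t_audio: float) -> float:
        """Depth from the ToF difference between light and sound.

        Light arrives at t0 + d/c and sound at t0 + d/v, so the observed
        gap dt = d * (1/v - 1/c), giving d = dt / (1/v - 1/c) ~= dt * v.
        """
        dt = t_audio - t_video
        return dt / (1.0 / SPEED_OF_SOUND - 1.0 / SPEED_OF_LIGHT)

    def abs_rel(preds, gts):
        """Absolute Relative error: mean over samples of |pred - gt| / gt."""
        return sum(abs(p - g) / g for p, g in zip(preds, gts)) / len(preds)

    # Example: a 145.7 ms light-to-sound gap corresponds to roughly 50 m.
    print(flash_to_bang_depth(t_video=1.0000, t_audio=1.1457))  # ~49.98
    print(abs_rel([49.98], [50.0]))                             # ~0.0004

Note the approximation d ≈ dt * v_sound: at 50 m, light's travel time is on the order of 0.17 microseconds, far below any camera frame interval, so the light term only matters in the exact formula.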
