Paper Title

F2Net: Learning to Focus on the Foreground for Unsupervised Video Object Segmentation

Paper Authors

Daizong Liu, Dongdong Yu, Changhu Wang, Pan Zhou

Paper Abstract

Although deep-learning-based methods have achieved great progress in unsupervised video object segmentation, difficult scenarios (e.g., visual similarity, occlusions, and appearance changes) are still not well handled. To alleviate these issues, we propose a novel Focus on Foreground Network (F2Net), which delves into the intra- and inter-frame details of the foreground objects and thus effectively improves segmentation performance. Specifically, our proposed network consists of three main parts: a Siamese Encoder Module, a Center Guiding Appearance Diffusion Module, and a Dynamic Information Fusion Module. First, we use a Siamese encoder to extract the feature representations of paired frames (the reference frame and the current frame). Then, a Center Guiding Appearance Diffusion Module is designed to capture the inter-frame feature (dense correspondences between the reference frame and the current frame), the intra-frame feature (dense correspondences within the current frame), and the original semantic feature of the current frame. Specifically, we establish a Center Prediction Branch to predict the center location of the foreground object in the current frame and leverage the center point information as a spatial guidance prior to enhance the inter-frame and intra-frame feature extraction, so that the feature representations focus considerably on the foreground objects. Finally, we propose a Dynamic Information Fusion Module to automatically select the relatively important features among the three aforementioned feature levels. Extensive experiments on the DAVIS2016, YouTube-Objects, and FBMS datasets show that our proposed F2Net achieves state-of-the-art performance with significant improvements.
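The abstract does not specify how the predicted center point is turned into a spatial guidance prior. As a minimal, hedged illustration of the general idea (not the authors' actual implementation), the sketch below builds a Gaussian weight map peaking at a predicted foreground center and uses it to reweight a single-channel feature map; the Gaussian form, the `sigma` parameter, and both function names are assumptions introduced purely for illustration.

```python
import math

def gaussian_center_prior(h, w, cy, cx, sigma=2.0):
    """Build an h x w Gaussian map peaking at the predicted center (cy, cx).

    NOTE: the Gaussian shape is an assumption for illustration; the paper
    only states that the center point serves as a spatial guidance prior.
    """
    return [[math.exp(-((y - cy) ** 2 + (x - cx) ** 2) / (2 * sigma ** 2))
             for x in range(w)]
            for y in range(h)]

def modulate(feature, prior):
    """Element-wise reweighting of a single-channel feature map by the prior,
    so that responses near the predicted foreground center are emphasized."""
    return [[f * p for f, p in zip(frow, prow)]
            for frow, prow in zip(feature, prior)]
```

For example, `gaussian_center_prior(5, 5, 2, 2)` yields a map whose value is 1.0 at the center cell `(2, 2)` and decays toward the borders, so `modulate` suppresses features far from the predicted foreground object.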
