Paper Title


Efficient video annotation with visual interpolation and frame selection guidance

Authors

Kuznetsova, A., Talati, A., Luo, Y., Simmons, K., Ferrari, V.

Abstract

We introduce a unified framework for generic video annotation with bounding boxes. Video annotation is a longstanding problem, as it is a tedious and time-consuming process. We tackle two important challenges of video annotation: (1) automatic temporal interpolation and extrapolation of bounding boxes provided by a human annotator on a subset of all frames, and (2) automatic selection of frames to annotate manually. Our contribution is two-fold: first, we propose a model that has both interpolating and extrapolating capabilities; second, we propose a guiding mechanism that sequentially generates suggestions for what frame to annotate next, based on the annotations made previously. We extensively evaluate our approach on several challenging datasets in simulation and demonstrate a reduction in terms of the number of manual bounding boxes drawn by 60% over linear interpolation and by 35% over an off-the-shelf tracker. Moreover, we also show 10% annotation time improvement over a state-of-the-art method for video annotation with bounding boxes [25]. Finally, we run human annotation experiments and provide extensive analysis of the results, showing that our approach reduces actual measured annotation time by 50% compared to commonly used linear interpolation.
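The abstract's main baseline is linear interpolation, where an annotator draws boxes on a sparse set of keyframes and intermediate boxes are filled in coordinate-wise. A minimal sketch of that baseline (function names are illustrative, not from the paper's code):

```python
# Linear-interpolation baseline for video box annotation: boxes on
# intermediate frames are interpolated coordinate-wise between two
# manually annotated keyframes.

def lerp_box(box_a, box_b, t):
    """Linearly interpolate two boxes (x1, y1, x2, y2) at fraction t in [0, 1]."""
    return tuple(a + t * (b - a) for a, b in zip(box_a, box_b))

def interpolate_track(keyframes):
    """Fill every frame between consecutive annotated keyframes.

    keyframes: dict mapping frame index -> (x1, y1, x2, y2).
    Returns a dict covering all frames between the first and last keyframe.
    """
    frames = sorted(keyframes)
    track = dict(keyframes)
    for f0, f1 in zip(frames, frames[1:]):
        for f in range(f0 + 1, f1):
            t = (f - f0) / (f1 - f0)
            track[f] = lerp_box(keyframes[f0], keyframes[f1], t)
    return track

# Example: annotate frames 0 and 4; frames 1-3 are interpolated.
track = interpolate_track({0: (0, 0, 10, 10), 4: (40, 0, 50, 10)})
print(track[2])  # midpoint box (20.0, 0.0, 30.0, 10.0)
```

The paper's contribution replaces both parts of this baseline: the linear fill is replaced by a learned interpolation/extrapolation model, and the uniform keyframe choice is replaced by a guidance mechanism that suggests which frame to annotate next.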
