Paper Title
Spatio-Temporal Event Segmentation and Localization for Wildlife Extended Videos
Paper Authors
Paper Abstract
Using offline training schemes, researchers have tackled the event segmentation problem by providing full or weak supervision through manually annotated labels or self-supervised epoch-based training. Most works consider videos that are at most tens of minutes long. We present a self-supervised perceptual prediction framework capable of temporal event segmentation by building stable representations of objects over time, and we demonstrate it on long videos spanning several days. The approach is deceptively simple but quite effective. We rely on predictions of high-level features computed by a standard deep learning backbone. For prediction, we use an LSTM, augmented with an attention mechanism, trained in a self-supervised manner using the prediction error. The self-learned attention maps effectively localize and track the event-related objects in each frame. The proposed approach does not require labels. It requires only a single pass through the video, with no separate training set. Given the lack of datasets of very long videos, we demonstrate our method on 10 days (254 hours) of continuous wildlife monitoring video that we collected with the required permissions. We find that the approach is robust to various environmental conditions such as day/night transitions, rain, sharp shadows, and wind. For the task of temporally locating events, we achieve an 80% recall rate at a 20% false-positive rate for frame-level segmentation. At the activity level, we achieve an 80% activity recall rate with one false activity detection every 50 minutes. We will make the dataset, the first of its kind, and the code available to the research community.
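To make the described pipeline concrete, below is a minimal PyTorch sketch of the kind of attention-augmented LSTM predictor the abstract outlines: backbone features are attention-pooled, an LSTM predicts the next frame's features, and the prediction error serves as both the self-supervised training signal and the event-boundary indicator. This is an illustration under assumptions, not the authors' implementation; `AttentivePredictor`, `feature_stream`, the pooling choices, and all hyperparameters are invented for this sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentivePredictor(nn.Module):
    """Attention-augmented LSTM that predicts the next frame's pooled features.
    A hypothetical module for illustration, not the paper's actual architecture."""
    def __init__(self, feat_dim=512, hidden=256):
        super().__init__()
        self.attn = nn.Conv2d(feat_dim, 1, kernel_size=1)  # per-location attention logits
        self.lstm = nn.LSTMCell(feat_dim, hidden)
        self.head = nn.Linear(hidden, feat_dim)             # maps LSTM state to a feature prediction

    def forward(self, feat_map, state=None):
        # feat_map: (B, C, H, W) features from a frozen backbone (e.g. a ResNet stage)
        logits = self.attn(feat_map).flatten(2)              # (B, 1, H*W)
        attn = torch.softmax(logits, dim=-1)                 # spatial attention over locations
        pooled = (feat_map.flatten(2) * attn).sum(-1)        # (B, C) attention-weighted feature
        h, c = self.lstm(pooled, state)
        return self.head(h), (h, c), attn

model = AttentivePredictor()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
state, prev_map, errors = None, None, []

# Single pass over the video: no labels, no separate training set.
for feat_map in feature_stream():  # hypothetical generator of per-frame backbone features
    target = feat_map.flatten(2).mean(-1).detach()  # pooled features of the current frame
    if prev_map is not None:
        pred, state, attn = model(prev_map, state)
        loss = F.mse_loss(pred, target)              # prediction error = self-supervised signal
        opt.zero_grad()
        loss.backward()
        opt.step()
        state = (state[0].detach(), state[1].detach())  # truncate backprop between frames
        errors.append(loss.item())                   # error peaks suggest event boundaries
    prev_map = feat_map
```

Under this reading, thresholding the per-frame errors (or detecting sustained peaks) would yield the temporal event segments, while `attn` provides the per-frame spatial localization of event-related objects.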