Paper Title

Joint Hand Motion and Interaction Hotspots Prediction from Egocentric Videos

Paper Authors

Shaowei Liu, Subarna Tripathi, Somdeb Majumdar, Xiaolong Wang

Paper Abstract

We propose to forecast future hand-object interactions given an egocentric video. Instead of predicting action labels or pixels, we directly predict the hand motion trajectory and the future contact points on the next active object (i.e., interaction hotspots). This relatively low-dimensional representation provides a concrete description of future interactions. To tackle this task, we first provide an automatic way to collect trajectory and hotspot labels on large-scale data. We then use this data to train an Object-Centric Transformer (OCT) model for prediction. Our model performs hand and object interaction reasoning via the self-attention mechanism in Transformers. OCT also provides a probabilistic framework to sample the future trajectory and hotspots, handling the uncertainty inherent in prediction. We perform experiments on the Epic-Kitchens-55, Epic-Kitchens-100, and EGTEA Gaze+ datasets, and show that OCT outperforms state-of-the-art approaches by a large margin. The project page is available at https://stevenlsw.github.io/hoi-forecast.
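The abstract names two key ideas: self-attention over hand and object tokens for interaction reasoning, and a probabilistic head that samples multiple future trajectories and hotspots to handle uncertainty. The PyTorch sketch below illustrates that general design only; the class `ObjectCentricTransformerSketch`, all dimensions, and the standard-normal latent code are illustrative assumptions, not the authors' actual OCT implementation.

```python
# A minimal, hypothetical sketch of the object-centric design described in the
# abstract: hand and object features are tokens, self-attention performs
# interaction reasoning, and a sampled latent code injects stochasticity so
# that multiple future trajectories/hotspots can be drawn.
import torch
import torch.nn as nn


class ObjectCentricTransformerSketch(nn.Module):
    def __init__(self, feat_dim=256, latent_dim=32, horizon=4):
        super().__init__()
        self.horizon = horizon
        self.latent_dim = latent_dim
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=8, batch_first=True
        )
        # Self-attention over the concatenated hand + object tokens is where
        # hand-object interaction reasoning happens.
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=4)
        # Heads map a fused token plus a sampled latent to future 2D hand
        # waypoints and a per-object contact point (assumed output formats).
        self.traj_head = nn.Linear(feat_dim + latent_dim, horizon * 2)
        self.hotspot_head = nn.Linear(feat_dim + latent_dim, 2)

    def forward(self, hand_tok, obj_toks, num_samples=3):
        # hand_tok: (B, 1, D) hand feature; obj_toks: (B, N, D) object features.
        tokens = torch.cat([hand_tok, obj_toks], dim=1)
        fused = self.encoder(tokens)                   # (B, 1+N, D)
        hand_ctx, obj_ctx = fused[:, :1], fused[:, 1:]
        trajs, hotspots = [], []
        for _ in range(num_samples):
            # Draw a fresh latent per sample to model prediction uncertainty.
            # A learned CVAE-style prior is one plausible choice; a standard
            # normal is used here purely for illustration.
            z = torch.randn(
                hand_ctx.size(0), 1, self.latent_dim, device=hand_ctx.device
            )
            traj = self.traj_head(torch.cat([hand_ctx, z], dim=-1))
            hot = self.hotspot_head(
                torch.cat([obj_ctx, z.expand(-1, obj_ctx.size(1), -1)], dim=-1)
            )
            trajs.append(traj.view(-1, self.horizon, 2))  # (B, T, 2) waypoints
            hotspots.append(hot)                          # (B, N, 2) contacts
        return trajs, hotspots


# Tiny smoke test with random features.
model = ObjectCentricTransformerSketch()
trajs, hotspots = model(torch.randn(2, 1, 256), torch.randn(2, 5, 256))
print(trajs[0].shape, hotspots[0].shape)  # (2, 4, 2) and (2, 5, 2)
```

Drawing several latent samples per input is what turns the model into a probabilistic forecaster: each sample yields a distinct plausible future trajectory and hotspot set, rather than a single deterministic prediction.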
