Paper Title


Head and eye egocentric gesture recognition for human-robot interaction using eyewear cameras

Authors

Javier Marina-Miranda, V. Javier Traver

Abstract


Non-verbal communication plays a particularly important role in a wide range of scenarios in Human-Robot Interaction (HRI). Accordingly, this work addresses the problem of human gesture recognition. In particular, we focus on head and eye gestures, and adopt an egocentric (first-person) perspective using eyewear cameras. We argue that this egocentric view may offer a number of conceptual and technical benefits over scene- or robot-centric perspectives. A motion-based recognition approach is proposed, which operates at two temporal granularities. Locally, frame-to-frame homographies are estimated with a convolutional neural network (CNN). The output of this CNN is input to a long short-term memory (LSTM) to capture longer-term temporal visual relationships, which are relevant to characterize gestures. Regarding the configuration of the network architecture, one particularly interesting finding is that using the output of an internal layer of the homography CNN increases the recognition rate with respect to using the homography matrix itself. While this work focuses on action recognition, and no robot or user study has been conducted yet, the system has been designed to meet real-time constraints. The encouraging results suggest that the proposed egocentric perspective is viable, and this proof-of-concept work provides novel and useful contributions to the exciting area of HRI.
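The two-granularity pipeline described above (a CNN estimating frame-to-frame homographies, whose internal-layer features feed an LSTM for gesture classification) can be sketched as follows. This is a minimal PyTorch illustration, not the authors' implementation: the layer sizes, feature dimension, input resolution, and number of gesture classes are all assumptions.

```python
import torch
import torch.nn as nn

class HomographyCNN(nn.Module):
    """Estimates a frame-to-frame homography from a stacked pair of frames.
    It also exposes an internal feature vector: per the abstract, feeding
    this internal-layer output to the LSTM recognized gestures better than
    feeding the homography parameters themselves.
    (Architecture details here are illustrative assumptions.)"""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            # two grayscale frames stacked along the channel axis
            nn.Conv2d(2, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
        )
        self.fc_feat = nn.Linear(32 * 4 * 4, feat_dim)  # internal layer used as LSTM input
        self.fc_homo = nn.Linear(feat_dim, 8)           # 8-DoF homography parameterization

    def forward(self, pair):
        feat = torch.relu(self.fc_feat(self.conv(pair)))
        return self.fc_homo(feat), feat

class GestureLSTM(nn.Module):
    """LSTM over per-frame-pair motion features, then a gesture classifier."""
    def __init__(self, feat_dim=128, hidden=64, n_gestures=5):
        super().__init__()
        self.cnn = HomographyCNN(feat_dim)
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.cls = nn.Linear(hidden, n_gestures)

    def forward(self, clip):  # clip: (B, T, 2, H, W) sequence of frame pairs
        B, T = clip.shape[:2]
        _, feats = self.cnn(clip.flatten(0, 1))        # local motion features per pair
        _, (h, _) = self.lstm(feats.view(B, T, -1))    # longer-term temporal modeling
        return self.cls(h[-1])                         # gesture logits

model = GestureLSTM()
logits = model(torch.randn(1, 10, 2, 64, 64))  # 10 frame pairs of 64x64 grayscale frames
```

The design choice the finding points at is visible here: `fc_homo` collapses the feature vector to 8 numbers, so passing the richer `feat` tensor to the LSTM preserves motion information that the homography parameterization discards.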
