Paper Title

Joint Engagement Classification using Video Augmentation Techniques for Multi-person Human-robot Interaction

Paper Authors

Yubin Kim, Huili Chen, Sharifa Alghowinem, Cynthia Breazeal, Hae Won Park

Paper Abstract

Affect understanding capability is essential for social robots to autonomously interact with a group of users in an intuitive and reciprocal way. However, the challenge of multi-person affect understanding comes not only from accurately perceiving each user's affective state (e.g., engagement) but also from recognizing the affective interplay between members (e.g., joint engagement), which manifests as complex yet subtle nonverbal exchanges between them. Here we present a novel hybrid framework for identifying a parent-child dyad's joint engagement by combining a deep learning framework with various video augmentation techniques. Using a dataset of parent-child dyads reading storybooks together with a social robot at home, we first train RGB frame- and skeleton-based joint engagement recognition models on datasets augmented with four video augmentation techniques (General Aug, DeepFake, CutOut, and Mixed) to improve joint engagement classification performance. Second, we demonstrate experimental results on the use of the trained models in the robot-parent-child interaction context. Third, we introduce a behavior-based metric for evaluating the models' learned representations to investigate model interpretability when recognizing joint engagement. This work serves as a first step toward fully unlocking the potential of end-to-end video understanding models, pre-trained on large public datasets and enhanced with data augmentation and visualization techniques, for affect recognition in multi-person human-robot interaction in the wild.
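
The abstract names CutOut as one of the video augmentation techniques applied to the training data. As a rough illustration of the general idea only (not the paper's actual implementation; the mask size, the per-clip sampling of a single occlusion location, and the function name below are assumptions made for this sketch), the following Python snippet occludes the same square region across every frame of a clip:

```python
import numpy as np

def cutout_clip(frames, mask_size=56, rng=None):
    """CutOut-style occlusion for a video clip.

    frames: np.ndarray of shape (T, H, W, C) holding RGB frames.
    mask_size: side length of the square region to zero out (assumed value).
    One random location is sampled per clip so the occlusion stays
    temporally consistent across frames; this is a common choice for
    video CutOut, not necessarily the one used in the paper.
    """
    rng = rng or np.random.default_rng()
    t, h, w, c = frames.shape
    y = rng.integers(0, max(h - mask_size, 1))
    x = rng.integers(0, max(w - mask_size, 1))
    out = frames.copy()
    # Zero out the same patch in all frames of the clip.
    out[:, y:y + mask_size, x:x + mask_size, :] = 0
    return out

# Example usage: augment a 32-frame, 224x224 RGB clip.
clip = np.random.randint(0, 256, size=(32, 224, 224, 3), dtype=np.uint8)
augmented = cutout_clip(clip, mask_size=56)
```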
