Paper Title
Semantics-aware Adaptive Knowledge Distillation for Sensor-to-Vision Action Recognition
Paper Authors
Paper Abstract
Existing vision-based action recognition is susceptible to occlusion and appearance variations, while wearable sensors can alleviate these challenges by capturing human motion as one-dimensional time-series signals. For the same action, the knowledge learned from vision sensors and wearable sensors may be related and complementary. However, there exists a significant modality gap between action data captured by wearable sensors and vision sensors in data dimensionality, data distribution, and inherent information content. In this paper, we propose a novel framework, named Semantics-aware Adaptive Knowledge Distillation Networks (SAKDN), to enhance action recognition in the vision-sensor modality (videos) by adaptively transferring and distilling knowledge from multiple wearable sensors. SAKDN uses multiple wearable sensors as teacher modalities and RGB videos as the student modality. To preserve local temporal relationships and facilitate the use of visual deep learning models, we transform the one-dimensional time-series signals of wearable sensors into two-dimensional images by designing a Gramian Angular Field based virtual image generation model. Then, we build a novel Similarity-Preserving Adaptive Multi-modal Fusion Module to adaptively fuse intermediate representation knowledge from different teacher networks. Finally, to fully exploit and transfer the knowledge of multiple well-trained teacher networks to the student network, we propose a novel Graph-guided Semantically Discriminative Mapping loss, which utilizes graph-guided ablation analysis to produce a good visual explanation that highlights the important regions across modalities while concurrently preserving the interrelations of the original data. Experimental results on the Berkeley-MHAD, UTD-MHAD, and MMAct datasets demonstrate the effectiveness of our proposed SAKDN.
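The Gramian Angular Field transform mentioned in the abstract can be illustrated with a minimal sketch: a 1-D signal is rescaled to [-1, 1], each sample is mapped to a polar angle via arccos, and the pairwise cosine of angle sums yields a 2-D image. This is a generic GAF (summation variant) in NumPy for illustration only; the paper's actual virtual image generation model may differ in its details.

```python
import numpy as np

def gramian_angular_field(x):
    """Illustrative Gramian Angular (Summation) Field transform.

    Rescales a 1-D signal to [-1, 1], maps each sample to a polar
    angle phi = arccos(x), and builds the 2-D image
    G[i, j] = cos(phi_i + phi_j).
    """
    x = np.asarray(x, dtype=float)
    # Min-max rescale to [-1, 1] so arccos is well defined.
    x = 2.0 * (x - x.min()) / (x.max() - x.min()) - 1.0
    phi = np.arccos(np.clip(x, -1.0, 1.0))
    # Outer sum of angles, then cosine -> symmetric N x N image.
    return np.cos(phi[:, None] + phi[None, :])

# Example: a short accelerometer-like signal becomes an N x N image
# that a 2-D convolutional network can consume.
signal = np.sin(np.linspace(0, 3 * np.pi, 8))
img = gramian_angular_field(signal)
print(img.shape)  # (8, 8)
```

The resulting image preserves local temporal relationships because neighboring samples map to neighboring rows and columns, which is what makes standard visual backbones applicable to the sensor streams.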