Paper Title

Cross-modal Representation Learning for Zero-shot Action Recognition

Paper Authors

Chung-Ching Lin, Kevin Lin, Linjie Li, Lijuan Wang, Zicheng Liu

Paper Abstract

We present a cross-modal Transformer-based framework, which jointly encodes video data and text labels for zero-shot action recognition (ZSAR). Our model employs a conceptually new pipeline by which visual representations are learned in conjunction with visual-semantic associations in an end-to-end manner. The model design provides a natural mechanism for visual and semantic representations to be learned in a shared knowledge space, which encourages the learned visual embeddings to be discriminative and more semantically consistent. For zero-shot inference, we devise a simple semantic transfer scheme that uses semantic relatedness between seen and unseen classes to composite unseen visual prototypes. Accordingly, the discriminative features in the visual structure can be preserved and exploited to alleviate the typical zero-shot issues of information loss, semantic gap, and the hubness problem. Under a rigorous zero-shot setting with no pre-training on additional datasets, the experimental results show that our model considerably improves upon the state of the art in ZSAR, reaching encouraging top-1 accuracy on the UCF101, HMDB51, and ActivityNet benchmark datasets. Code will be made available.
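
The semantic transfer step can be made concrete with a small sketch. The Python/PyTorch snippet below is an illustrative reading of the abstract, not the authors' implementation: it assumes class labels are embedded by a shared text encoder, composites each unseen visual prototype as a similarity-weighted sum of seen visual prototypes, and classifies a video embedding by its nearest unseen prototype. The function names, cosine similarity, and softmax temperature are assumptions made for illustration.

# Illustrative sketch only (an assumption, not the authors' released code):
# unseen visual prototypes are composited as similarity-weighted sums of
# seen visual prototypes, with weights derived from the relatedness of
# class-label text embeddings in the shared space.
import torch
import torch.nn.functional as F

def composite_unseen_prototypes(seen_visual_protos, seen_text_embs,
                                unseen_text_embs, temperature=0.1):
    # seen_visual_protos: (S, D) visual prototypes of the seen classes
    # seen_text_embs:     (S, D) text embeddings of the seen class labels
    # unseen_text_embs:   (U, D) text embeddings of the unseen class labels
    # Cosine similarity between unseen and seen label embeddings.
    sim = F.normalize(unseen_text_embs, dim=-1) @ F.normalize(seen_text_embs, dim=-1).T
    # Turn semantic relatedness into composition weights.
    weights = F.softmax(sim / temperature, dim=-1)   # (U, S)
    # Each unseen prototype is a weighted sum of seen visual prototypes.
    return weights @ seen_visual_protos              # (U, D)

def zero_shot_classify(video_embs, unseen_protos):
    # Assign each video embedding to its nearest unseen prototype.
    sim = F.normalize(video_embs, dim=-1) @ F.normalize(unseen_protos, dim=-1).T
    return sim.argmax(dim=-1)

# Toy usage with random tensors (dimensions are arbitrary placeholders).
protos = composite_unseen_prototypes(torch.randn(40, 512),
                                     torch.randn(40, 512),
                                     torch.randn(10, 512))
preds = zero_shot_classify(torch.randn(8, 512), protos)  # 8 predicted class indices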
