Paper Title

SACT: Self-Aware Multi-Space Feature Composition Transformer for Multinomial Attention for Video Captioning

Paper Author

Sur, Chiranjib

Paper Abstract

Video captioning rests on two fundamental concepts: feature detection and feature composition. While modern transformers are effective at composing features, they suffer from the fundamental problem of selecting and understanding content. As the feature length increases, it becomes increasingly important to include provisions for better capture of the pertinent content. In this work, we introduce a new concept, the Self-Aware Composition Transformer (SACT), which is capable of generating Multinomial Attention (MultAtt), a way of generating distributions over various combinations of frames. Moreover, the multi-head attention transformer works on the principle of combining all possible content for attention, which is beneficial for natural language classification but has limitations for video captioning. Video content contains repetitions and requires parsing of the important content for better content composition. In this work, we introduce SACT to provide more selective attention and combine these selections across different attention heads to better capture the usable content for any application. To address the problem of diversification and to encourage selective utilization, we propose the Self-Aware Composition Transformer model for dense video captioning and apply the technique to two benchmark datasets, ActivityNet and YouCookII.
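The abstract does not specify how SACT's selective attention or MultAtt is computed, so as a minimal illustrative sketch of the general idea, the following contrasts standard (dense) attention, which weighs all frames, with a selective variant that lets each query attend only to its top-k highest-scoring frames before the heads are combined. All function names, the top-k masking scheme, and the per-head split are assumptions for illustration, not the paper's actual method.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax; masked (-inf) scores become weight 0."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def selective_attention(Q, K, V, k):
    """Single-head attention restricted to the top-k frames per query.

    Q: (n_queries, d); K, V: (n_frames, d). Scores outside each row's
    top-k are masked to -inf before the softmax, so every query attends
    to a sparse subset of frames instead of all of them. With
    k = n_frames this reduces to standard scaled dot-product attention.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                    # (n_queries, n_frames)
    # threshold each row at its k-th largest score and mask the rest
    thresh = np.sort(scores, axis=-1)[:, -k][:, None]
    masked = np.where(scores >= thresh, scores, -np.inf)
    return softmax(masked) @ V

def multi_head_selective(Q, K, V, n_heads, k):
    """Split the feature dimension across heads; each head applies its
    own sparse (top-k) attention, and the head outputs are concatenated,
    loosely mirroring the idea of combining selective attentions across
    different attention heads."""
    hd = Q.shape[-1] // n_heads
    outs = [selective_attention(Q[:, i*hd:(i+1)*hd],
                                K[:, i*hd:(i+1)*hd],
                                V[:, i*hd:(i+1)*hd], k)
            for i in range(n_heads)]
    return np.concatenate(outs, axis=-1)
```

Because each head thresholds its own scores, different heads can select different frame subsets, which is one plausible way to read the abstract's goal of diverse yet selective content composition.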
