Paper Title
Weakly-Supervised Action Localization and Action Recognition using Global-Local Attention of 3D CNN
Paper Authors
Paper Abstract
A 3D Convolutional Neural Network (3D CNN) captures spatial and temporal information in 3D data such as video sequences. However, due to the convolution and pooling mechanisms, information loss seems unavoidable. To improve the visual explanations and classification in 3D CNN, we propose two approaches: i) aggregating layer-wise global-to-local (global-local) discrete gradients using a trained 3DResNext network, and ii) implementing an attention gating network to improve the accuracy of action recognition. The proposed approach intends to show the usefulness of every layer, termed global-local attention, in 3D CNN via visual attribution, weakly-supervised action localization, and action recognition. First, the 3DResNext is trained and applied for action classification, and backpropagation is performed with respect to the maximum predicted class. The gradients and activations of every layer are then up-sampled. Later, aggregation is used to produce more nuanced attention, which points out the most critical parts of the input video for the predicted class. We use contour thresholding of the final attention for localization. We evaluate spatial and temporal action localization in trimmed videos using fine-grained visual explanation via 3DCam. Experimental results show that the proposed approach produces informative visual explanations and discriminative attention. Furthermore, action recognition via attention gating on each layer yields better classification results than the baseline model.
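The layer-wise pipeline described in the abstract (backpropagation with respect to the maximum predicted class, gradient and activation extraction per layer, up-sampling, and aggregation into a single attention volume) can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: it assumes a PyTorch 3D ResNeXt-style model, and the hook registration, the gradient-times-activation map, the averaging aggregation, and the helper name global_local_attention are hypothetical choices for illustration.

```python
# Minimal sketch (assumptions noted above): layer-wise gradient-activation
# attention for a 3D CNN, up-sampled and aggregated across layers.
import torch
import torch.nn.functional as F

def global_local_attention(model, video, layer_names):
    """video: (1, C, T, H, W) tensor; layer_names: names of modules to hook (assumed)."""
    activations, gradients = {}, {}

    def save_act(name):
        return lambda mod, inp, out: activations.__setitem__(name, out)

    def save_grad(name):
        return lambda mod, g_in, g_out: gradients.__setitem__(name, g_out[0])

    handles = []
    for name, module in model.named_modules():
        if name in layer_names:
            handles.append(module.register_forward_hook(save_act(name)))
            handles.append(module.register_full_backward_hook(save_grad(name)))

    logits = model(video)                      # forward pass
    cls = logits.argmax(dim=1).item()          # maximum predicted class
    model.zero_grad()
    logits[0, cls].backward()                  # backprop w.r.t. the predicted class

    maps = []
    for name in layer_names:
        g, a = gradients[name], activations[name]
        cam = F.relu((g * a).sum(dim=1, keepdim=True))     # gradient x activation
        cam = F.interpolate(cam, size=video.shape[2:],     # up-sample to input size
                            mode='trilinear', align_corners=False)
        cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
        maps.append(cam)

    for h in handles:
        h.remove()
    # aggregate deep (global) and shallow (local) attention, here by simple averaging
    return torch.stack(maps).mean(dim=0)       # (1, 1, T, H, W) attention volume
```

The returned attention volume could then be thresholded (e.g. via contour thresholding of each frame's attention map) to obtain the spatio-temporal localization described in the abstract; the choice of hooked layers and the aggregation rule here are placeholders, not the paper's reported settings.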