Paper title
Transformers in Action: Weakly Supervised Action Segmentation
Paper authors
Paper abstract
The video action segmentation task is regularly explored under weaker forms of supervision, such as transcript supervision, where an ordered list of actions is easier to obtain than dense frame-wise labels. In this formulation, the task presents various challenges for sequence modeling approaches due to its emphasis on action transition points, long sequence lengths, and frame contextualization, which makes it well suited to transformers. Given developments enabling transformers to scale linearly, we demonstrate through our architecture how they can be applied to improve action alignment accuracy over equivalent RNN-based models, with the attention mechanism focusing on salient action transition regions. Additionally, given the recent focus on inference-time transcript selection, we propose a supplemental transcript embedding approach to select transcripts more quickly at inference time. We further demonstrate that this approach can also improve overall segmentation performance. Finally, we evaluate our proposed methods across the benchmark datasets to better understand the applicability of transformers and the importance of transcript selection for this video-driven, weakly supervised task.