Paper Title
Show Me What I Like: Detecting User-Specific Video Highlights Using Content-Based Multi-Head Attention
Paper Authors
Paper Abstract
We propose a method to detect individualized highlights for users on given target videos based on their preferred highlight clips marked on previous videos they have watched. Our method explicitly leverages the contents of both the preferred clips and the target videos using pre-trained features for the objects and the human activities. We design a multi-head attention mechanism to adaptively weigh the preferred clips based on their object- and human-activity-based contents, and fuse them using these weights into a single feature representation for each user. We compute similarities between these per-user feature representations and the per-frame features computed from the desired target videos to estimate the user-specific highlight clips from the target videos. We test our method on a large-scale highlight detection dataset containing the annotated highlights of individual users. Compared to current baselines, we observe an absolute improvement of 2-4% in the mean average precision of the detected highlights. We also perform extensive ablation experiments on the number of preferred highlight clips associated with each user as well as on the object- and human-activity-based feature representations to validate that our method is indeed both content-based and user-specific.
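To make the described pipeline concrete, below is a minimal PyTorch sketch of the core idea: multi-head attention fuses a user's preferred clip features into a single per-user representation, which is then compared against per-frame features of the target video to produce highlight scores. This is not the authors' implementation; the feature dimension (512), the number of heads (4), the learnable pooling query, and the use of cosine similarity are all illustrative assumptions.

```python
# A minimal sketch of the described pipeline, NOT the authors' code.
# Assumptions (hypothetical): 512-dim pre-trained object/activity features,
# 4 attention heads, a learnable user query, and cosine-similarity scoring.
import torch
import torch.nn as nn
import torch.nn.functional as F

class UserHighlightScorer(nn.Module):
    def __init__(self, feat_dim: int = 512, num_heads: int = 4):
        super().__init__()
        # Multi-head attention adaptively weighs the user's preferred clips.
        self.attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)
        # Learnable query that pools the attended clips into one user vector.
        self.user_query = nn.Parameter(torch.randn(1, 1, feat_dim))

    def forward(self, clip_feats: torch.Tensor, frame_feats: torch.Tensor):
        """
        clip_feats:  (1, num_clips, feat_dim)  features of preferred clips
        frame_feats: (1, num_frames, feat_dim) per-frame target-video features
        returns:     (num_frames,) per-frame highlight scores in [-1, 1]
        """
        # Fuse the preferred clips into a single per-user representation.
        user_repr, _ = self.attn(self.user_query, clip_feats, clip_feats)
        # Similarity between the user vector and each target-video frame.
        scores = F.cosine_similarity(user_repr, frame_feats, dim=-1)
        return scores.squeeze(0)

# Usage with random stand-in features (object- and activity-based in the paper).
scorer = UserHighlightScorer()
clips = torch.randn(1, 8, 512)     # 8 preferred highlight clips for one user
frames = torch.randn(1, 300, 512)  # 300 frames of a target video
print(scorer(clips, frames).shape)  # torch.Size([300])
```

In this sketch, frames whose scores exceed a threshold would be grouped into the user-specific highlight clips; the paper's actual fusion weights, feature extractors, and scoring function may differ.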