Paper Title
Data-efficient Alignment of Multimodal Sequences by Aligning Gradient Updates and Internal Feature Distributions
Paper Authors
Paper Abstract
The task of video and text sequence alignment is a prerequisite step toward joint understanding of movie videos and screenplays. However, supervised methods face the obstacle of limited realistic training data. In this paper, we attempt to enhance the data efficiency of the end-to-end alignment network NeuMATCH [15]. Recent research [56] suggests that network components dealing with different modalities may overfit and generalize at different speeds, creating difficulties for training. We propose to employ (1) layer-wise adaptive rate scaling (LARS) to align the magnitudes of gradient updates across layers and balance the pace of learning, and (2) sequence-wise batch normalization (SBN) to align the internal feature distributions of the different modalities. Finally, we leverage random projection to reduce the dimensionality of input features. On the YouTube Movie Summary dataset, the combined use of these techniques closes the performance gap when pretraining on the LSMDC dataset is omitted, and achieves state-of-the-art results. Extensive empirical comparisons and analysis reveal that these techniques improve optimization and regularize the network more effectively than two different setups of layer normalization.
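To make the LARS idea concrete: LARS scales each layer's learning rate by a trust ratio of the weight norm to the gradient norm, so that update magnitudes stay proportional across layers that learn at different speeds. The following is a minimal single-layer sketch; the function name, hyperparameter values, and the exact trust-ratio formula (taken from the original LARS formulation, not necessarily this paper's variant) are illustrative assumptions.

```python
import math

def lars_update(weights, grads, base_lr=0.1, trust_coef=0.001, weight_decay=1e-4):
    """One LARS step for a single layer (illustrative sketch).

    The local learning rate is base_lr scaled by the layer-wise trust
    ratio ||w|| / (||g|| + weight_decay * ||w||), so layers with large
    weights but small gradients are not starved of updates, and layers
    with large gradients do not take disproportionately big steps.
    """
    w_norm = math.sqrt(sum(w * w for w in weights))
    g_norm = math.sqrt(sum(g * g for g in grads))
    # Trust ratio keeps the update magnitude proportional to the weight magnitude.
    trust_ratio = trust_coef * w_norm / (g_norm + weight_decay * w_norm + 1e-9)
    local_lr = base_lr * trust_ratio
    # Standard SGD step with weight decay, using the layer-local learning rate.
    return [w - local_lr * (g + weight_decay * w) for w, g in zip(weights, grads)]
```

Applied per layer, this balances the pace of learning between, e.g., the video and text encoder branches without hand-tuning a separate learning rate for each.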