Paper Title
Lip-reading with Hierarchical Pyramidal Convolution and Self-Attention
Paper Authors
Paper Abstract
In this paper, we propose a novel deep learning architecture to improve word-level lip-reading. First, we introduce multi-scale processing into the spatial feature extraction for lip-reading. Specifically, we propose hierarchical pyramidal convolution (HPConv) to replace the standard convolution in the original module, which improves the model's ability to discover fine-grained lip movements. Second, we use self-attention to merge information across all time steps of the sequence, so that the model pays more attention to the relevant frames. Combining these two components further enhances the model's classification power. Experiments on the Lip Reading in the Wild (LRW) dataset show that our proposed model achieves 86.83% accuracy, a 1.53% absolute improvement over the current state of the art. We also conduct extensive experiments to better understand the behavior of the proposed model.
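
To make the temporal part of the abstract more concrete, the following is a minimal sketch of scaled dot-product self-attention applied across the frames of a clip. It is not the authors' implementation: the abstract does not specify the attention variant, so the single-head form, the projection matrices `w_q`, `w_k`, `w_v`, the sizes `T` and `D`, the random inputs, and the final mean pooling are all assumptions chosen for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def self_attention(frames, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over time.

    frames: array of shape (T, D), one feature vector per video frame.
    Returns the attended features (T, D_v) and the (T, T) weight matrix,
    where row t holds how much frame t attends to every other frame.
    """
    q = frames @ w_q  # queries, shape (T, D_k)
    k = frames @ w_k  # keys,    shape (T, D_k)
    v = frames @ w_v  # values,  shape (T, D_v)
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ v, weights


# Hypothetical sizes and random data, for illustration only.
rng = np.random.default_rng(0)
T, D = 5, 16
frames = rng.standard_normal((T, D))
w_q, w_k, w_v = (rng.standard_normal((D, D)) * 0.1 for _ in range(3))

attended, weights = self_attention(frames, w_q, w_k, w_v)

# Assumed pooling step: average over time to get one clip-level vector
# that a word classifier could consume.
clip_feature = attended.mean(axis=0)
```

Each row of `weights` sums to 1, so every output frame is a weighted average of all frames in the sequence. This is the sense in which self-attention lets a model emphasize the frames most relevant to the spoken word.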