Paper Title
Masked Lip-Sync Prediction by Audio-Visual Contextual Exploitation in Transformers
Paper Authors
Paper Abstract
Previous studies have explored generating accurately lip-synced talking faces for arbitrary targets given audio conditions. However, most of them deform or generate the whole facial area, leading to non-realistic results. In this work, we delve into the formulation of altering only the mouth shapes of the target person. This requires masking a large percentage of the original image and seamlessly inpainting it with the aid of audio and reference frames. To this end, we propose the Audio-Visual Context-Aware Transformer (AV-CAT) framework, which produces accurate lip-sync with photo-realistic quality by predicting the masked mouth shapes. Our key insight is to thoroughly exploit the contextual information provided by the audio and visual modalities with carefully designed Transformers. Specifically, we propose a convolution-Transformer hybrid backbone and design an attention-based fusion strategy for filling the masked parts. It uniformly attends to the textural information of the unmasked regions and the reference frame. The semantic audio information is then incorporated to enhance the self-attention computation. Additionally, a refinement network with audio injection improves both image and lip-sync quality. Extensive experiments validate that our model can generate high-fidelity lip-synced results for arbitrary subjects.
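To make the fusion idea in the abstract concrete, below is a minimal, hypothetical PyTorch sketch (not the authors' released code) of an audio-conditioned attention block: tokens covering the masked mouth region attend jointly to unmasked-region tokens and reference-frame tokens, while a pooled audio embedding modulates the queries. All names (`AudioVisualFusionBlock`, `dim`, `n_heads`, the token shapes) are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of audio-conditioned attention fusion for masked mouth inpainting.
import torch
import torch.nn as nn


class AudioVisualFusionBlock(nn.Module):
    def __init__(self, dim: int = 256, n_heads: int = 8):
        super().__init__()
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        # Audio features are projected and added to the queries so that
        # attention over the visual context is conditioned on the driving audio.
        self.audio_proj = nn.Linear(dim, dim)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.LayerNorm(dim),
            nn.Linear(dim, dim * 4),
            nn.GELU(),
            nn.Linear(dim * 4, dim),
        )

    def forward(self, masked_tokens, unmasked_tokens, ref_tokens, audio_feat):
        # masked_tokens:   (B, N_m, dim) tokens covering the masked mouth area
        # unmasked_tokens: (B, N_u, dim) tokens from the unmasked facial region
        # ref_tokens:      (B, N_r, dim) tokens from the reference frame
        # audio_feat:      (B, dim)      pooled audio embedding for this frame
        q = self.norm_q(masked_tokens) + self.audio_proj(audio_feat).unsqueeze(1)
        # Attend uniformly over both sources of visual texture context.
        context = self.norm_kv(torch.cat([unmasked_tokens, ref_tokens], dim=1))
        fused, _ = self.attn(q, context, context)
        x = masked_tokens + fused
        return x + self.ffn(x)


if __name__ == "__main__":
    block = AudioVisualFusionBlock()
    out = block(
        torch.randn(2, 64, 256),   # masked mouth tokens
        torch.randn(2, 192, 256),  # unmasked region tokens
        torch.randn(2, 256, 256),  # reference frame tokens
        torch.randn(2, 256),       # audio embedding
    )
    print(out.shape)  # torch.Size([2, 64, 256])
```

In the full framework this kind of block would sit inside the convolution-Transformer hybrid backbone, with the refinement network and audio injection applied afterward; those stages are not shown here.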