Paper Title
Multi Modal Adaptive Normalization for Audio to Video Generation
Paper Authors
Paper Abstract
Speech-driven facial video generation is a complex problem due to its multi-modal nature, spanning the audio and video domains. The audio signal carries many underlying features such as expression, pitch, loudness, and prosody (speaking style), while the facial video exhibits large variability in head movement, eye blinks, lip synchronization, and the motion of various facial action units, all of which must remain temporally smooth. Synthesizing highly expressive facial videos from an audio input and a static image is still a challenging task for generative adversarial networks. In this paper, we propose a multi-modal adaptive normalization (MAN) based architecture to synthesize a talking-person video of arbitrary length from two inputs: an audio signal and a single image of the person. The architecture uses multi-modal adaptive normalization, a keypoint heatmap predictor, an optical flow predictor, and class activation map [58] based layers to learn the movements of expressive facial components, and hence generates a highly expressive talking-head video of the given person. The multi-modal adaptive normalization draws on various audio and video features, such as the Mel spectrogram, pitch, and energy of the audio signal together with the predicted keypoint heatmap/optical flow and the single input image, to learn the respective affine parameters used to generate highly expressive video. Experimental evaluation demonstrates the superior performance of the proposed method compared to Realistic Speech-Driven Facial Animation with GANs (RSDGAN) [53], Speech2Vid [10], and other approaches on multiple quantitative metrics, including SSIM (structural similarity index), PSNR (peak signal-to-noise ratio), CPBD (image sharpness), WER (word error rate), blinks/sec, and LMD (landmark distance). Further, qualitative evaluation and online Turing tests demonstrate the efficacy of our approach.
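To make the core mechanism concrete, the sketch below illustrates the general idea of multi-modal adaptive normalization as described in the abstract: the affine parameters of a normalized generator feature map are predicted from a fused multi-modal embedding (e.g., encoded Mel spectrogram, pitch, and energy from the audio, plus the predicted keypoint heatmap/optical flow and identity image features). This is a minimal sketch, not the authors' reference implementation; all module names, dimensions, and the exact fusion strategy are assumptions.

```python
# Minimal sketch of a multi-modal adaptive normalization layer.
# Assumption: affine parameters (gamma, beta) are regressed from a fused
# audio/visual embedding and applied to instance-normalized activations.
import torch
import torch.nn as nn


class MultiModalAdaptiveNorm(nn.Module):
    def __init__(self, num_channels: int, modal_dim: int):
        super().__init__()
        # Parameter-free normalization of the generator feature map.
        self.norm = nn.InstanceNorm2d(num_channels, affine=False)
        # Maps the fused multi-modal embedding to per-channel scale and shift.
        self.to_affine = nn.Linear(modal_dim, 2 * num_channels)

    def forward(self, feat: torch.Tensor, modal_embed: torch.Tensor) -> torch.Tensor:
        # feat:        (B, C, H, W) generator activations
        # modal_embed: (B, modal_dim) fused audio/visual features
        gamma, beta = self.to_affine(modal_embed).chunk(2, dim=1)
        gamma = gamma.unsqueeze(-1).unsqueeze(-1)  # (B, C, 1, 1)
        beta = beta.unsqueeze(-1).unsqueeze(-1)
        return (1 + gamma) * self.norm(feat) + beta


# Usage sketch: concatenate per-frame audio features (Mel spectrogram, pitch,
# energy) with visual features (keypoint heatmap / optical flow, identity
# image encoding) and modulate a generator feature map.
if __name__ == "__main__":
    audio_feat = torch.randn(4, 128)   # hypothetical encoded Mel/pitch/energy
    visual_feat = torch.randn(4, 128)  # hypothetical encoded heatmap/flow/image
    modal_embed = torch.cat([audio_feat, visual_feat], dim=1)  # (4, 256)

    man = MultiModalAdaptiveNorm(num_channels=64, modal_dim=256)
    feat = torch.randn(4, 64, 32, 32)
    out = man(feat, modal_embed)       # same shape as feat: (4, 64, 32, 32)
```

The design choice mirrors conditional normalization schemes (e.g., AdaIN/SPADE-style modulation), except that the conditioning signal is multi-modal rather than derived from a single source.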