Paper Title

Attention-based Residual Speech Portrait Model for Speech to Face Generation

Paper Authors

Jianrong Wang, Xiaosheng Hu, Li Liu, Wei Liu, Mei Yu, Tianyi Xu

Paper Abstract

Given a speaker's speech, it is interesting to ask whether it is possible to generate this speaker's face. One main challenge in this task is to alleviate the natural mismatch between face and speech. To this end, in this paper, we propose a novel Attention-based Residual Speech Portrait Model (AR-SPM) by introducing the idea of the residual into a hybrid encoder-decoder architecture, where face prior features are merged with the output of the speech encoder to form the final face feature. In particular, we innovatively establish a tri-item loss function, which is a weighted linear combination of the L2-norm, L1-norm, and negative cosine loss, to train our model by comparing the final face feature with the true face feature. Evaluation on the AVSpeech dataset shows that our proposed model accelerates the convergence of training, outperforms the state-of-the-art in terms of the quality of the generated face, and achieves superior recognition accuracy of gender and age compared with the ground truth.
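To make the residual merge and the tri-item loss concrete, here is a minimal PyTorch sketch. The function names, the equal default weights, the feature dimension, and the mean reduction over the batch are all illustrative assumptions; the abstract only specifies that a face prior feature is merged with the speech encoder output, and that the loss is a weighted linear combination of an L2-norm, an L1-norm, and a negative cosine term comparing the final and true face features.

```python
import torch
import torch.nn.functional as F


def residual_face_feature(face_prior: torch.Tensor,
                          speech_encoding: torch.Tensor) -> torch.Tensor:
    """Residual merge: treat the speech encoder output as an offset that
    refines a face prior feature. This is one reading of the paper's
    'residual' idea; the actual model may apply attention weighting."""
    return face_prior + speech_encoding


def tri_item_loss(pred_feat: torch.Tensor,
                  true_feat: torch.Tensor,
                  w_l2: float = 1.0,     # hypothetical weight, not from the paper
                  w_l1: float = 1.0,     # hypothetical weight, not from the paper
                  w_cos: float = 1.0) -> torch.Tensor:
    """Weighted linear combination of L2-norm, L1-norm, and negative cosine
    losses between the final (predicted) and true face features."""
    l2 = torch.linalg.vector_norm(pred_feat - true_feat, ord=2, dim=-1).mean()
    l1 = torch.linalg.vector_norm(pred_feat - true_feat, ord=1, dim=-1).mean()
    neg_cos = -F.cosine_similarity(pred_feat, true_feat, dim=-1).mean()
    return w_l2 * l2 + w_l1 * l1 + w_cos * neg_cos


# Example with a batch of 8 face features; the 512-d size is illustrative.
pred = residual_face_feature(torch.randn(8, 512), torch.randn(8, 512))
loss = tri_item_loss(pred, torch.randn(8, 512))
```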
