Paper Title

S2IGAN: Speech-to-Image Generation via Adversarial Learning

Paper Authors

Xinsheng Wang, Tingting Qiao, Jihua Zhu, Alan Hanjalic, Odette Scharenborg

Paper Abstract

An estimated half of the world's languages do not have a written form, making it impossible for these languages to benefit from any existing text-based technologies. In this paper, a speech-to-image generation (S2IG) framework is proposed which translates speech descriptions to photo-realistic images without using any text information, thus allowing unwritten languages to potentially benefit from this technology. The proposed S2IG framework, named S2IGAN, consists of a speech embedding network (SEN) and a relation-supervised densely-stacked generative model (RDG). SEN learns the speech embedding with the supervision of the corresponding visual information. Conditioned on the speech embedding produced by SEN, the proposed RDG synthesizes images that are semantically consistent with the corresponding speech descriptions. Extensive experiments on two public benchmark datasets CUB and Oxford-102 demonstrate the effectiveness of the proposed S2IGAN on synthesizing high-quality and semantically-consistent images from the speech signal, yielding a good performance and a solid baseline for the S2IG task.
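The abstract describes a two-stage pipeline: a speech embedding network (SEN) that encodes a spoken description into a fixed-length embedding, and a relation-supervised densely-stacked generative model (RDG) that synthesizes an image conditioned on that embedding. The sketch below is only a minimal illustration of this speech-to-embedding-to-image conditioning flow; the module names, the GRU encoder, the single-stage generator, and all layer sizes are illustrative assumptions, not the actual S2IGAN architecture, and the relation supervision, densely-stacked refinement, and adversarial training losses are omitted.

```python
import torch
import torch.nn as nn

class SpeechEncoder(nn.Module):
    """Toy stand-in for SEN: encodes a mel-spectrogram sequence into a
    fixed-length speech embedding (all hyperparameters are illustrative)."""
    def __init__(self, n_mels=40, hidden=256, embed_dim=128):
        super().__init__()
        self.rnn = nn.GRU(n_mels, hidden, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, embed_dim)

    def forward(self, mel):                  # mel: (B, T, n_mels)
        _, h = self.rnn(mel)                 # h: (2, B, hidden)
        h = torch.cat([h[0], h[1]], dim=1)   # (B, 2 * hidden)
        return self.proj(h)                  # (B, embed_dim)

class ConditionalGenerator(nn.Module):
    """Toy stand-in for a single RDG generator stage: maps noise plus the
    speech embedding to a low-resolution RGB image."""
    def __init__(self, noise_dim=100, embed_dim=128, ngf=64):
        super().__init__()
        self.fc = nn.Linear(noise_dim + embed_dim, ngf * 4 * 4 * 4)
        self.upsample = nn.Sequential(
            nn.ConvTranspose2d(ngf * 4, ngf * 2, 4, 2, 1), nn.ReLU(True),  # 4 -> 8
            nn.ConvTranspose2d(ngf * 2, ngf, 4, 2, 1), nn.ReLU(True),      # 8 -> 16
            nn.ConvTranspose2d(ngf, 3, 4, 2, 1), nn.Tanh(),                # 16 -> 32
        )

    def forward(self, z, speech_emb):
        x = self.fc(torch.cat([z, speech_emb], dim=1))
        x = x.view(x.size(0), -1, 4, 4)
        return self.upsample(x)              # (B, 3, 32, 32)

# Usage: one forward pass of the speech -> embedding -> image pipeline.
enc, gen = SpeechEncoder(), ConditionalGenerator()
mel = torch.randn(2, 200, 40)                # batch of 2 spectrograms, 200 frames each
z = torch.randn(2, 100)                      # noise vectors
images = gen(z, enc(mel))
print(images.shape)                          # torch.Size([2, 3, 32, 32])
```

In the paper's setting, several such generator stages would be stacked at increasing resolutions and trained adversarially together with discriminators and the relation supervisor; the sketch shows only the conditioning interface between the two components.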
