Paper Title
Exploiting Deep Sentential Context for Expressive End-to-End Speech Synthesis
Paper Authors
Paper Abstract
Attention-based seq2seq text-to-speech systems, especially those using self-attention networks (SANs), have achieved state-of-the-art performance. However, an expressive corpus with rich prosody remains challenging to model because 1) prosodic aspects, which span different sentential granularities and largely determine acoustic expressiveness, are difficult to quantize and label, and 2) the current seq2seq framework extracts prosodic information solely from the text encoder, which easily collapses to an averaged expression over expressive content. In this paper, we propose a context extractor, built upon a SAN-based text encoder, to fully exploit the sentential context of an expressive corpus for seq2seq-based TTS. Our context extractor first collects prosody-related sentential context information from different SAN layers and then aggregates it to learn a comprehensive sentence representation that enhances the expressiveness of the final generated speech. Specifically, we investigate two methods of context aggregation: 1) direct aggregation, which simply concatenates the outputs of different SAN layers, and 2) weighted aggregation, which uses multi-head attention to automatically learn the contribution of each SAN layer. Experiments on two expressive corpora show that our approach produces more natural speech with much richer prosodic variation, and that weighted aggregation is superior in modeling expressivity.
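The abstract does not give implementation details, so the following is a minimal PyTorch sketch of the two aggregation strategies it describes. The class name `ContextAggregator`, the mean-pooling of each SAN layer's output over time, and the learned attention query are assumptions for illustration, not the paper's actual design.

```python
import torch
import torch.nn as nn

class ContextAggregator(nn.Module):
    """Illustrative sketch of the two context-aggregation strategies.

    Assumed shapes: each SAN layer's output is (batch, time, d_model);
    a per-layer sentence context is obtained by mean-pooling over time.
    """

    def __init__(self, num_layers: int, d_model: int,
                 num_heads: int = 4, mode: str = "weighted"):
        super().__init__()
        self.mode = mode
        if mode == "direct":
            # Direct aggregation: concatenate per-layer contexts, then
            # project back to the model dimension.
            self.proj = nn.Linear(num_layers * d_model, d_model)
        else:
            # Weighted aggregation: multi-head attention over the stack of
            # per-layer contexts; a learned query determines each layer's
            # contribution (an assumed realization of the paper's idea).
            self.query = nn.Parameter(torch.randn(1, 1, d_model))
            self.attn = nn.MultiheadAttention(d_model, num_heads,
                                              batch_first=True)

    def forward(self, layer_outputs):
        # layer_outputs: list of (batch, time, d_model), one per SAN layer.
        # Collapse the time axis to one context vector per layer.
        contexts = torch.stack([h.mean(dim=1) for h in layer_outputs], dim=1)
        # contexts: (batch, num_layers, d_model)
        if self.mode == "direct":
            batch = contexts.size(0)
            return self.proj(contexts.reshape(batch, -1))
        query = self.query.expand(contexts.size(0), -1, -1)
        fused, _ = self.attn(query, contexts, contexts)
        return fused.squeeze(1)  # (batch, d_model) sentence representation
```

For instance, with a six-layer encoder and 256-dimensional hidden states, `ContextAggregator(num_layers=6, d_model=256)` maps the list of per-layer outputs to a single (batch, 256) sentence vector that could then condition the decoder.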