Paper Title

GraphPB: Graphical Representations of Prosody Boundary in Speech Synthesis

Authors

Aolan Sun, Jianzong Wang, Ning Cheng, Huayi Peng, Zhen Zeng, Lingwei Kong, Jing Xiao

Abstract

This paper introduces GraphPB, a graphical representation of prosody boundaries for Chinese speech synthesis. The approach parses the semantic and syntactic relationships of the input sequence in the graph domain in order to improve prosody. The nodes of the graph embedding are formed by prosodic words, and the edges are formed by the other prosodic boundaries, namely the prosodic phrase boundary (PPH) and the intonation phrase boundary (IPH). Different graph neural networks (GNNs), such as the Gated Graph Neural Network (GGNN) and Graph Long Short-Term Memory (G-LSTM), are used as graph encoders to exploit the graphical prosody boundary information. A graph-to-sequence model is proposed, consisting of a graph encoder and an attentional decoder. Two techniques are proposed to embed sequential information into the graph-to-sequence text-to-speech model. Experimental results show that the proposed approach can encode the phonetic and prosodic rhythm of an utterance. The mean opinion score (MOS) of these GNN models is comparable to that of state-of-the-art sequence-to-sequence models, with better performance in terms of prosody. This provides an alternative approach to prosody modelling in end-to-end speech synthesis.
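The abstract states only that nodes are prosodic words and that edges come from PPH and IPH boundaries; it does not say how the edges are wired. As a rough illustration, the sketch below assumes one plausible reading: each prosodic phrase and each intonation phrase contributes one typed edge linking its first and last prosodic word. The function name `build_prosody_graph`, the boundary labels, and the edge rule are all hypothetical and are not taken from the paper.

```python
from typing import List, Tuple

# Hypothetical input encoding: each prosodic word is paired with the
# boundary that follows it. "PW" is a plain prosodic-word boundary,
# "PPH" a prosodic phrase boundary, "IPH" an intonation phrase boundary.
Token = Tuple[str, str]
Edge = Tuple[int, int, str]  # (source node index, target node index, edge type)


def build_prosody_graph(tokens: List[Token]) -> Tuple[List[str], List[Edge]]:
    """Build nodes (prosodic words) and typed edges (PPH / IPH spans).

    Assumption, not the paper's stated construction: a phrase spanning
    more than one prosodic word yields one edge from its first word to
    its last word, typed by the phrase level. An IPH boundary (and the
    end of the utterance) is assumed to close the current PPH as well.
    """
    nodes = [word for word, _ in tokens]
    edges: List[Edge] = []
    pph_start = 0  # index of the first word in the open prosodic phrase
    iph_start = 0  # index of the first word in the open intonation phrase
    last = len(tokens) - 1

    for i, (_, boundary) in enumerate(tokens):
        closes_iph = boundary == "IPH" or i == last
        closes_pph = boundary == "PPH" or closes_iph
        if closes_pph:
            if i > pph_start:  # skip single-word phrases (no self-loops)
                edges.append((pph_start, i, "PPH"))
            pph_start = i + 1
        if closes_iph:
            if i > iph_start:
                edges.append((iph_start, i, "IPH"))
            iph_start = i + 1
    return nodes, edges


if __name__ == "__main__":
    # Placeholder words; a real front end would supply annotated text.
    example = [("w0", "PW"), ("w1", "PPH"), ("w2", "PW"),
               ("w3", "IPH"), ("w4", "PPH"), ("w5", "IPH")]
    nodes, edges = build_prosody_graph(example)
    print(nodes)
    print(edges)
```

A typed edge list of this kind is the sort of input a graph encoder such as a GGNN could consume, with one set of propagation weights per edge type. Other wirings, for example fully connecting the words inside a phrase, would fit the abstract's description equally well.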
