Pitchtron: Towards audiobook generation from ordinary people's voices

Paper title

Pitchtron: Towards audiobook generation from ordinary people's voices

Authors

Jung, Sunghee; Kim, Hoirin

Abstract

In this paper, we explore prosody transfer for audiobook generation under a rather realistic condition, in which the training DB is plain audio, mostly from multiple ordinary people, and the reference audio given during inference comes from a professional and is richer in prosody than the training DB. Specifically, we explore transferring Korean dialects and emotive speech even though the training set is mostly composed of standard, neutral Korean. We found that under this setting, the original global style token (GST) method generates undesirable glitches in pitch, energy, and pause length. To deal with this issue, we propose two models, hard pitchtron and soft pitchtron, and release the toolkit and corpus that we have developed. Hard pitchtron uses pitch as input to the decoder, while soft pitchtron uses pitch as input to the prosody encoder. We verify the effectiveness of the proposed models with objective and subjective tests. The AXY score over GST is 2.01 for hard pitchtron and 1.14 for soft pitchtron.
