Probing Deep Speaker Embeddings for Speaker-related Tasks

Paper Title

Probing Deep Speaker Embeddings for Speaker-related Tasks

Authors

Zhao, Zifeng; Pan, Ding; Peng, Junyi; Gu, Rongzhi

Abstract

Deep speaker embeddings have shown promising results in speaker recognition, as well as in other speaker-related tasks. However, some issues remain underexplored, for instance, the information encoded in these representations and its influence on downstream tasks. Four deep speaker embeddings are studied in this paper, namely d-vector, x-vector, ResNetSE-34 and ECAPA-TDNN. Inspired by human voice mechanisms, we explored the information they may encode from the perspectives of identity, content and channel. Based on this, experiments were conducted on three categories of speaker-related tasks to further explore the impact of different deep embeddings: discriminative tasks (speaker verification and diarization), guiding tasks (target speaker detection and extraction) and regulating tasks (multi-speaker text-to-speech). Results show that all deep embeddings encode channel and content information in addition to speaker identity, but the extent varies, and their performance on speaker-related tasks can differ tremendously: ECAPA-TDNN is dominant in discriminative tasks and d-vector leads the guiding tasks, while the regulating task is less sensitive to the choice of speaker representation. These findings may benefit future research utilizing speaker embeddings.
