Paper Title

Graph-based Multi-View Fusion and Local Adaptation: Mitigating Within-Household Confusability for Speaker Identification

Paper Authors

Long Chen, Yixiong Meng, Venkatesh Ravichandran, Andreas Stolcke

Paper Abstract

Speaker identification (SID) in the household scenario (e.g., for smart speakers) is an important but challenging problem due to the limited number of labeled (enrollment) utterances, confusable voices, and demographic imbalances. Conventional speaker recognition systems generalize from a large random sample of speakers, causing the recognition to underperform for households drawn from specific cohorts or otherwise exhibiting high confusability. In this work, we propose a graph-based semi-supervised learning approach to improve household-level SID accuracy and robustness with locally adapted graph normalization and multi-signal fusion with multi-view graphs. Unlike other work on household SID, fairness, and signal fusion, this work focuses on speaker label inference (scoring) and provides a simple solution to realize household-specific adaptation and multi-signal fusion without tuning the embeddings or training a fusion network. Experiments on the VoxCeleb dataset demonstrate that our approach consistently improves the performance across households with different customer cohorts and degrees of confusability.
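
The abstract does not give implementation details, but the general recipe it describes, building per-signal affinity graphs over a household's enrollment and test utterances, fusing the views into a single graph, normalizing it with household-local statistics, and propagating the enrollment labels to score unlabeled utterances, can be illustrated with a short sketch. The code below is a generic label-propagation baseline, not the paper's exact method: the cosine affinities, uniform view averaging, symmetric per-household normalization, and names such as propagate_labels and alpha are all illustrative assumptions.

```python
# Minimal sketch of graph-based label propagation for household speaker ID.
# Assumptions (not from the paper): cosine-similarity affinities per view,
# uniform averaging for multi-view fusion, symmetric normalization computed
# only from the household's own utterances, and standard label propagation.
import numpy as np

def affinity(embeddings: np.ndarray) -> np.ndarray:
    """Cosine-similarity affinity matrix for one view (one signal type)."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = normed @ normed.T
    np.fill_diagonal(sim, 0.0)          # no self-loops
    return np.clip(sim, 0.0, None)      # keep non-negative edge weights

def propagate_labels(views, labels, n_speakers, alpha=0.9, n_iter=50):
    """Infer speaker labels for all utterances in one household.

    views      : list of (n_utts, dim) embedding matrices, one per signal
    labels     : length-n_utts array; speaker index for enrollment
                 utterances, -1 for unlabeled (test) utterances
    n_speakers : number of enrolled speakers in the household
    """
    # Multi-view fusion: average the per-view affinity graphs (illustrative choice).
    W = np.mean([affinity(v) for v in views], axis=0)

    # Household-local symmetric normalization S = D^{-1/2} W D^{-1/2},
    # computed only from this household's utterances.
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    S = d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]

    # One-hot seed matrix Y from the labeled (enrollment) utterances.
    n = len(labels)
    Y = np.zeros((n, n_speakers))
    for i, y in enumerate(labels):
        if y >= 0:
            Y[i, y] = 1.0

    # Label propagation: F <- alpha * S @ F + (1 - alpha) * Y.
    F = Y.copy()
    for _ in range(n_iter):
        F = alpha * (S @ F) + (1.0 - alpha) * Y
    return F.argmax(axis=1), F          # predicted speaker per utterance, scores
```

Because the graph, its normalization, and the fusion weights are built per household at scoring time, this kind of inference adapts locally without retraining the embedding extractor or a fusion network, which is the property the abstract emphasizes.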
