Speaker Recognition in the Wild

Paper title

Speaker Recognition in the Wild

Authors

Neeraj Chhimwal, Anirudh Gupta, Rishabh Gaur, Harveen Singh Chadha, Priyanshi Shah, Ankur Dhuriya, Vivek Raghavan

Abstract

In this paper, we propose a pipeline that finds the number of speakers, and the audio files belonging to each identified speaker, in a source of audio data where neither the number of speakers nor the speaker labels are known a priori. We used this approach as part of our data preparation pipeline for speech recognition in Indic languages (https://github.com/Open-Speech-EkStep/vakyansh-wav2vec2-experimentation). To evaluate the accuracy of the proposed pipeline, we introduce two metrics: Cluster Purity and Cluster Uniqueness. Cluster Purity quantifies how "pure" a cluster is. Cluster Uniqueness, on the other hand, quantifies what percentage of clusters belong to only a single dominant speaker. We discuss these metrics further in the metrics section of the paper. Because we developed this utility to help identify data by speaker ID before training an Automatic Speech Recognition (ASR) model, and because most of this data takes considerable effort to scrape, we also report that, on the chosen test set, 98% of the data is mapped to the top 80% of clusters. The top clusters are obtained by removing every cluster with fewer than a fixed number of utterances; we use a threshold of 30 to discard very small clusters.
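The abstract names the two metrics without giving their formulas, so the following Python sketch is only one plausible reading, not the authors' implementation. It assumes that Cluster Purity is the share of a cluster's utterances spoken by its most frequent (dominant) speaker, and that Cluster Uniqueness is the number of distinct dominant speakers divided by the number of clusters. The function name `cluster_metrics` and its parameters are hypothetical; only the threshold of 30 utterances comes from the abstract.

```python
from collections import Counter, defaultdict


def cluster_metrics(cluster_ids, speaker_ids, min_utterances=30):
    """Return (purity per cluster, uniqueness, fraction of data kept).

    cluster_ids: predicted cluster label of each utterance.
    speaker_ids: ground-truth speaker label of each utterance.
    The metric definitions are assumptions, not taken from the paper.
    """
    # Collect the true speaker label of every utterance, grouped by cluster.
    clusters = defaultdict(list)
    for cluster, speaker in zip(cluster_ids, speaker_ids):
        clusters[cluster].append(speaker)

    # Discard very small clusters (fewer than min_utterances utterances).
    kept = {c: s for c, s in clusters.items() if len(s) >= min_utterances}

    purity = {}
    dominant = {}
    for cluster, speakers in kept.items():
        speaker, count = Counter(speakers).most_common(1)[0]
        dominant[cluster] = speaker
        # Assumed purity: share of the cluster owned by its dominant speaker.
        purity[cluster] = count / len(speakers)

    # Assumed uniqueness: distinct dominant speakers per kept cluster.
    uniqueness = len(set(dominant.values())) / len(kept) if kept else 0.0

    # Share of all utterances that survive the small-cluster filter.
    total = len(cluster_ids)
    kept_fraction = sum(len(s) for s in kept.values()) / total if total else 0.0
    return purity, uniqueness, kept_fraction
```

Under this reading, uniqueness falls below 1.0 whenever two clusters share the same dominant speaker, which means one speaker has been split across clusters. The kept fraction corresponds to the kind of coverage figure the abstract reports for its test set.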
