Paper title
Speech2Phone: A Novel and Efficient Method for Training Speaker Recognition Models
Authors
Abstract
In this paper, we present an efficient method for training speaker recognition models on small or under-resourced datasets. The method requires less data than other state-of-the-art (SOTA) approaches, such as those based on the Angular Prototypical and GE2E loss functions, while achieving similar results. It does so by using knowledge of the reconstruction of a phoneme in the speaker's voice. For this purpose, we built a new dataset of 40 male speakers reading sentences in Portuguese, totaling approximately 3 hours. We compared the three best architectures trained with our method and selected the best one, which has a shallow architecture. We then compared this model with the SOTA method for the speaker recognition task: Fast ResNet-34 trained on approximately 2,000 hours of speech with the Angular Prototypical and GE2E loss functions. Three experiments were carried out with datasets in different languages. Our model achieved the best result in one of these experiments and the second-best result in the other two. This highlights the importance of our method, which proved to be a strong competitor to SOTA speaker recognition models while using 500 times less data and a simpler approach.
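The abstract names the core idea (reconstructing a phoneme in the speaker's voice) but does not give the network, the input features, or the loss. The following is only a minimal sketch of that idea under stated assumptions: a single-hidden-layer encoder-decoder in NumPy that maps flattened features of a speech segment to flattened features of a reference phoneme, with a mean-squared-error loss, and with the bottleneck activations treated as the speaker embedding. All sizes, names, and the choice of loss are hypothetical and are not taken from the paper.

```python
import numpy as np

# Illustrative sizes only; none of these values come from the paper.
N_INPUT = 40 * 50    # flattened features of a speech segment
N_EMBED = 32         # bottleneck size, treated here as the speaker embedding
N_TARGET = 40 * 10   # flattened features of the reference phoneme


def init_params(seed=0):
    """Create small random weights for a shallow encoder-decoder."""
    rng = np.random.default_rng(seed)
    return {
        "W_enc": rng.normal(0.0, 0.01, (N_INPUT, N_EMBED)),
        "b_enc": np.zeros(N_EMBED),
        "W_dec": rng.normal(0.0, 0.01, (N_EMBED, N_TARGET)),
        "b_dec": np.zeros(N_TARGET),
    }


def embed(params, speech):
    """Map a batch of speech features to bottleneck embeddings."""
    return np.tanh(speech @ params["W_enc"] + params["b_enc"])


def reconstruct(params, speech):
    """Predict the reference phoneme features from the speech features."""
    return embed(params, speech) @ params["W_dec"] + params["b_dec"]


def mse_loss(params, speech, phoneme):
    """Mean squared reconstruction error over the batch (assumed loss)."""
    return float(np.mean((reconstruct(params, speech) - phoneme) ** 2))


def train_step(params, speech, phoneme, lr=0.1):
    """One plain gradient-descent step on the reconstruction loss."""
    h = embed(params, speech)
    y = h @ params["W_dec"] + params["b_dec"]
    d_y = 2.0 * (y - phoneme) / y.size       # gradient of the mean squared error
    d_z = (d_y @ params["W_dec"].T) * (1.0 - h ** 2)  # back through tanh
    grads = {
        "W_dec": h.T @ d_y,
        "b_dec": d_y.sum(axis=0),
        "W_enc": speech.T @ d_z,
        "b_enc": d_z.sum(axis=0),
    }
    return {name: params[name] - lr * grads[name] for name in params}


# Toy usage with random stand-in data (not real speech features).
_rng = np.random.default_rng(1)
demo_speech = _rng.normal(size=(8, N_INPUT))
demo_phoneme = _rng.normal(size=(8, N_TARGET))
demo_params = init_params()
for _ in range(20):
    demo_params = train_step(demo_params, demo_speech, demo_phoneme)
print(mse_loss(demo_params, demo_speech, demo_phoneme))
```

In a sketch like this, the decoder is only needed during training; at recognition time one would keep `embed` and compare the resulting embeddings across utterances. How the paper actually performs that comparison is not stated in the abstract.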