Paper Title
Reliable Visualization for Deep Speaker Recognition
Authors
Abstract
Despite the impressive success of convolutional neural networks (CNNs) in speaker recognition, our understanding of their internal functions is still limited. A major obstacle is that some popular visualization tools, such as those that produce saliency maps, are difficult to apply. The reason is that speaker information does not show clear spatial patterns in the time-frequency space, which makes the visualization results hard to interpret and, in turn, makes the reliability of a visualization tool hard to confirm. In this paper, we conduct an extensive analysis of three popular visualization methods based on CAM (Grad-CAM, Score-CAM and Layer-CAM) to investigate their reliability for speaker recognition tasks. Experiments conducted on a state-of-the-art ResNet34SE model show that the Layer-CAM algorithm can produce reliable visualization, and thus can be used as a promising tool to explain CNN-based speaker models. The source code and examples are available on our project page: http://project.cslt.org/.
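For readers unfamiliar with the CAM family, the sketch below contrasts how Grad-CAM and Layer-CAM weight a convolutional feature map, following the general published definitions of the two methods. It is not taken from the paper's source code. The function names, the pure-Python nested-list layout `[channel][frequency][time]`, and the toy inputs in the usage note are assumptions made for illustration.

```python
def grad_cam(activations, gradients):
    """Grad-CAM saliency map (illustrative sketch, not the authors' code).

    activations, gradients: nested lists indexed [channel][freq][time],
    where gradients holds d(score)/d(activation) for the target speaker.
    """
    n_freq, n_time = len(activations[0]), len(activations[0][0])
    # One weight per channel: the gradient averaged over all time-frequency bins.
    weights = [sum(map(sum, g)) / (n_freq * n_time) for g in gradients]
    # Weighted sum of channels at each bin, followed by ReLU.
    return [[max(0.0, sum(w * a[f][t] for w, a in zip(weights, activations)))
             for t in range(n_time)] for f in range(n_freq)]


def layer_cam(activations, gradients):
    """Layer-CAM saliency map (illustrative sketch, not the authors' code)."""
    n_freq, n_time = len(activations[0]), len(activations[0][0])
    # One weight per channel and per bin: the positive part of the local gradient.
    return [[max(0.0, sum(max(0.0, g[f][t]) * a[f][t]
                          for g, a in zip(gradients, activations)))
             for t in range(n_time)] for f in range(n_freq)]
```

As a toy usage example, take two channels with all-ones activations, where the first channel has gradient +1 everywhere and the second has gradient -1 everywhere. The channel-level weights of Grad-CAM cancel and its map is all zeros, while Layer-CAM keeps the positively contributing channel and its map is all ones. This location-wise weighting is the main structural difference between the two methods.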