Paper title
Joint Speech Activity and Overlap Detection with Multi-Exit Architecture
Paper authors
Abstract
Overlapped speech detection (OSD) is critical for speech applications in multi-party conversation scenarios. Despite numerous research efforts and considerable progress, OSD remains an open challenge, and its overall performance is far from satisfactory compared with speech activity detection (VAD). Most prior research formulates OSD as a standard classification problem, labeling speech at the frame level with either a binary label (OSD) or a three-class label (joint VAD and OSD). In contrast to the mainstream, this study investigates the joint VAD and OSD task from a new perspective. In particular, we propose to extend the traditional classification network with a multi-exit architecture. Such an architecture gives our system the unique capability to identify classes using either low-level features from early exits or high-level features from the last exit. In addition, two training schemes, knowledge distillation and dense connection, are adopted to further boost system performance. Experimental results on benchmark datasets (AMI and DIHARD-III) validate the effectiveness and generality of our proposed system. Our ablations further reveal the complementary contributions of the proposed schemes. With an $F_1$ score of 0.792 on AMI and 0.625 on DIHARD-III, our proposed system not only outperforms several top-performing models on these datasets, but also surpasses the current state-of-the-art by large margins across both datasets. Besides the performance benefit, our proposed system offers appealing potential for quality-complexity trade-offs, which is highly desirable for efficient OSD deployment.
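To make the multi-exit idea concrete, the following is a minimal sketch of frame-level inference with early exits, written in plain Python. It is not the paper's implementation: the abstract does not state how an exit is selected, so the confidence-threshold rule, the class name `MultiExitClassifier`, the toy weights, and the label strings are all assumptions made for illustration. Knowledge distillation and dense connections, which the abstract describes as training schemes, are not shown.

```python
import math
from typing import List, Sequence, Tuple

# Three-class labels for joint VAD and OSD (label strings are illustrative).
LABELS = ("non-speech", "single-speaker", "overlap")

Layer = Tuple[Sequence[Sequence[float]], Sequence[float]]  # (weights, bias)


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def linear(layer: Layer, x: Sequence[float]) -> List[float]:
    """Apply y = W x + b for a (weights, bias) pair."""
    weights, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x: Sequence[float]) -> List[float]:
    return [max(0.0, v) for v in x]


class MultiExitClassifier:
    """Hypothetical backbone with one classifier head (exit) per block."""

    def __init__(self, blocks: List[Layer], exits: List[Layer], threshold: float = 0.9):
        assert len(blocks) == len(exits) and blocks
        self.blocks = blocks
        self.exits = exits
        self.threshold = threshold  # assumed exit rule: max class probability

    def predict(self, frame: Sequence[float]) -> Tuple[str, int]:
        """Return (label, depth of the exit that produced it) for one frame."""
        h = list(frame)
        last = len(self.blocks)
        for depth, (block, head) in enumerate(zip(self.blocks, self.exits), start=1):
            h = relu(linear(block, h))          # features at this depth
            probs = softmax(linear(head, h))    # this exit's class posterior
            best = max(range(len(probs)), key=probs.__getitem__)
            # Stop early when confident; the last exit always answers.
            if probs[best] >= self.threshold or depth == last:
                return LABELS[best], depth
        raise RuntimeError("unreachable: the last exit always returns")


# Toy, hand-set weights (not trained) with two blocks and two exits.
model = MultiExitClassifier(
    blocks=[([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]),
            ([[1.0, 1.0], [0.0, 0.0]], [0.0, 0.0])],
    exits=[([[5.0, 0.0], [0.0, 5.0], [0.0, 0.0]], [0.0, 0.0, 0.0]),
           ([[0.0, 0.0], [0.0, 0.0], [10.0, 0.0]], [0.0, 0.0, 0.0])],
    threshold=0.9,
)
easy_label, easy_depth = model.predict([2.0, 0.0])   # confident at the first exit
hard_label, hard_depth = model.predict([0.1, 0.1])   # falls through to the last exit
```

Under this assumed rule, lowering `threshold` lets more frames leave at shallow exits (less computation), while raising it pushes more frames to the last exit. That is one way the quality-complexity trade-off mentioned in the abstract could be exposed at inference time.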