Paper Title
Event-Independent Network for Polyphonic Sound Event Localization and Detection
Paper Authors
Paper Abstract
Polyphonic sound event localization and detection not only detects what sound events are happening but also localizes the corresponding sound sources. This series of tasks was first introduced in DCASE 2019 Task 3. In 2020, the sound event localization and detection task introduced additional challenges involving moving sound sources and overlapping-event cases, which include two events of the same type with two different direction-of-arrival (DoA) angles. In this paper, a novel event-independent network for polyphonic sound event localization and detection is proposed. Unlike the two-stage method we proposed for DCASE 2019 Task 3, this new network is fully end-to-end. Inputs to the network are first-order Ambisonics (FOA) time-domain signals, which are fed into a 1-D convolutional layer to extract acoustic features. The network then splits into two parallel branches: the first branch is for sound event detection (SED), and the second branch is for DoA estimation. The network produces three types of predictions: SED predictions, DoA predictions, and event activity detection (EAD) predictions, which combine the SED and DoA features for onset and offset estimation. All of these predictions have a two-track format, indicating that there are at most two overlapping events; within each track, at most one event can be active. This architecture introduces a track permutation problem. To address it, a frame-level permutation-invariant training method is used. Experimental results show that the proposed method can detect polyphonic sound events and their corresponding DoAs. Its performance on the Task 3 dataset is greatly improved compared with that of the baseline method.
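The frame-level permutation-invariant training idea described above can be sketched as follows. For each frame, the loss is evaluated under both possible assignments of the two predicted tracks to the two target tracks, and the lower one is kept, so training does not penalize the network for emitting events on "swapped" tracks. This is a minimal illustrative sketch using a squared-error frame loss and NumPy arrays; the function name, shapes, and the choice of per-frame loss are assumptions, not the paper's exact implementation.

```python
import numpy as np

def frame_level_pit_loss(pred, target):
    """Frame-level permutation-invariant training (PIT) loss for a
    two-track output. Shapes: (frames, tracks=2, feature_dims).
    Illustrative sketch only: a squared-error frame loss is assumed.
    """
    # Per-frame loss under the identity permutation (track 0->0, 1->1).
    loss_id = ((pred - target) ** 2).mean(axis=2).sum(axis=1)
    # Per-frame loss under the swapped permutation (track 0->1, 1->0);
    # target[:, ::-1, :] reverses the track axis.
    loss_sw = ((pred - target[:, ::-1, :]) ** 2).mean(axis=2).sum(axis=1)
    # Keep the lower of the two losses independently per frame,
    # then average over frames.
    return np.minimum(loss_id, loss_sw).mean()
```

Because the minimum is taken per frame rather than once per utterance, a prediction whose tracks are swapped only in some frames still incurs no extra penalty for those frames, which is the point of doing PIT at the frame level.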