Title
Unsupervised Learning of Audio Perception for Robotics Applications: Learning to Project Data to T-SNE/UMAP space
Authors
Abstract
Audio perception is key to solving a variety of problems, from acoustic scene analysis and music metadata extraction to recommendation, synthesis, and analysis. It can also potentially augment computers in performing tasks that humans do effortlessly in day-to-day activities. This paper builds on key ideas to develop a perception of touch sounds without access to any ground-truth data. We show how ideas from classical signal processing can be leveraged to obtain large amounts of data for any sound of interest with high precision. These sounds are then used, along with accompanying images, to map the sounds onto a clustered space of the images' latent representations. This approach not only allows us to learn semantic representations of the possible sounds of interest, but also associates different modalities with the learned distinctions. The model trained to map sounds to this clustered representation gives reasonable performance compared with expensive approaches that collect large amounts of human-annotated data. Such approaches can be used to build a state-of-the-art perceptual model for any sound of interest that can be described with a few signal-processing features. Daisy-chaining high-precision sound event detectors built from signal processing with neural architectures and high-dimensional clustering of unlabelled data is a powerful idea that can be explored in a variety of ways in the future.
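To make the "classical signal processing for high-precision data collection" idea concrete, here is a minimal sketch of one plausible detector: a short-time energy threshold that flags impulsive "touch" events in an audio stream. The function name, frame sizes, and threshold are illustrative assumptions, not the paper's actual features.

```python
import numpy as np

def detect_impacts(signal, frame_len=1024, hop=512, threshold_db=-20.0):
    """Flag frames whose short-time energy is within threshold_db of the
    loudest frame -- a crude, high-precision detector for impulsive sounds.
    (Illustrative stand-in for the paper's signal-processing features.)"""
    energies_db = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        energy = np.sum(frame ** 2) + 1e-12  # avoid log of zero
        energies_db.append(10.0 * np.log10(energy))
    energies_db = np.array(energies_db)
    # keep only frames close in energy to the loudest one
    hits = np.where(energies_db - energies_db.max() > threshold_db)[0]
    return hits * hop  # sample offsets of detected frames

# Synthetic example: one second of silence with a short noise burst ("tap").
sr = 16000
sig = np.zeros(sr)
rng = np.random.default_rng(0)
sig[8000:8256] = rng.standard_normal(256)
onsets = detect_impacts(sig)  # offsets cluster around the burst at 8000
```

Favoring precision over recall is the point: a strict threshold misses quiet events but the detections it does emit are reliable enough to serve as automatically mined "ground truth" for the downstream audio-to-cluster model.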