Paper Title

Late multimodal fusion for image and audio music transcription

Paper Authors

María Alfaro-Contreras, Jose J. Valero-Mas, José M. Iñesta, Jorge Calvo-Zaragoza

Abstract

Music transcription, which deals with the conversion of music sources into a structured digital format, is a key problem for Music Information Retrieval (MIR). When addressing this challenge in computational terms, the MIR community follows two lines of research: music documents, which is the case of Optical Music Recognition (OMR), or audio recordings, which is the case of Automatic Music Transcription (AMT). The different nature of the aforementioned input data has conditioned these fields to develop modality-specific frameworks. However, their recent definition in terms of sequence labeling tasks leads to a common output representation, which enables research on a combined paradigm. In this respect, multimodal image and audio music transcription comprises the challenge of effectively combining the information conveyed by image and audio modalities. In this work, we explore this question at a late-fusion level: we study four combination approaches in order to merge, for the first time, the hypotheses regarding end-to-end OMR and AMT systems in a lattice-based search space. The results obtained for a series of performance scenarios -- in which the corresponding single-modality models yield different error rates -- showed interesting benefits of these approaches. In addition, two of the four strategies considered significantly improve the corresponding unimodal standard recognition frameworks.
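The late-fusion idea summarized above can be illustrated with a minimal sketch. The snippet below is a toy example under explicit assumptions, not the paper's lattice-based combination of hypotheses: it supposes that the end-to-end OMR and AMT models emit frame-level posteriors over a shared symbol vocabulary (with a CTC blank at index 0) and fuses them by a weighted average followed by greedy CTC decoding. The fusion weight `w`, the `BLANK` index, and the function names are all illustrative.

```python
import numpy as np

# Minimal late-fusion toy (NOT the paper's lattice-based method): both models
# are assumed to emit per-frame posteriors over the SAME symbol vocabulary,
# with the CTC blank at index 0 and the same number of frames.

BLANK = 0  # assumed index of the CTC blank symbol


def ctc_greedy_decode(posteriors: np.ndarray) -> list[int]:
    """Pick the best symbol per frame, collapse repeats, and drop blanks."""
    best = posteriors.argmax(axis=1)
    decoded, prev = [], None
    for s in best:
        if s != prev and s != BLANK:
            decoded.append(int(s))
        prev = s
    return decoded


def late_fuse(omr_post: np.ndarray, amt_post: np.ndarray, w: float = 0.5) -> list[int]:
    """Weighted average of the two posteriorgrams, then a single decoding pass."""
    fused = w * omr_post + (1.0 - w) * amt_post
    return ctc_greedy_decode(fused)


# Toy usage: 6 frames, 4 symbols (index 0 = blank), random posteriors.
rng = np.random.default_rng(0)
omr = rng.dirichlet(np.ones(4), size=6)
amt = rng.dirichlet(np.ones(4), size=6)
print(late_fuse(omr, amt, w=0.6))
```

The posterior average shown here is only the simplest possible baseline; the paper instead studies four combination strategies that merge the OMR and AMT hypotheses within a lattice-based search space.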
