Paper Title

Audio-based Near-Duplicate Video Retrieval with Audio Similarity Learning

Authors

Pavlos Avgoustinakis, Giorgos Kordopatis-Zilos, Symeon Papadopoulos, Andreas L. Symeonidis, Ioannis Kompatsiaris

Abstract

In this work, we address the problem of audio-based near-duplicate video retrieval. We propose the Audio Similarity Learning (AuSiL) approach, which effectively captures temporal patterns of audio similarity between video pairs. For robust similarity calculation between two videos, we first extract representative audio-based video descriptors by leveraging transfer learning from a Convolutional Neural Network (CNN) trained on a large-scale dataset of audio events, and then we compute the similarity matrix derived from the pairwise similarities of these descriptors. The similarity matrix is subsequently fed to a CNN that captures the temporal structures present in its content. We train our network following a triplet generation process and optimize the triplet loss function. To evaluate the effectiveness of the proposed approach, we manually annotated two publicly available video datasets based on the audio duplicity between their videos. The proposed approach achieves very competitive results compared to three state-of-the-art methods. Also, unlike the competing methods, it is very robust to the retrieval of audio duplicates generated with speed transformations.
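
To make the described pipeline concrete, below is a minimal PyTorch sketch of the two core steps the abstract outlines: computing the pairwise similarity matrix between two videos' per-segment audio descriptors, and scoring that matrix with a small CNN trained under a triplet loss. All names here (`similarity_matrix`, `SimilarityCNN`, the layer sizes, the 128-d descriptors) are illustrative assumptions, not the authors' actual AuSiL implementation.

```python
import torch
import torch.nn.functional as F


def similarity_matrix(desc_q: torch.Tensor, desc_t: torch.Tensor) -> torch.Tensor:
    """Pairwise cosine similarities between the per-segment audio descriptors
    of a query video (N x D) and a candidate video (M x D)."""
    q = F.normalize(desc_q, dim=1)
    t = F.normalize(desc_t, dim=1)
    return q @ t.T  # (N, M) similarity matrix


class SimilarityCNN(torch.nn.Module):
    """Treats the (N, M) similarity matrix as a one-channel image and scores
    the temporal alignment pattern it contains (hypothetical architecture)."""

    def __init__(self):
        super().__init__()
        self.conv = torch.nn.Sequential(
            torch.nn.Conv2d(1, 32, kernel_size=3, padding=1),
            torch.nn.ReLU(),
            torch.nn.Conv2d(32, 1, kernel_size=3, padding=1),
        )

    def forward(self, sim: torch.Tensor) -> torch.Tensor:
        x = self.conv(sim[None, None])  # add batch/channel dims: (1, 1, N, M)
        return x.mean()                 # scalar video-level similarity score


def triplet_loss(sim_ap: torch.Tensor, sim_an: torch.Tensor, margin: float = 1.0):
    """Hinge on similarities: the anchor-positive score should exceed the
    anchor-negative score by at least the margin."""
    return torch.clamp(sim_an - sim_ap + margin, min=0.0)


# Usage sketch; real descriptors would come from the pretrained audio-event CNN.
model = SimilarityCNN()
anchor, positive, negative = (torch.randn(20, 128) for _ in range(3))
loss = triplet_loss(model(similarity_matrix(anchor, positive)),
                    model(similarity_matrix(anchor, negative)))
loss.backward()
```

Note that this triplet loss operates on similarities rather than distances, so the hinge pushes the anchor-positive score above the anchor-negative score by the margin, which matches the ranking objective the abstract describes.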
