Title
Video Captioning Using Weak Annotation
Authors
Abstract
Video captioning has shown impressive progress in recent years. One key reason for the performance improvements achieved by existing methods lies in massive paired video-sentence data, but collecting such strong annotation, i.e., high-quality sentences, is time-consuming and laborious. In fact, there now exists an enormous number of videos with weak annotation that only contains semantic concepts such as actions and objects. In this paper, we investigate using weak annotation instead of strong annotation to train a video captioning model. To this end, we propose a progressive visual reasoning method that progressively generates fine sentences from weak annotations by inferring more semantic concepts and their dependency relationships for video captioning. To model concept relationships, we use dependency trees that are spanned by exploiting external knowledge from large sentence corpora. By traversing the dependency trees, sentences are generated to train the captioning model. Accordingly, we develop an iterative refinement algorithm that refines sentences via spanning dependency trees and fine-tunes the captioning model using the refined sentences, in an alternating training manner. Experimental results demonstrate that our method using weak annotation is highly competitive with state-of-the-art methods using strong annotation.
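To make the alternating procedure concrete, the following Python sketch outlines one plausible reading of the refinement loop described in the abstract: refine pseudo-sentences by spanning and traversing dependency trees over the concept sets, then fine-tune the captioning model on the refined sentences. This is a minimal sketch under stated assumptions; all helper functions (infer_concepts, span_dependency_tree, traverse_to_sentence, fine_tune) are hypothetical placeholders, not the paper's actual implementation.

    # Hypothetical sketch of the alternating refinement loop; every helper
    # below is an illustrative placeholder, not the paper's actual code.
    from typing import Dict, List, Tuple

    def infer_concepts(video_id: str, concepts: List[str]) -> List[str]:
        # Placeholder: a real system would run the current captioning model
        # on the video to propose additional actions and objects.
        return concepts

    def span_dependency_tree(concepts: List[str]) -> List[Tuple[str, str]]:
        # Placeholder: a real system would attach each concept to a head word
        # using dependency statistics mined from a large sentence corpus.
        # Here we simply chain adjacent concepts into (head, dependent) pairs.
        return list(zip(concepts, concepts[1:]))

    def traverse_to_sentence(tree: List[Tuple[str, str]]) -> str:
        # Placeholder linearization: a real system would order words by a
        # learned traversal of the dependency tree.
        if not tree:
            return ""
        return " ".join([head for head, _ in tree] + [tree[-1][1]])

    def fine_tune(model, captions: Dict[str, str]):
        # Placeholder: fine-tune the captioning model on the refined
        # pseudo-sentences (training loop omitted).
        return model

    def iterative_refinement(weak_annotation: Dict[str, List[str]],
                             model=None, rounds: int = 3):
        """Alternate sentence refinement and captioning-model fine-tuning."""
        refined: Dict[str, str] = {}
        for _ in range(rounds):
            for vid, concepts in weak_annotation.items():
                concepts = infer_concepts(vid, concepts)
                tree = span_dependency_tree(concepts)
                refined[vid] = traverse_to_sentence(tree)
            model = fine_tune(model, refined)
        return model, refined

For example, iterative_refinement({"v1": ["man", "ride", "bike"]}) would, in this toy version, chain the concepts into the pseudo-sentence "man ride bike" and fine-tune on it each round; the actual method would instead grow richer trees round by round as more concepts are inferred.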