MIMO Is All You Need: A Strong Multi-In-Multi-Out Baseline for Video Prediction

Paper Title

MIMO Is All You Need: A Strong Multi-In-Multi-Out Baseline for Video Prediction

Authors

Shuliang Ning, Mengcheng Lan, Yanran Li, Chaofeng Chen, Qian Chen, Xunlai Chen, Xiaoguang Han, Shuguang Cui

Abstract

Most existing approaches to video prediction build their models on a Single-In-Single-Out (SISO) architecture, which takes the current frame as input and predicts the next frame recursively. This design often suffers severe performance degradation when extrapolating further into the future, which limits the practical use of the prediction model. Alternatively, a Multi-In-Multi-Out (MIMO) architecture that outputs all future frames in one shot naturally breaks the recursion and therefore prevents error accumulation. However, only a few MIMO models for video prediction have been proposed, and these earlier models achieve inferior performance. The real strength of the MIMO model in this area has gone largely unnoticed and remains under-explored. Motivated by this, we conduct a comprehensive investigation in this paper to explore how far a simple MIMO architecture can go. Surprisingly, our empirical studies reveal that a simple MIMO model can outperform state-of-the-art work by a much larger margin than expected, especially in dealing with long-term error accumulation. After exploring a number of approaches and designs, we propose a new MIMO architecture, named MIMO-VP, which extends the pure Transformer with local spatio-temporal blocks and a new multi-output decoder, to establish a new standard in video prediction. We evaluate our model on four highly competitive benchmarks (Moving MNIST, Human3.6M, Weather, KITTI). Extensive experiments show that our model ranks first on all the benchmarks with remarkable performance gains and surpasses the best SISO model in efficiency as well as in quantitative and qualitative results. We believe our model can serve as a new baseline to facilitate future research on video prediction tasks. The code will be released.
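The contrast between recursive SISO prediction and one-shot MIMO prediction can be illustrated with a toy sketch. This is not the MIMO-VP model: frames are reduced to single floats, and `toy_step`, `toy_multi`, and `BIAS` are hypothetical stand-ins chosen only to show how a fixed per-call error compounds when outputs are fed back as inputs. The assumption that the one-shot model makes the same small error on every frame is a simplification for illustration.

```python
from typing import Callable, List

Frame = float  # toy stand-in for an image tensor


def siso_rollout(step: Callable[[Frame], Frame],
                 last_frame: Frame, horizon: int) -> List[Frame]:
    """Recursive SISO prediction: each output is fed back as the next input."""
    preds: List[Frame] = []
    current = last_frame
    for _ in range(horizon):
        current = step(current)  # errors in `current` carry into the next call
        preds.append(current)
    return preds


def mimo_predict(model: Callable[[List[Frame], int], List[Frame]],
                 context: List[Frame], horizon: int) -> List[Frame]:
    """One-shot MIMO prediction: a single call returns every future frame."""
    preds = model(context, horizon)
    if len(preds) != horizon:
        raise ValueError("model must return exactly `horizon` frames")
    return preds


# Toy world (assumption): the true frame value grows by 1.0 per time step.
BIAS = 0.1  # hypothetical error made by a model on each call


def toy_step(frame: Frame) -> Frame:
    # Hypothetical SISO model: next frame, off by BIAS.
    return frame + 1.0 + BIAS


def toy_multi(context: List[Frame], horizon: int) -> List[Frame]:
    # Hypothetical MIMO model: every future frame is predicted directly
    # from the observed context, each off by BIAS.
    return [context[-1] + k + BIAS for k in range(1, horizon + 1)]


context = [0.0, 1.0, 2.0]
horizon = 5
truth = [context[-1] + k for k in range(1, horizon + 1)]

siso_err = [abs(p - t) for p, t in
            zip(siso_rollout(toy_step, context[-1], horizon), truth)]
mimo_err = [abs(p - t) for p, t in
            zip(mimo_predict(toy_multi, context, horizon), truth)]

print("SISO per-frame error:", [round(e, 2) for e in siso_err])
print("MIMO per-frame error:", [round(e, 2) for e in mimo_err])
```

In this toy setting the SISO error grows with each step of the rollout, while the MIMO error stays at the per-call bias, which mirrors the error-accumulation argument made in the abstract.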
