Paper Title
Models, Pixels, and Rewards: Evaluating Design Trade-offs in Visual Model-Based Reinforcement Learning
Paper Authors
Paper Abstract
Model-based reinforcement learning (MBRL) methods have shown strong sample efficiency and performance across a variety of tasks, including when faced with high-dimensional visual observations. These methods learn to predict the environment dynamics and expected reward from interaction and use this predictive model to plan and perform the task. However, MBRL methods vary in their fundamental design choices, and there is no strong consensus in the literature on how these design decisions affect performance. In this paper, we study a number of design decisions for the predictive model in visual MBRL algorithms, focusing specifically on methods that use a predictive model for planning. We find that a range of design decisions that are often considered crucial, such as the use of latent spaces, have little effect on task performance. A big exception to this finding is that predicting future observations (i.e., images) leads to significant task performance improvement compared to only predicting rewards. We also empirically find that image prediction accuracy, somewhat surprisingly, correlates more strongly with downstream task performance than reward prediction accuracy. We show how this phenomenon is related to exploration and how some of the lower-scoring models on standard benchmarks (that require exploration) will perform the same as the best-performing models when trained on the same training data. Simultaneously, in the absence of exploration, models that fit the data better usually perform better on the downstream task as well, but surprisingly, these are often not the same models that perform the best when learning and exploring from scratch. These findings suggest that performance and exploration place important and potentially contradictory requirements on the model.
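The abstract describes the generic MBRL planning loop: a learned model predicts future states and rewards, and the agent plans by searching over action sequences under that model. Below is a minimal, hedged sketch of such a planner using simple random shooting with model-predictive control. The `model` object, its `predict(state, action) -> (next_state, reward)` interface, and the function `plan_action` are hypothetical stand-ins for illustration only; the paper's actual planner and model interface may differ.

```python
import numpy as np

def plan_action(model, state, action_dim, horizon=12, n_candidates=500, rng=None):
    """Random-shooting planner: sample candidate action sequences, roll them
    out through the learned predictive model, and return the first action of
    the highest-return sequence (MPC-style replanning each step).

    `model.predict(state, action)` is an assumed interface returning the
    predicted next state (image or latent) and the predicted reward.
    """
    if rng is None:
        rng = np.random.default_rng()
    # Candidate action sequences, each of shape (horizon, action_dim).
    candidates = rng.uniform(-1.0, 1.0, size=(n_candidates, horizon, action_dim))
    returns = np.zeros(n_candidates)
    for i, actions in enumerate(candidates):
        s, total = state, 0.0
        for a in actions:
            # The learned model predicts the next state and expected reward;
            # whether it predicts images, latents, or only rewards is exactly
            # the kind of design choice the paper evaluates.
            s, r = model.predict(s, a)
            total += r
        returns[i] = total
    best = int(np.argmax(returns))
    return candidates[best, 0]  # execute only the first planned action
```

A sketch like this makes the abstract's distinction concrete: a reward-only model would return a dummy next state from `predict`, whereas an observation-predicting model would return the predicted image (or latent) that subsequent predictions condition on.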