Paper Title
RAMBO-RL: Robust Adversarial Model-Based Offline Reinforcement Learning
Paper Authors
Paper Abstract
Offline reinforcement learning (RL) aims to find performant policies from logged data without further environment interaction. Model-based algorithms, which learn a model of the environment from the dataset and perform conservative policy optimisation within that model, have emerged as a promising approach to this problem. In this work, we present Robust Adversarial Model-Based Offline RL (RAMBO), a novel approach to model-based offline RL. We formulate the problem as a two-player zero-sum game against an adversarial environment model. The model is trained to minimise the value function while still accurately predicting the transitions in the dataset, forcing the policy to act conservatively in areas not covered by the dataset. To approximately solve the two-player game, we alternate between optimising the policy and adversarially optimising the model. The problem formulation that we address is theoretically grounded, resulting in a probably approximately correct (PAC) performance guarantee and a pessimistic value function which lower bounds the value function in the true environment. We evaluate our approach on widely studied offline RL benchmarks, and demonstrate that it outperforms existing state-of-the-art baselines.
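The abstract describes a max-min formulation, roughly max over the policy of min over an adversarial model of the policy's value, with the model constrained to remain accurate on the logged transitions, solved by alternating policy updates and adversarial model updates. The PyTorch-style sketch below illustrates what the adversarial model objective described above could look like; the function names, batch layout, the reparameterised value term, and the weight lam are illustrative assumptions based only on the abstract, not the paper's actual implementation or gradient estimator.

```python
import torch


def adversarial_model_loss(model, value_fn, dataset_batch, rollout_batch, lam=3e-4):
    """Hypothetical sketch of the adversarial model objective sketched in the
    abstract: keep the model accurate on the offline dataset (maximum-likelihood
    term) while nudging its predictions toward successor states that lower the
    current policy's value (adversarial term)."""
    s_d, a_d, s_next_d = dataset_batch   # transitions logged in the offline dataset
    s_r, a_r = rollout_batch             # states/actions visited by the current policy

    # Maximum-likelihood term: the model must still predict dataset transitions.
    dist = model(s_d, a_d)               # assumed to return a torch.distributions object
    mle_loss = -dist.log_prob(s_next_d).mean()

    # Adversarial term: sample next states with the reparameterisation trick so
    # gradients flow into the model, and push those samples toward low value.
    next_states = model(s_r, a_r).rsample()
    adv_loss = value_fn(next_states).mean()

    # Minimising this loss trains the model adversarially while anchoring it
    # to the data; the policy is then optimised against the resulting model.
    return mle_loss + lam * adv_loss
```

In an alternating training loop of the kind the abstract describes, each iteration would take a gradient step on this loss for the model and a separate policy-improvement step (e.g. an actor-critic update) on rollouts generated inside the learned model.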