Paper Title

Overcoming Model Bias for Robust Offline Deep Reinforcement Learning

Authors

Phillip Swazinna, Steffen Udluft, Thomas Runkler

Abstract

State-of-the-art reinforcement learning algorithms mostly rely on being allowed to directly interact with their environment to collect millions of observations. This makes it hard to transfer their success to industrial control problems, where simulations are often very costly or do not exist, and exploring in the real environment can potentially lead to catastrophic events. Recently developed model-free, offline RL algorithms can learn from a single dataset (containing limited exploration) by mitigating extrapolation error in value functions. However, the robustness of the training process is still comparatively low, a problem known from methods using value functions. To improve robustness and stability of the learning process, we use dynamics models to assess policy performance instead of value functions, resulting in MOOSE (MOdel-based Offline policy Search with Ensembles), an algorithm which ensures low model bias by keeping the policy within the support of the data. We compare MOOSE with the state-of-the-art model-free, offline RL algorithms BRAC, BEAR and BCQ on the Industrial Benchmark and MuJoCo continuous control tasks in terms of robust performance, and find that MOOSE outperforms its model-free counterparts in almost all considered cases, often even by far.
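
The core idea described in the abstract — evaluating a policy by rolling it out through an ensemble of learned dynamics models instead of using a value function, while penalizing actions that leave the support of the offline data — can be sketched roughly as below. This is a minimal illustrative sketch, not the authors' implementation: the names (`ensemble_return`, `behavior_vae`, `reward_fn`) and the exact penalty form are assumptions made here for illustration only.

```python
# Minimal sketch (assumed names, not the authors' code): score a policy with an
# ensemble of learned dynamics models and keep its actions within data support.
import torch

def ensemble_return(policy, models, reward_fn, start_states, horizon=20, gamma=0.99):
    """Average discounted return of `policy`, estimated by rollouts in each ensemble member."""
    returns = []
    for model in models:                      # one imagined rollout per dynamics model
        s, ret, disc = start_states.clone(), 0.0, 1.0
        for _ in range(horizon):
            a = policy(s)
            s_next = model(s, a)              # learned transition model: (s, a) -> s'
            ret = ret + disc * reward_fn(s, a, s_next).mean()
            disc *= gamma
            s = s_next
        returns.append(ret)
    return torch.stack(returns).mean()        # a pessimistic variant could take the min instead

def policy_loss(policy, models, reward_fn, start_states, behavior_vae, beta=1.0):
    """Maximize model-based return, minus a penalty for out-of-distribution actions.
    `behavior_vae` is an assumed generative model trained on the dataset's (state, action) pairs;
    actions it reconstructs poorly are treated as outside the data support."""
    a = policy(start_states)
    recon = behavior_vae.reconstruct(start_states, a)
    penalty = ((a - recon) ** 2).mean()
    return -ensemble_return(policy, models, reward_fn, start_states) + beta * penalty
```

Under this reading, the ensemble replaces the value function as the performance estimate, and the reconstruction-style penalty is what "keeps the policy within the support of the data" so the rollouts stay in regions where the models have low bias.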
