Paper Title

Variational Model-based Policy Optimization

Paper Authors

Yinlam Chow, Brandon Cui, MoonKyung Ryu, Mohammad Ghavamzadeh

Paper Abstract

Model-based reinforcement learning (RL) algorithms allow us to combine model-generated data with those collected from interaction with the real system in order to alleviate the data efficiency problem in RL. However, designing such algorithms is often challenging because the bias in simulated data may overshadow the ease of data generation. A potential solution to this challenge is to jointly learn and improve model and policy using a universal objective function. In this paper, we leverage the connection between RL and probabilistic inference, and formulate such an objective function as a variational lower-bound of a log-likelihood. This allows us to use expectation maximization (EM) and iteratively fix a baseline policy and learn a variational distribution, consisting of a model and a policy (E-step), followed by improving the baseline policy given the learned variational distribution (M-step). We propose model-based and model-free policy iteration (actor-critic) style algorithms for the E-step and show how the variational distribution learned by them can be used to optimize the M-step in a fully model-based fashion. Our experiments on a number of continuous control tasks show that despite being more complex, our model-based (E-step) algorithm, called variational model-based policy optimization (VMBPO), is more sample-efficient and robust to hyper-parameter tuning than its model-free (E-step) counterpart. Using the same control tasks, we also compare VMBPO with several state-of-the-art model-based and model-free RL algorithms and show its sample efficiency and performance.
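
As a point of reference for the EM scheme the abstract describes, the following is a minimal sketch of the generic variational lower bound (ELBO) and its E-/M-step decomposition that the RL-as-inference framing alludes to; the notation (optimality variable O, trajectory tau, variational distribution q) is our own shorthand and not necessarily the paper's exact formulation.

% Sketch of the EM structure described in the abstract (notation assumed, not taken from the paper).
% O = 1 is the trajectory-optimality variable, \tau a trajectory, \pi the baseline policy,
% and q a variational trajectory distribution induced by the learned model and policy.
\log p_{\pi}(O = 1)
  \;\geq\; \mathbb{E}_{\tau \sim q}\big[\log p_{\pi}(O = 1, \tau) - \log q(\tau)\big]
  \;=\; \mathcal{L}(q, \pi)   % variational lower bound of the log-likelihood

% E-step: fix the baseline policy \pi^{(k)} and fit the variational distribution (model + policy).
q^{(k+1)} = \arg\max_{q} \; \mathcal{L}\big(q, \pi^{(k)}\big)

% M-step: improve the baseline policy given the learned variational distribution.
\pi^{(k+1)} = \arg\max_{\pi} \; \mathcal{L}\big(q^{(k+1)}, \pi\big)

The inequality follows from Jensen's inequality, and alternating the two maximizations is the standard EM recipe; the paper's contribution lies in how the E-step (model-based vs. model-free actor-critic) and the fully model-based M-step are carried out.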
