Paper Title

Low-Variance Policy Gradient Estimation with World Models

Paper Authors

Michal Nauman, Floris den Hengst

Paper Abstract

In this paper, we propose World Model Policy Gradient (WMPG), an approach to reduce the variance of policy gradient estimates using learned world models (WMs). In WMPG, a WM is trained online and used to imagine trajectories. The imagined trajectories are used in two ways: first, to calculate a without-replacement estimator of the policy gradient; second, the return of the imagined trajectories is used as an informed baseline. We compare the proposed approach with AC and MAC on a set of environments of increasing complexity (CartPole, LunarLander and Pong) and find that WMPG has better sample efficiency. Based on these results, we conclude that WMPG can yield increased sample efficiency in cases where a robust latent representation of the environment can be learned.
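
The abstract's second use of imagined trajectories can be made concrete. Below is a minimal, hypothetical Python sketch (not the authors' implementation) of a REINFORCE-style gradient whose baseline is the mean return of K trajectories imagined by a learned world model, which is the variance-reduction structure the abstract describes. The linear-softmax policy, `toy_world_model`, horizon, and all hyperparameters here are illustrative assumptions.

```python
# Hypothetical sketch: score-function policy gradient with an "informed
# baseline" averaged over K world-model rollouts. Not the authors' code.
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def imagined_return(world_model, state, policy, horizon=10):
    """Roll out the learned world model from `state` and sum the rewards."""
    total, s = 0.0, state
    for _ in range(horizon):
        probs = softmax(policy @ s)
        a = rng.choice(len(probs), p=probs)
        s, r = world_model(s, a)      # WM predicts next state and reward
        total += r
    return total

def wmpg_style_gradient(policy, state, action, observed_return,
                        world_model, k=8):
    """grad log pi(a|s) * (R - b), with b the mean of K imagined returns."""
    baseline = np.mean([imagined_return(world_model, state, policy)
                        for _ in range(k)])
    probs = softmax(policy @ state)
    # d/d(theta) log pi(a|s) for a linear-softmax policy:
    # (1[b == a] - pi(b|s)) * s for each action row b.
    grad_logp = -np.outer(probs, state)
    grad_logp[action] += state
    return grad_logp * (observed_return - baseline)

# Toy stand-in for a learned WM: linear dynamics, action-dependent reward.
def toy_world_model(s, a):
    return 0.9 * s, float(a == 0)

theta = np.zeros((2, 3))              # 2 actions, 3 state features
s0 = np.array([1.0, 0.5, -0.2])
g = wmpg_style_gradient(theta, s0, action=0, observed_return=5.0,
                        world_model=toy_world_model)
theta += 0.01 * g                     # one gradient-ascent step
```

Because the baseline depends only on the state (through imagined rollouts), not on the sampled action's realized return, subtracting it leaves the gradient estimate unbiased while shrinking its variance when the WM's returns track the true returns.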
