Paper Title

Approximation Benefits of Policy Gradient Methods with Aggregated States

Authors

Russo, Daniel

Abstract

Folklore suggests that policy gradient can be more robust to misspecification than its relative, approximate policy iteration. This paper studies the case of state-aggregated representations, where the state space is partitioned and either the policy or value function approximation is held constant over partitions. This paper shows a policy gradient method converges to a policy whose regret per-period is bounded by $ε$, the largest difference between two elements of the state-action value function belonging to a common partition. With the same representation, both approximate policy iteration and approximate value iteration can produce policies whose per-period regret scales as $ε/(1-γ)$, where $γ$ is a discount factor. Faced with inherent approximation error, methods that locally optimize the true decision-objective can be far more robust.
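
To make the setting concrete, below is a minimal, self-contained sketch of exact policy gradient with a softmax policy held constant over state partitions, run on a small random MDP. This is an illustration of the abstract's setup, not the paper's algorithm or experiments: all names and constants (`n_states`, `n_parts`, `phi`, the choice of step size, the use of the optimal Q-values as a proxy for $ε$, and the $(1-γ)$ normalization of per-period regret) are assumptions made here for the example.

```python
# Sketch: softmax policy gradient with a state-aggregated policy on a toy MDP.
# Illustrative only; not the method analyzed in the paper.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, n_parts, gamma = 12, 3, 4, 0.9

# Random MDP: transition kernel P[s, a] (a distribution over next states)
# and rewards R[s, a] in [0, 1].
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
R = rng.uniform(0.0, 1.0, size=(n_states, n_actions))

# State aggregation: phi[s] is the partition (aggregate state) containing s.
# Built so every partition is non-empty.
phi = rng.permutation(np.arange(n_states) % n_parts)

def policy_from_logits(theta):
    """Softmax policy pi(a|s) that depends on s only through its partition phi[s]."""
    logits = theta[phi]                                   # (n_states, n_actions)
    z = np.exp(logits - logits.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

def evaluate(pi):
    """Exact V^pi and Q^pi via a linear solve."""
    P_pi = np.einsum("sa,sat->st", pi, P)                 # state-to-state kernel under pi
    r_pi = (pi * R).sum(axis=1)
    V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
    return V, R + gamma * P @ V

def occupancy(pi, rho):
    """Unnormalized discounted state-occupancy measure under pi from initial dist rho."""
    P_pi = np.einsum("sa,sat->st", pi, P)
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi.T, rho)

rho = np.full(n_states, 1.0 / n_states)
theta = np.zeros((n_parts, n_actions))

# Exact policy gradient ascent on J(theta) = rho^T V^{pi_theta}.
for _ in range(5000):
    pi = policy_from_logits(theta)
    V, Q = evaluate(pi)
    d = occupancy(pi, rho)
    grad_s = d[:, None] * pi * (Q - V[:, None])           # per-state softmax PG term
    grad = np.zeros_like(theta)
    np.add.at(grad, phi, grad_s)                          # pool gradients within each partition
    theta += 0.1 * grad

# Optimal Q* by value iteration, used to measure regret and a proxy for epsilon.
Q_star = np.zeros((n_states, n_actions))
for _ in range(2000):
    Q_star = R + gamma * P @ Q_star.max(axis=1)
V_star = Q_star.max(axis=1)

V_pg, _ = evaluate(policy_from_logits(theta))
per_period_regret = (1 - gamma) * np.max(V_star - V_pg)  # one common normalization (assumption)
eps = max(Q_star[phi == k].max() - Q_star[phi == k].min() for k in range(n_parts))
print(f"per-period regret of PG policy: {per_period_regret:.4f}   proxy epsilon: {eps:.4f}")
```

Running the sketch prints the converged policy's per-period regret next to the within-partition spread of the optimal Q-values; under the abstract's bound one would expect the former to be controlled by a quantity like the latter, though this toy construction is only suggestive.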
