Paper Title

Approximation Benefits of Policy Gradient Methods with Aggregated States

Authors

Russo, Daniel

Abstract

Folklore suggests that policy gradient can be more robust to misspecification than its relative, approximate policy iteration. This paper studies the case of state-aggregated representations, where the state space is partitioned and either the policy or value function approximation is held constant over partitions. This paper shows a policy gradient method converges to a policy whose regret per-period is bounded by $ε$, the largest difference between two elements of the state-action value function belonging to a common partition. With the same representation, both approximate policy iteration and approximate value iteration can produce policies whose per-period regret scales as $ε/(1-γ)$, where $γ$ is a discount factor. Faced with inherent approximation error, methods that locally optimize the true decision-objective can be far more robust.
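
To make the setting concrete, below is a minimal, self-contained sketch of exact policy gradient with a softmax policy held constant over state partitions, run on a small random MDP. This is an illustration of the abstract's setup, not the paper's algorithm or experiments: all names and constants (`n_states`, `n_parts`, `phi`, the choice of step size, the use of the optimal Q-values as a proxy for $ε$, and the $(1-γ)$ normalization of per-period regret) are assumptions made here for the example.

```python
# Sketch: softmax policy gradient with a state-aggregated policy on a toy MDP.
# Illustrative only; not the method analyzed in the paper.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, n_parts, gamma = 12, 3, 4, 0.9

# Random MDP: transition kernel P[s, a] (a distribution over next states)
# and rewards R[s, a] in [0, 1].
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
R = rng.uniform(0.0, 1.0, size=(n_states, n_actions))

# State aggregation: phi[s] is the partition (aggregate state) containing s.
# Built so every partition is non-empty.
phi = rng.permutation(np.arange(n_states) % n_parts)

def policy_from_logits(theta):
    """Softmax policy pi(a|s) that depends on s only through its partition phi[s]."""
    logits = theta[phi]                                   # (n_states, n_actions)
    z = np.exp(logits - logits.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

def evaluate(pi):
    """Exact V^pi and Q^pi via a linear solve."""
    P_pi = np.einsum("sa,sat->st", pi, P)                 # state-to-state kernel under pi
    r_pi = (pi * R).sum(axis=1)
    V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
    return V, R + gamma * P @ V

def occupancy(pi, rho):
    """Unnormalized discounted state-occupancy measure under pi from initial dist rho."""
    P_pi = np.einsum("sa,sat->st", pi, P)
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi.T, rho)

rho = np.full(n_states, 1.0 / n_states)
theta = np.zeros((n_parts, n_actions))

# Exact policy gradient ascent on J(theta) = rho^T V^{pi_theta}.
for _ in range(5000):
    pi = policy_from_logits(theta)
    V, Q = evaluate(pi)
    d = occupancy(pi, rho)
    grad_s = d[:, None] * pi * (Q - V[:, None])           # per-state softmax PG term
    grad = np.zeros_like(theta)
    np.add.at(grad, phi, grad_s)                          # pool gradients within each partition
    theta += 0.1 * grad

# Optimal Q* by value iteration, used to measure regret and a proxy for epsilon.
Q_star = np.zeros((n_states, n_actions))
for _ in range(2000):
    Q_star = R + gamma * P @ Q_star.max(axis=1)
V_star = Q_star.max(axis=1)

V_pg, _ = evaluate(policy_from_logits(theta))
per_period_regret = (1 - gamma) * np.max(V_star - V_pg)  # one common normalization (assumption)
eps = max(Q_star[phi == k].max() - Q_star[phi == k].min() for k in range(n_parts))
print(f"per-period regret of PG policy: {per_period_regret:.4f}   proxy epsilon: {eps:.4f}")
```

Running the sketch prints the converged policy's per-period regret next to the within-partition spread of the optimal Q-values; under the abstract's bound one would expect the former to be controlled by a quantity like the latter, though this toy construction is only suggestive.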
