Paper Title

Policy Gradient Algorithms with Monte Carlo Tree Learning for Non-Markov Decision Processes

Authors

Tetsuro Morimura, Kazuhiro Ota, Kenshi Abe, Peinan Zhang

Abstract

Policy gradient (PG) is a reinforcement learning (RL) approach that optimizes a parameterized policy model for an expected return using gradient ascent. While PG can work well even in non-Markovian environments, it may encounter plateaus or peakiness issues. As another successful RL approach, algorithms based on Monte Carlo Tree Search (MCTS), which include AlphaZero, have obtained groundbreaking results, especially in the game-playing domain. They are also effective when applied to non-Markov decision processes. However, the standard MCTS is a method for decision-time planning, which differs from the online RL setting. In this work, we first introduce Monte Carlo Tree Learning (MCTL), an adaptation of MCTS for online RL setups. We then explore a combined policy approach of PG and MCTL to leverage their strengths. We derive conditions for asymptotic convergence with the results of a two-timescale stochastic approximation and propose an algorithm that satisfies these conditions and converges to a reasonable solution. Our numerical experiments validate the effectiveness of the proposed methods.
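
To make the two ingredients described in the abstract concrete, below is a minimal, hypothetical Python sketch: a softmax policy updated by a REINFORCE-style gradient ascent step (the PG side) and UCT-style Monte Carlo value/visit statistics updated on a faster timescale (standing in for the tree-learning side), mixed into a single behavior policy. The bandit environment, step-size schedules, mixing weight `beta`, and exploration constant `c_uct` are illustrative assumptions, not the paper's specification.

```python
# Minimal sketch (not the paper's exact algorithm): a PG policy and
# Monte-Carlo/UCT-style statistics combined on a toy 3-armed bandit.
# The "tree" collapses to a single root node in this bandit setting.
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.2, 0.5, 0.8])   # hypothetical arm reward means
n_actions = len(true_means)

theta = np.zeros(n_actions)   # softmax policy parameters (slow timescale)
q_mc = np.zeros(n_actions)    # Monte-Carlo value estimates (fast timescale)
counts = np.zeros(n_actions)  # visit counts, as used by a UCT-style bonus

def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

beta = 0.5    # mixing weight between the PG policy and the tree-derived policy
c_uct = 1.0   # exploration constant for the UCT-style bonus

for t in range(1, 5001):
    pi = softmax(theta)

    # UCT-style action scores from the Monte-Carlo statistics.
    bonus = c_uct * np.sqrt(np.log(t) / (counts + 1.0))
    greedy = np.zeros(n_actions)
    greedy[np.argmax(q_mc + bonus)] = 1.0

    # Behavior policy: a convex combination of the two components.
    behavior = beta * pi + (1.0 - beta) * greedy
    a = rng.choice(n_actions, p=behavior)
    r = rng.normal(true_means[a], 0.1)

    # Fast timescale: larger step size for the value statistics.
    counts[a] += 1.0
    fast_lr = t ** -0.6
    q_mc[a] += fast_lr * (r - q_mc[a])

    # Slow timescale: REINFORCE-style gradient ascent on the softmax policy,
    # with an importance weight since actions come from the behavior policy.
    grad_log_pi = -pi.copy()
    grad_log_pi[a] += 1.0
    iw = pi[a] / behavior[a]
    slow_lr = 1.0 / t
    theta += slow_lr * iw * r * grad_log_pi

print("learned PG policy:", softmax(theta).round(3))
print("MC value estimates:", q_mc.round(3))
```

The separation of step sizes (t^{-0.6} for the value statistics versus 1/t for the policy parameters) is one common way to instantiate a two-timescale stochastic approximation scheme; the convergence conditions the paper derives concern this kind of structure, but the specific schedules here are only for illustration.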
