Paper Title

On the Convergence of the Monte Carlo Exploring Starts Algorithm for Reinforcement Learning

Paper Authors

Che Wang, Shuhan Yuan, Kai Shao, Keith Ross

Paper Abstract

A simple and natural algorithm for reinforcement learning (RL) is Monte Carlo Exploring Starts (MCES), where the Q-function is estimated by averaging the Monte Carlo returns, and the policy is improved by choosing actions that maximize the current estimate of the Q-function. Exploration is performed by "exploring starts", that is, each episode begins with a randomly chosen state and action, and then follows the current policy to the terminal state. In the classic book on RL by Sutton & Barto (2018), it is stated that establishing convergence for the MCES algorithm is one of the most important remaining open theoretical problems in RL. However, the convergence question for MCES turns out to be quite nuanced. Bertsekas & Tsitsiklis (1996) provide a counter-example showing that the MCES algorithm does not necessarily converge. Tsitsiklis (2002) further shows that if the original MCES algorithm is modified so that the Q-function estimates are updated at the same rate for all state-action pairs, and the discount factor is strictly less than one, then the MCES algorithm converges. In this paper we make headway with the original and more efficient MCES algorithm given in Sutton & Barto (1998), establishing almost sure convergence for Optimal Policy Feed-Forward MDPs, which are MDPs whose states are not revisited within any episode when using an optimal policy. Such MDPs include a large class of environments such as all deterministic environments and all episodic environments with a timestep or any monotonically changing values as part of the state. Different from the previous proofs using stochastic approximations, we introduce a novel inductive approach, which is very simple and only makes use of the strong law of large numbers.
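To make the algorithm described in the abstract concrete, here is a minimal sketch of the MCES loop (exploring starts, return averaging, greedy policy improvement). The tabular environment interface (env.states, env.actions(s), env.step(s, a) returning (next_state, reward, done)) is a hypothetical assumption for illustration, not an interface from the paper or the authors' implementation.

```python
import random
from collections import defaultdict

def mces(env, num_episodes=10000, gamma=1.0):
    """Monte Carlo Exploring Starts (first-visit sketch): estimate Q by
    averaging episode returns and act greedily w.r.t. the current estimate."""
    Q = defaultdict(float)            # Q[(state, action)] -> estimated return
    returns_sum = defaultdict(float)
    returns_cnt = defaultdict(int)
    policy = {}                       # greedy policy: state -> action

    for _ in range(num_episodes):
        # Exploring start: pick the initial state-action pair at random.
        s = random.choice(env.states)            # hypothetical env attribute
        a = random.choice(env.actions(s))        # hypothetical env method
        episode = []
        while True:
            s_next, r, done = env.step(s, a)     # hypothetical env method
            episode.append((s, a, r))
            if done:
                break
            s = s_next
            # Follow the current greedy policy for the rest of the episode.
            a = policy.get(s, random.choice(env.actions(s)))

        # Backward pass: accumulate discounted returns; first-visit updates.
        G = 0.0
        for t in range(len(episode) - 1, -1, -1):
            s, a, r = episode[t]
            G = gamma * G + r
            if all((s, a) != (x, y) for x, y, _ in episode[:t]):
                returns_sum[(s, a)] += G
                returns_cnt[(s, a)] += 1
                Q[(s, a)] = returns_sum[(s, a)] / returns_cnt[(s, a)]
                # Policy improvement: greedy action at state s.
                policy[s] = max(env.actions(s), key=lambda act: Q[(s, act)])

    return Q, policy
```

Whether this unmodified update scheme converges is exactly the subtle question the paper addresses; the result establishes almost sure convergence for Optimal Policy Feed-Forward MDPs.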
