Paper Title
On the Convergence of Reinforcement Learning with Monte Carlo Exploring Starts
Paper Authors
Paper Abstract
A basic simulation-based reinforcement learning algorithm is the Monte Carlo Exploring Starts (MCES) method, also known as optimistic policy iteration, in which the value function is approximated by simulated returns and a greedy policy is selected at each iteration. The convergence of this algorithm in the general setting has been an open question. In this paper, we investigate the convergence of this algorithm for the case of undiscounted costs, also known as the stochastic shortest path problem. The results complement existing partial results on this topic and thereby help to further settle the open problem. As a side result, we also provide a proof of a version of the supermartingale convergence theorem commonly used in stochastic approximation.
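To make the algorithm described in the abstract concrete, the following is a minimal Python sketch of Monte Carlo Exploring Starts for a cost-minimization (stochastic shortest path) setting: state-action values are estimated from simulated returns, and the policy is made greedy with respect to those estimates after each episode. The `env.step(state, action)` interface returning `(next_state, cost, done)`, the first-visit averaging, and the episode count are illustrative assumptions, not the paper's exact formulation.

```python
import random
from collections import defaultdict

def mces(env, states, actions, episodes=10000, gamma=1.0):
    """Sketch of Monte Carlo Exploring Starts with undiscounted costs (gamma=1).

    Assumes a hypothetical env.step(state, action) -> (next_state, cost, done).
    """
    Q = defaultdict(float)            # state-action value estimates
    visit_count = defaultdict(int)    # counts for incremental averaging
    policy = {s: random.choice(actions) for s in states}

    for _ in range(episodes):
        # Exploring start: every (state, action) pair has positive probability.
        s, a = random.choice(states), random.choice(actions)
        episode, done = [], False
        while not done:
            s_next, cost, done = env.step(s, a)
            episode.append((s, a, cost))
            s = s_next
            a = policy.get(s, random.choice(actions))

        # Backward pass: accumulate returns; earlier time steps overwrite later
        # ones, so each pair ends up with its first-visit return.
        G, first_visit_return = 0.0, {}
        for s_t, a_t, cost in reversed(episode):
            G = cost + gamma * G
            first_visit_return[(s_t, a_t)] = G

        # Update value estimates and act greedily (minimize cost).
        for (s_t, a_t), ret in first_visit_return.items():
            visit_count[(s_t, a_t)] += 1
            Q[(s_t, a_t)] += (ret - Q[(s_t, a_t)]) / visit_count[(s_t, a_t)]
            policy[s_t] = min(actions, key=lambda b: Q[(s_t, b)])

    return Q, policy
```

The greedy step uses `min` rather than `max` because the stochastic shortest path formulation minimizes accumulated cost; with rewards instead of costs, the update would take the arg-max.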