Paper Title

On Supervised On-line Rolling-Horizon Control for Infinite-Horizon Discounted Markov Decision Processes

Paper Authors

Chang, Hyeong Soo

Paper Abstract

This note revisits the rolling-horizon control approach to the problem of a Markov decision process (MDP) with the infinite-horizon discounted expected-reward criterion. In contrast to the classical value-iteration approach, we develop an asynchronous on-line algorithm based on policy iteration integrated with the multi-policy improvement method of policy switching. A sequence of monotonically improving solutions to the forecast-horizon sub-MDP is generated by updating the current solution only at the currently visited state, building, in effect, a rolling-horizon control policy for the MDP over the infinite horizon. Feedback from "supervisors," if available, can also be incorporated during updates. We focus on the convergence issue in relation to the transition structure of the MDP. Depending on that structure, the algorithm achieves in finite time either global convergence to an optimal forecast-horizon policy or local convergence to a "locally optimal" fixed policy.
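To make the scheme concrete, below is a minimal Python sketch under assumptions of my own, not the paper's notation or implementation: a small finite MDP given by explicit tables P[s][a] (next-state distributions) and R[s][a] (one-step rewards), a forecast horizon H, and optional supervisor policies merged by policy switching. It illustrates the asynchronous update at the currently visited state only.

```python
# A sketch, not the authors' implementation: finite MDP with explicit tables,
# asynchronous policy improvement at the visited state, and policy switching
# against optional "supervisor" policies.
import random


def h_value(policy, s, h, P, R, gamma):
    """h-horizon value of following `policy` for h more steps from state s.

    Plain recursion for clarity; in practice one would memoize.
    """
    if h == 0:
        return 0.0
    a = policy[s]
    return R[s][a] + gamma * sum(
        p * h_value(policy, s2, h - 1, P, R, gamma)
        for s2, p in P[s][a].items()
    )


def improve_at(policy, s, H, P, R, gamma, supervisors=()):
    """Update the current solution only at the currently visited state s."""
    # One-step greedy improvement against the policy's (H-1)-horizon value.
    policy[s] = max(
        P[s],
        key=lambda a: R[s][a] + gamma * sum(
            p * h_value(policy, s2, H - 1, P, R, gamma)
            for s2, p in P[s][a].items()
        ),
    )
    # Policy switching: adopt a supervisor's action at s if the supervisor's
    # H-horizon value at s beats that of the improved policy.
    best = h_value(policy, s, H, P, R, gamma)
    for sup in supervisors:
        v = h_value(sup, s, H, P, R, gamma)
        if v > best:
            policy[s], best = sup[s], v


def rolling_horizon_control(s, steps, H, P, R, gamma, supervisors=()):
    """On-line loop: improve at the visited state, act, move to the next state."""
    policy = {state: next(iter(P[state])) for state in P}  # arbitrary start
    total, discount = 0.0, 1.0
    for _ in range(steps):
        improve_at(policy, s, H, P, R, gamma, supervisors)
        a = policy[s]
        total += discount * R[s][a]
        discount *= gamma
        nexts, probs = zip(*P[s][a].items())
        s = random.choices(nexts, weights=probs)[0]
    return total
```

Under this reading, each visit to a state can only improve the H-horizon action taken there, which is the monotone-improvement property the abstract refers to; the first action of the current H-horizon policy is what gets executed at each step of the infinite-horizon run.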
