Reinforcement Learning for POMDP: Partitioned Rollout and Policy Iteration with Application to Autonomous Sequential Repair Problems

Paper Title

Reinforcement Learning for POMDP: Partitioned Rollout and Policy Iteration with Application to Autonomous Sequential Repair Problems

Authors

Sushmita Bhattacharya, Sahil Badyal, Thomas Wheeler, Stephanie Gil, Dimitri Bertsekas

Abstract

In this paper we consider infinite horizon discounted dynamic programming problems with finite state and control spaces, and partial state observations. We discuss an algorithm that uses multistep lookahead, truncated rollout with a known base policy, and a terminal cost function approximation. This algorithm is also used for policy improvement in an approximate policy iteration scheme, where successive policies are approximated by using a neural network classifier. A novel feature of our approach is that it is well suited for distributed computation through an extended belief space formulation and the use of a partitioned architecture, which is trained with multiple neural networks. We apply our methods in simulation to a class of sequential repair problems where a robot inspects and repairs a pipeline with potentially several rupture sites under partial information about the state of the pipeline.
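To make the ingredients named in the abstract concrete, the following is a minimal sketch of one-step lookahead with truncated rollout on the belief space of a toy two-state repair problem. It is an illustration under stated assumptions, not the authors' implementation: the transition matrix `P`, observation matrix `O`, stage costs `G`, discount factor `ALPHA`, the threshold base policy, and the zero terminal cost are all hypothetical, and expectations over observations are computed exactly rather than by simulation. The paper's multistep lookahead, neural network policy approximation, and partitioned architecture are not shown.

```python
import numpy as np

ALPHA = 0.9  # discount factor (assumed value)

# Hypothetical toy model: state 0 = intact, state 1 = ruptured;
# control 0 = wait, control 1 = repair.
# P[u][x, y]: probability of moving from state x to state y under control u.
P = np.array([
    [[0.9, 0.1], [0.0, 1.0]],  # wait: intact may rupture, a rupture persists
    [[1.0, 0.0], [1.0, 0.0]],  # repair: the site is always intact afterwards
])
# O[u][y, z]: probability of observing z after landing in state y under u.
O = np.array([
    [[0.8, 0.2], [0.2, 0.8]],
    [[0.8, 0.2], [0.2, 0.8]],
])
# G[x, u]: stage cost of applying control u in state x.
G = np.array([[0.0, 2.0], [5.0, 2.0]])


def stage_cost(b, u):
    """Expected stage cost of control u under belief b."""
    return float(b @ G[:, u])


def successors(b, u):
    """Return (observation probability, Bayes-updated belief) pairs."""
    pred = b @ P[u]  # predicted next-state distribution
    out = []
    for z in range(O.shape[2]):
        joint = pred * O[u][:, z]
        pz = joint.sum()
        if pz > 0:
            out.append((pz, joint / pz))
    return out


def truncated_rollout(b, base_policy, steps, terminal_cost):
    """Cost of following base_policy for `steps` stages, then terminal_cost."""
    if steps == 0:
        return terminal_cost(b)
    u = base_policy(b)
    future = sum(pz * truncated_rollout(nb, base_policy, steps - 1, terminal_cost)
                 for pz, nb in successors(b, u))
    return stage_cost(b, u) + ALPHA * future


def rollout_control(b, base_policy, steps, terminal_cost):
    """One-step lookahead: pick the control with the smallest Q-factor."""
    q = []
    for u in range(P.shape[0]):
        future = sum(pz * truncated_rollout(nb, base_policy, steps, terminal_cost)
                     for pz, nb in successors(b, u))
        q.append(stage_cost(b, u) + ALPHA * future)
    return int(np.argmin(q)), q


def base_policy(b):
    """Hypothetical base policy: repair once a rupture is at least as likely as not."""
    return 1 if b[1] >= 0.5 else 0


def terminal_cost(b):
    """Placeholder terminal cost approximation (assumed to be zero here)."""
    return 0.0


if __name__ == "__main__":
    belief = np.array([0.0, 1.0])  # the site is known to be ruptured
    control, q_factors = rollout_control(belief, base_policy, 2, terminal_cost)
    print(control, q_factors)
```

In an approximate policy iteration scheme of the kind the abstract describes, pairs of beliefs and rollout controls produced this way would serve as training data for a classifier that represents the improved policy; that training step is omitted from this sketch.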
