Title
Near Optimality of Finite Memory Feedback Policies in Partially Observed Markov Decision Processes
Authors
Abstract
In the theory of Partially Observed Markov Decision Processes (POMDPs), the existence of optimal policies has in general been established by converting the original partially observed stochastic control problem to a fully observed one on the belief space, leading to a belief-MDP. However, computing an optimal policy for this fully observed model, and hence for the original POMDP, using classical dynamic or linear programming methods is challenging even if the original system has finite state and action spaces, since the state space of the fully observed belief-MDP model is always uncountable. Furthermore, there exist very few rigorous value function approximation and optimal policy approximation results, as the required regularity conditions often entail a tedious study of spaces of probability measures, leading to properties such as Feller continuity. In this paper, we study a planning problem for POMDPs where the system dynamics and measurement channel model are assumed to be known. We construct an approximate belief model by discretizing the belief space using only finite window information variables. We then find optimal policies for the approximate model, and we rigorously establish near optimality of the constructed finite window control policies in POMDPs under mild non-linear filter stability conditions and the assumption that the measurement and action sets are finite (and the state space is real vector-valued). We also establish a rate of convergence result which relates the finite window memory size and the approximation error bound, where the rate of convergence is exponential under explicit and testable exponential filter stability conditions. While there exist many experimental results and a few rigorous asymptotic convergence results, an explicit rate of convergence result is, to our knowledge, new in the literature.
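The construction described in the abstract can be sketched concretely. The snippet below is a minimal illustration, not the paper's construction in full: the model parameters (a two-state, two-action, two-observation POMDP), the fixed prior, and the window length are all hypothetical. It resets the Bayes filter to a fixed prior, computes beliefs from only the last N action/observation pairs, and runs value iteration on the resulting finite window-MDP to obtain a finite-memory policy.

```python
import itertools
import numpy as np

# Illustrative toy POMDP (hypothetical parameters, not from the paper):
# 2 hidden states, 2 actions, 2 observations.
P = np.array([[[0.9, 0.1],      # P[a, s, s']: transition kernel
               [0.2, 0.8]],
              [[0.5, 0.5],
               [0.3, 0.7]]])
O = np.array([[0.8, 0.2],       # O[s, y]: observation channel
              [0.3, 0.7]])
R = np.array([[1.0, 0.0],       # R[s, a]: stage reward
              [0.0, 1.0]])
beta = 0.9                      # discount factor
N = 2                           # finite window length
prior = np.array([0.5, 0.5])    # fixed prior the predictor is reset to


def belief_from_window(window):
    """Bayes filter started from the fixed prior and run only over the
    last N (action, observation) pairs: the finite-window belief."""
    b = prior.copy()
    for a, y in window:
        b = b @ P[a]            # prediction step
        b = b * O[:, y]         # measurement update
        b = b / b.sum()
    return b


# States of the approximate (window) MDP: all length-N (action, obs) windows.
pairs = [(a, y) for a in range(2) for y in range(2)]
windows = list(itertools.product(pairs, repeat=N))

# Value iteration on the finite window-MDP.
V = {w: 0.0 for w in windows}
for _ in range(300):
    V_new = {}
    for w in windows:
        b = belief_from_window(w)
        q_values = []
        for a in range(2):
            bp = b @ P[a]                       # predicted state distribution
            q = b @ R[:, a]                     # expected stage reward
            for y in range(2):
                p_y = bp @ O[:, y]              # prob. of next observation
                w_next = w[1:] + ((a, y),)      # slide the window
                q += beta * p_y * V[w_next]
            q_values.append(q)
        V_new[w] = max(q_values)
    V = V_new


def greedy_action(w):
    """Greedy finite-window policy: maps a window to an action."""
    b = belief_from_window(w)
    best_a, best_q = 0, -np.inf
    for a in range(2):
        bp = b @ P[a]
        q = b @ R[:, a] + beta * sum((bp @ O[:, y]) * V[w[1:] + ((a, y),)]
                                     for y in range(2))
        if q > best_q:
            best_a, best_q = a, q
    return best_a


policy = {w: greedy_action(w) for w in windows}
```

The window-MDP has only (|A| x |Y|)^N states (16 here), so classical dynamic programming applies directly; per the abstract, filter stability is what controls the gap between this finite-memory policy and the true optimal policy, with the error decaying in the window size N.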