Paper Title
Future memories are not needed for large classes of POMDPs
Paper Authors
Paper Abstract
Optimal policies for partially observed Markov decision processes (POMDPs) are history-dependent: decisions are made based on the entire history of observations. Memoryless policies, which make decisions based only on the last observation, are generally considered useless in the literature, because one can construct POMDP instances for which optimal memoryless policies are arbitrarily worse than history-dependent ones. Our purpose is to challenge this belief. We show that optimal memoryless policies can be computed efficiently using mixed integer linear programming (MILP), and that they perform reasonably well on a wide range of instances from the literature. When strengthened with valid inequalities, the linear relaxation of this MILP provides high-quality upper bounds on the value of an optimal history-dependent policy. Furthermore, a model predictive control approach that uses a finite-horizon POMDP problem with memoryless policies as its rolling optimization problem leads to an efficient history-dependent policy, which we call the short memory in the future (SMF) policy. In essence, the SMF policy leverages these memoryless policies to build an approximation of the Bellman value function. Numerical experiments show the efficiency of our approach on benchmark instances from the literature.
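To make the rolling-horizon idea behind the SMF policy concrete, here is a minimal, self-contained Python sketch. It is not the paper's method: where the paper solves an MILP to find an optimal memoryless policy of the rolling finite-horizon problem, the sketch simply enumerates all deterministic memoryless policies by brute force on a made-up tiger-like toy instance. The instance data, the lookahead horizon, and the exact way the rolling problem is seeded from the current belief are assumptions for illustration only.

```python
# Illustrative sketch only (not the paper's MILP): for a toy POMDP we enumerate
# deterministic memoryless policies by brute force and plug them into a
# model-predictive-control loop in the spirit of the SMF policy.
# The tiger-like instance below is made up for illustration.
import itertools
import numpy as np

rng = np.random.default_rng(0)

# --- Toy POMDP (hypothetical instance) -------------------------------------
n_s, n_a, n_o = 2, 3, 2
P = np.zeros((n_s, n_a, n_s))   # P[s, a, s'] = transition probability
Q = np.zeros((n_s, n_o))        # Q[s, o]     = observation probability
r = np.zeros((n_s, n_a))        # r[s, a]     = immediate reward

P[:, 0, :] = P[:, 1, :] = 0.5   # opening a door resets the hidden state
P[0, 2, 0] = P[1, 2, 1] = 1.0   # listening leaves the state unchanged
Q[0, 0] = Q[1, 1] = 0.85        # noisy observation of the hidden state
Q[0, 1] = Q[1, 0] = 0.15
r[0, 0] = r[1, 1] = 10.0        # opening the good door
r[0, 1] = r[1, 0] = -100.0      # opening the bad door
r[:, 2] = -1.0                  # cost of listening
b0 = np.array([0.5, 0.5])       # initial belief

# --- Exact evaluation of a deterministic memoryless policy -----------------
def evaluate_memoryless(policy, belief, horizon):
    """policy[o] = action taken whenever observation o is received."""
    b, value = belief.copy(), 0.0
    for _ in range(horizon):
        for s in range(n_s):
            for o in range(n_o):
                value += b[s] * Q[s, o] * r[s, policy[o]]
        b_next = np.zeros(n_s)
        for s in range(n_s):
            for o in range(n_o):
                b_next += b[s] * Q[s, o] * P[s, policy[o], :]
        b = b_next
    return value

def best_memoryless(belief, horizon):
    """Brute-force stand-in for the paper's MILP (only viable on tiny toys)."""
    return max(itertools.product(range(n_a), repeat=n_o),
               key=lambda pol: evaluate_memoryless(pol, belief, horizon))

# --- Bayes filter -----------------------------------------------------------
def update_belief(belief, obs, action):
    """Condition on the received observation, then propagate through the action."""
    posterior = belief * Q[:, obs]
    posterior /= posterior.sum()
    return posterior @ P[:, action, :]

# --- SMF-style rolling-horizon simulation ----------------------------------
def run_smf(lookahead=3, n_steps=10):
    s = rng.choice(n_s, p=b0)
    belief, total = b0.copy(), 0.0
    for _ in range(n_steps):
        obs = rng.choice(n_o, p=Q[s])
        policy = best_memoryless(belief, lookahead)  # rolling optimization
        a = policy[obs]                              # apply its current action
        total += r[s, a]
        belief = update_belief(belief, obs, a)
        s = rng.choice(n_s, p=P[s, a])
    return total

print("Total reward over one simulated run:", run_smf())
```

The point of the sketch is the structure of the loop: at every step the agent re-solves a short-horizon problem restricted to memoryless policies from its current belief and applies only the action prescribed for the current observation, which is what makes the resulting closed-loop policy history-dependent even though each rolling subproblem is memoryless.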