关于马尔可夫决策过程中的Skolem-Hards和饱和点

论文标题

关于马尔可夫决策过程中的Skolem-Hards和饱和点

On Skolem-hardness and saturation points in Markov decision processes

论文作者

Piribauer, Jakob, Baier, Christel

论文摘要

线性复发序列的Skolem问题和相关的阳性问题是杰出的数字理论问题，其可决定性已经开放了数十年。在本文中，马尔可夫决策过程（MDP）的一系列优化问题的固有数学难度是通过从积极问题到相关决策问题的减少来表明的，这些决策问题表明，这些问题至少与Skolem问题一样困难。所考虑的优化问题是根据预期的部分或条件累积的重量的两种非古典路径问题（SSPP）的两个非经典变化，对累积重量的有条件价值危险的优化以及解决路径属性的长期满意度的两个问题，即常规频率属性和模型的优化型号和模型的优化。为了证明后两个问题的阳性 - 舒适性，引入了一种新的辅助路径度量，称为加权长期频率，并引入了相应决策问题的阳性率作为中间步骤。对于具有非负重的MDP的部分和条件SSPP，以及优化约束可及性属性（A U B）的长期概率（A U B），已知解决方案依赖于对累积的重量或连续访问的识别的识别，或者是对某些SATE的连续访问数量（称为饱和点，称为饱和点，从而对最佳记忆不记忆力进行了效果。在本文中，还表明，在具有非负权重的MDP上，对经典SSPP的条件值危险的优化以及在伪造的时间内求解了具有非负权重的长期频率。

The Skolem problem and the related Positivity problem for linear recurrence sequences are outstanding number-theoretic problems whose decidability has been open for many decades. In this paper, the inherent mathematical difficulty of a series of optimization problems on Markov decision processes (MDPs) is shown by a reduction from the Positivity problem to the associated decision problems which establishes that the problems are also at least as hard as the Skolem problem as an immediate consequence. The optimization problems under consideration are two non-classical variants of the stochastic shortest path problem (SSPP) in terms of expected partial or conditional accumulated weights, the optimization of the conditional value-at-risk for accumulated weights, and two problems addressing the long-run satisfaction of path properties, namely the optimization of long-run probabilities of regular co-safety properties and the model-checking problem of the logic frequency-LTL. To prove the Positivity- and hence Skolem-hardness for the latter two problems, a new auxiliary path measure, called weighted long-run frequency, is introduced and the Positivity-hardness of the corresponding decision problem is shown as an intermediate step. For the partial and conditional SSPP on MDPs with non-negative weights and for the optimization of long-run probabilities of constrained reachability properties (a U b), solutions are known that rely on the identification of a bound on the accumulated weight or the number of consecutive visits to certain sates, called a saturation point, from which on optimal schedulers behave memorylessly. In this paper, it is shown that also the optimization of the conditional value-at-risk for the classical SSPP and of weighted long-run frequencies on MDPs with non-negative weights can be solved in pseudo-polynomial time exploiting the existence of a saturation point.

下载PDF全文

下载文献需遵守相关版权规定

论文标题