A Sample-Efficient Algorithm for Episodic Finite-Horizon MDP with Constraints

Paper Title

A Sample-Efficient Algorithm for Episodic Finite-Horizon MDP with Constraints

Authors

Kalagarla, Krishna C.; Jain, Rahul; Nuzzo, Pierluigi

Abstract

Constrained Markov Decision Processes (CMDPs) formalize sequential decision-making problems whose objective is to minimize a cost function while satisfying constraints on various cost functions. In this paper, we consider the setting of episodic fixed-horizon CMDPs. We propose an online algorithm which leverages the linear programming formulation of finite-horizon CMDPs for repeated optimistic planning to provide a probably approximately correct (PAC) guarantee on the number of episodes needed to ensure an $ε$-optimal policy, i.e., a policy whose objective value is within $ε$ of the optimal value and which satisfies the constraints within $ε$-tolerance, with probability at least $1-δ$. The number of episodes needed is shown to be of the order $\tilde{\mathcal{O}}\big(\frac{|S||A|C^{2}H^{2}}{ε^{2}}\log\frac{1}{δ}\big)$, where $C$ is the upper bound on the number of possible successor states for a state-action pair. Therefore, if $C \ll |S|$, the number of episodes needed has a linear dependence on the state and action space sizes $|S|$ and $|A|$, and a quadratic dependence on the time horizon $H$.
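
To make the scaling in the stated bound concrete, the following minimal sketch evaluates only its leading term, $\frac{|S||A|C^{2}H^{2}}{ε^{2}}\log\frac{1}{δ}$. It is an illustration, not the authors' algorithm: the function name `episode_bound` and the numeric inputs are hypothetical, and the constants and polylogarithmic factors hidden by $\tilde{\mathcal{O}}$ are dropped, so only ratios between two calls carry meaning.

```python
import math


def episode_bound(n_states, n_actions, max_successors, horizon, eps, delta):
    """Leading term |S||A|C^2 H^2 / eps^2 * log(1/delta) of the stated bound.

    Assumption: constants and the polylog factors hidden by tilde-O are
    dropped, so only ratios between two calls are meaningful.
    """
    if not (eps > 0 and 0 < delta < 1):
        raise ValueError("need eps > 0 and 0 < delta < 1")
    return (n_states * n_actions * max_successors**2 * horizon**2
            / eps**2 * math.log(1.0 / delta))


# Hypothetical problem sizes, chosen only for illustration.
base = episode_bound(100, 4, 3, 10, 0.1, 0.05)

# Doubling |S| with C fixed doubles the leading term (linear dependence).
ratio_states = episode_bound(200, 4, 3, 10, 0.1, 0.05) / base

# Doubling H quadruples it (quadratic dependence).
ratio_horizon = episode_bound(100, 4, 3, 20, 0.1, 0.05) / base

# If every state can follow every pair (C = |S|), doubling |S| costs 8x.
ratio_dense = (episode_bound(200, 4, 200, 10, 0.1, 0.05)
               / episode_bound(100, 4, 100, 10, 0.1, 0.05))

print(ratio_states, ratio_horizon, ratio_dense)
```

The last ratio shows why the condition $C \ll |S|$ matters: when transitions are dense, the $C^{2}$ factor turns the linear dependence on $|S|$ into a cubic one.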
