Paper Title

A Sample-Efficient Algorithm for Episodic Finite-Horizon MDP with Constraints

Authors

Kalagarla, Krishna C., Jain, Rahul, Nuzzo, Pierluigi

Abstract

Constrained Markov Decision Processes (CMDPs) formalize sequential decision-making problems whose objective is to minimize a cost function while satisfying constraints on various other cost functions. In this paper, we consider the setting of episodic fixed-horizon CMDPs. We propose an online algorithm that leverages the linear programming formulation of finite-horizon CMDPs for repeated optimistic planning, and we provide a probably approximately correct (PAC) guarantee on the number of episodes needed to ensure an $\varepsilon$-optimal policy, i.e., a policy whose objective value is within $\varepsilon$ of the optimal value and which satisfies the constraints within $\varepsilon$-tolerance, with probability at least $1-\delta$. The number of episodes needed is shown to be of the order $\tilde{\mathcal{O}}\big(\frac{|S||A|C^{2}H^{2}}{\varepsilon^{2}}\log\frac{1}{\delta}\big)$, where $C$ is the upper bound on the number of possible successor states for a state-action pair. Therefore, if $C \ll |S|$, the number of episodes needed has a linear dependence on the state and action space sizes $|S|$ and $|A|$, respectively, and a quadratic dependence on the time horizon $H$.
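
To get a feel for how the stated bound scales, the following is a minimal sketch that evaluates only its leading term, $\frac{|S||A|C^{2}H^{2}}{\varepsilon^{2}}\log\frac{1}{\delta}$. The function name and the example numbers are hypothetical, and the constants and polylogarithmic factors hidden by $\tilde{\mathcal{O}}$ are omitted, so the values are useful only for relative comparisons, not as actual episode counts.

```python
import math


def episode_bound_leading_term(n_states, n_actions, max_successors,
                               horizon, epsilon, delta):
    """Leading term |S||A| C^2 H^2 / eps^2 * log(1/delta).

    Hypothetical helper: constants and polylog factors hidden by the
    O-tilde notation are NOT included, so only ratios are meaningful.
    """
    if epsilon <= 0 or not (0 < delta < 1):
        raise ValueError("need epsilon > 0 and 0 < delta < 1")
    size_term = n_states * n_actions * max_successors ** 2 * horizon ** 2
    return size_term / epsilon ** 2 * math.log(1.0 / delta)


# Illustrative (made-up) problem sizes: |S| = 100, |A| = 4, H = 10.
dense = episode_bound_leading_term(100, 4, 100, 10, 0.1, 0.05)   # C = |S|
sparse = episode_bound_leading_term(100, 4, 2, 10, 0.1, 0.05)    # C << |S|

# The ratio depends only on C: (100 / 2) ** 2 = 2500, up to rounding.
ratio = dense / sparse

# Doubling the horizon multiplies the leading term by 4 (quadratic in H).
longer = episode_bound_leading_term(100, 4, 2, 20, 0.1, 0.05)
horizon_factor = longer / sparse
```

The comparison illustrates the abstract's closing remark: when each state-action pair has few possible successors ($C \ll |S|$), the $C^{2}$ factor stays small and the leading term grows only linearly in $|S|$ and $|A|$.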
