Paper Title

Upper Confidence Primal-Dual Reinforcement Learning for CMDP with Adversarial Loss

Authors

Shuang Qiu, Xiaohan Wei, Zhuoran Yang, Jieping Ye, Zhaoran Wang

Abstract

We consider online learning for episodic stochastically constrained Markov decision processes (CMDPs), which plays a central role in ensuring the safety of reinforcement learning. Here the loss function can vary arbitrarily across the episodes, and both the loss received and the budget consumption are revealed at the end of each episode. Previous works solve this problem under the restrictive assumption that the transition model of the Markov decision processes (MDPs) is known a priori and establish regret bounds that depend polynomially on the cardinalities of the state space $\mathcal{S}$ and the action space $\mathcal{A}$. In this work, we propose a new \emph{upper confidence primal-dual} algorithm, which only requires the trajectories sampled from the transition model. In particular, we prove that the proposed algorithm achieves $\widetilde{\mathcal{O}}(L|\mathcal{S}|\sqrt{|\mathcal{A}|T})$ upper bounds of both the regret and the constraint violation, where $L$ is the length of each episode. Our analysis incorporates a new high-probability drift analysis of Lagrange multiplier processes into the celebrated regret analysis of upper confidence reinforcement learning, which demonstrates the power of "optimism in the face of uncertainty" in constrained online learning.
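
For reference, here is a minimal sketch of the two quantities the abstract bounds; the notation is illustrative and may differ from the paper's exact definitions. Over $K$ episodes of length $L$ (so $T = LK$), with adversarial loss $f_k$ in episode $k$, a budget-consumption function $g$ and budget $c$, the learner's policies $\pi_k$ are compared against the best fixed feasible policy $\pi^\ast$:

$$\mathrm{Regret}(T) = \sum_{k=1}^{K} \Big( V^{\pi_k}(s_1; f_k) - V^{\pi^\ast}(s_1; f_k) \Big), \qquad \mathrm{Violation}(T) = \Big[ \sum_{k=1}^{K} \big( V^{\pi_k}(s_1; g) - c \big) \Big]_{+},$$

where $V^{\pi}(s_1; h)$ denotes the expected cumulative value of $h$ over one episode under policy $\pi$ starting from $s_1$, and $[x]_+ = \max(x, 0)$. Under this reading, the paper's main result controls both quantities by $\widetilde{\mathcal{O}}(L|\mathcal{S}|\sqrt{|\mathcal{A}|T})$.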
