Paper Title
Exploration-Exploitation in Constrained MDPs
Paper Authors
Paper Abstract
In many sequential decision-making problems, the goal is to optimize a utility function while satisfying a set of constraints on different utilities. This learning problem is formalized through Constrained Markov Decision Processes (CMDPs). In this paper, we investigate the exploration-exploitation dilemma in CMDPs. While learning in an unknown CMDP, an agent must trade off exploration, to discover new information about the MDP, against exploitation of its current knowledge to maximize the reward while satisfying the constraints. Although the agent will eventually learn a good or optimal policy, we do not want it to violate the constraints too often during the learning process. In this work, we analyze two approaches for learning in CMDPs. The first approach leverages the linear formulation of the CMDP to perform optimistic planning at each episode. The second approach leverages the dual (or saddle-point) formulation of the CMDP to perform incremental, optimistic updates of the primal and dual variables. We show that both achieve sublinear regret w.r.t.\ the main utility while also incurring only sublinear regret on the constraint violations. That said, we highlight a crucial difference between the two approaches: the linear programming approach yields stronger guarantees than the approach based on the dual formulation.
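For concreteness, the LaTeX sketch below spells out the two formulations alluded to in the abstract and the associated regret notions. The notation is assumed rather than taken from the paper: value functions $V_r^{\pi}$ and $V_{g_i}^{\pi}$, constraint thresholds $\tau_i$, occupancy measures $q$, and multipliers $\lambda_i$ are illustrative placeholders.

% A minimal sketch, in assumed notation, of the CMDP objective, its linear
% (occupancy-measure) formulation, its saddle-point (Lagrangian) formulation,
% and the two regret notions discussed in the abstract. Requires amsmath.
\begin{align*}
  % Constrained objective: maximize the reward value subject to utility constraints.
  & \max_{\pi} \; V_r^{\pi}
    \quad \text{s.t.} \quad V_{g_i}^{\pi} \ge \tau_i, \qquad i = 1, \dots, m, \\[4pt]
  % Linear formulation: optimize over the set \mathcal{Q} of valid occupancy
  % measures q(s,a) (flow and normalization constraints); objective and
  % constraints become linear, enabling optimistic planning via an LP.
  & \max_{q \in \mathcal{Q}} \; \sum_{s,a} q(s,a)\, r(s,a)
    \quad \text{s.t.} \quad \sum_{s,a} q(s,a)\, g_i(s,a) \ge \tau_i, \\[4pt]
  % Dual / saddle-point formulation: dualize the constraints with multipliers
  % \lambda_i \ge 0 and update the primal (policy) and dual variables incrementally.
  & \max_{\pi} \; \min_{\lambda \ge 0} \;
    V_r^{\pi} + \sum_{i=1}^{m} \lambda_i \bigl( V_{g_i}^{\pi} - \tau_i \bigr), \\[4pt]
  % Regret over K episodes: reward regret against an optimal feasible policy
  % \pi^\star, and cumulative constraint violation; both are shown sublinear in K.
  & \mathrm{Reg}_r(K) = \sum_{k=1}^{K} \bigl( V_r^{\pi^\star} - V_r^{\pi_k} \bigr),
    \qquad
    \mathrm{Reg}_{g_i}(K) = \sum_{k=1}^{K} \bigl[ \tau_i - V_{g_i}^{\pi_k} \bigr]_{+}.
\end{align*}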