Improved Exploration in Factored Average-Reward MDPs

Paper Title

Improved Exploration in Factored Average-Reward MDPs

Authors

Mohammad Sadegh Talebi, Anders Jonsson, Odalric-Ambrym Maillard

Abstract

We consider a regret minimization task under the average-reward criterion in an unknown Factored Markov Decision Process (FMDP). More specifically, we consider an FMDP where the state-action space $\mathcal X$ and the state-space $\mathcal S$ admit the respective factored forms of $\mathcal X = \otimes_{i=1}^n \mathcal X_i$ and $\mathcal S=\otimes_{i=1}^m \mathcal S_i$, and the transition and reward functions are factored over $\mathcal X$ and $\mathcal S$. Assuming known factorization structure, we introduce a novel regret minimization strategy inspired by the popular UCRL2 strategy, called DBN-UCRL, which relies on Bernstein-type confidence sets defined for individual elements of the transition function. We show that for a generic factorization structure, DBN-UCRL achieves a regret bound, whose leading term strictly improves over existing regret bounds in terms of the dependencies on the size of $\mathcal S_i$'s and the involved diameter-related terms. We further show that when the factorization structure corresponds to the Cartesian product of some base MDPs, the regret of DBN-UCRL is upper bounded by the sum of regret of the base MDPs. We demonstrate, through numerical experiments on standard environments, that DBN-UCRL enjoys substantially improved regret empirically over existing algorithms that have frequentist regret guarantees.
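The factored form mentioned in the abstract can be illustrated with a small sketch. It is not DBN-UCRL itself, only the structural assumption it exploits: the probability of the next state is a product of per-factor probabilities, each depending on a small part of the state-action pair. The toy environment, the scope sets `SCOPES`, and the helper names `factor_kernel` and `transition_prob` are all hypothetical and chosen purely for illustration; none of them come from the paper.

```python
from itertools import product

# Hypothetical toy FMDP (an assumption for illustration, not from the paper):
# a state-action pair is x = (s1, s2, a), each component binary, so
# X = X_1 x X_2 x X_3 and S = S_1 x S_2.
STATE_FACTORS = [(0, 1), (0, 1)]

# Hypothetical scopes: factor i of the next state depends only on x[j] for j in SCOPES[i].
SCOPES = [(0, 2), (1, 2)]


def factor_kernel(local):
    """Distribution of one next-state factor given its local inputs (s_i, a)."""
    s_i, a = local
    stay = 0.9 if a == 0 else 0.2  # illustrative numbers only
    return {s_i: stay, 1 - s_i: 1.0 - stay}


def transition_prob(x, s_next):
    """P(s_next | x) as a product over factors of P_i(s_next[i] | x restricted to scope i)."""
    p = 1.0
    for i, scope in enumerate(SCOPES):
        local = tuple(x[j] for j in scope)
        p *= factor_kernel(local)[s_next[i]]
    return p


if __name__ == "__main__":
    x = (0, 1, 0)
    for s_next in product(*STATE_FACTORS):
        print(x, "->", s_next, transition_prob(x, s_next))
```

In this sketch each factor is described by a table over 4 local inputs rather than over all 8 state-action pairs. Savings of this kind are what make confidence sets defined on individual elements of the transition function attractive when the structure is known.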
