Paper Title

Reinforcement Learning in Factored MDPs: Oracle-Efficient Algorithms and Tighter Regret Bounds for the Non-Episodic Setting

Paper Authors

Ziping Xu, Ambuj Tewari

Abstract

We study reinforcement learning in non-episodic factored Markov decision processes (FMDPs). We propose two near-optimal and oracle-efficient algorithms for FMDPs. Assuming oracle access to an FMDP planner, they enjoy a Bayesian and a frequentist regret bound respectively, both of which reduce to the near-optimal bound $\widetilde{O}(DS\sqrt{AT})$ for standard non-factored MDPs. We propose a tighter connectivity measure, factored span, for FMDPs and prove a lower bound that depends on the factored span rather than the diameter $D$. In order to decrease the gap between lower and upper bounds, we propose an adaptation of the REGAL.C algorithm whose regret bound depends on the factored span. Our oracle-efficient algorithms outperform previously proposed near-optimal algorithms on computer network administration simulations.
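To make the factored structure concrete, below is a minimal, illustrative Python sketch of an FMDP transition model: each state factor's next value depends only on a small scope of state factors plus the action, so the dynamics are specified by small conditional tables rather than one table over the full (exponentially large) state space. The class name, the random Dirichlet parameterization, and the three-machine ring example (loosely modeled on the computer network administration domain mentioned above) are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

class FactoredTransitionModel:
    """Illustrative factored MDP transition model (not the paper's code)."""

    def __init__(self, factor_sizes, num_actions, scopes):
        # factor_sizes[i]: number of values state factor i can take
        # scopes[i]: indices of the state factors that factor i's dynamics depend on
        self.factor_sizes = factor_sizes
        self.num_actions = num_actions
        self.scopes = scopes
        # One conditional probability table per factor:
        # shape = (sizes of scope factors..., num_actions, factor_sizes[i])
        self.tables = []
        for i, scope in enumerate(scopes):
            cond_shape = tuple(factor_sizes[j] for j in scope) + (num_actions,)
            table = np.random.dirichlet(np.ones(factor_sizes[i]), size=cond_shape)
            self.tables.append(table)

    def sample_next_state(self, state, action, rng=np.random):
        # FMDP assumption: next-state factors are conditionally independent
        # given their scopes and the action.
        next_state = []
        for i, scope in enumerate(self.scopes):
            parent_values = tuple(state[j] for j in scope)
            probs = self.tables[i][parent_values + (action,)]
            next_state.append(rng.choice(self.factor_sizes[i], p=probs))
        return tuple(next_state)

# Example: 3 binary machines in a ring; each machine's next status depends on
# itself and its left neighbour, with 2 actions (e.g., reboot one machine or do nothing).
model = FactoredTransitionModel(
    factor_sizes=[2, 2, 2],
    num_actions=2,
    scopes=[(0, 2), (1, 0), (2, 1)],
)
print(model.sample_next_state((1, 0, 1), action=0))
```

Because each conditional table involves only a factor's scope, the number of parameters grows with the scope sizes rather than with the full state space, which is what the factored regret bounds exploit relative to the non-factored $\widetilde{O}(DS\sqrt{AT})$ rate.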
