Paper Title

Provably Efficient Reinforcement Learning for Discounted MDPs with Feature Mapping

Authors

Dongruo Zhou, Jiafan He, Quanquan Gu

Abstract

Modern tasks in reinforcement learning have large state and action spaces. To deal with them efficiently, one often uses a predefined feature mapping to represent states and actions in a low-dimensional space. In this paper, we study reinforcement learning for discounted Markov Decision Processes (MDPs), where the transition kernel can be parameterized as a linear function of a certain feature mapping. We propose a novel algorithm that makes use of the feature mapping and obtains a $\tilde O(d\sqrt{T}/(1-\gamma)^2)$ regret, where $d$ is the dimension of the feature space, $T$ is the time horizon, and $\gamma$ is the discount factor of the MDP. To the best of our knowledge, this is the first polynomial regret bound without accessing the generative model or making strong assumptions such as ergodicity of the MDP. By constructing a special class of MDPs, we also show that for any algorithm, the regret is lower bounded by $\Omega(d\sqrt{T}/(1-\gamma)^{1.5})$. Our upper and lower bound results together suggest that the proposed reinforcement learning algorithm is near-optimal up to a $(1-\gamma)^{-0.5}$ factor.
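To make the setting concrete, here is a minimal sketch of the linear transition model and the regret notion referred to in the abstract; the exact definitions follow the paper, and the notation below (feature map $\phi$, unknown parameter $\theta^*$, policy $\pi_t$) is an assumed standard formalization rather than a quotation. The transition kernel is assumed to satisfy

$$\mathbb{P}(s' \mid s, a) = \langle \phi(s, a, s'), \theta^* \rangle, \qquad \phi: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}^d, \quad \theta^* \in \mathbb{R}^d,$$

and the regret over $T$ steps is measured against the optimal discounted value function $V^*$ as

$$\mathrm{Regret}(T) = \sum_{t=1}^{T} \big( V^*(s_t) - V^{\pi_t}(s_t) \big),$$

where $\pi_t$ is the policy the algorithm follows at step $t$. The upper bound stated in the abstract controls this quantity by $\tilde O(d\sqrt{T}/(1-\gamma)^2)$.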
