Paper Title
On the Sample Complexity of Reinforcement Learning with Policy Space Generalization
Paper Authors
Paper Abstract
We study the optimal sample complexity of large-scale Reinforcement Learning (RL) problems with policy space generalization, i.e., the agent has prior knowledge that the optimal policy lies in a known policy space. Existing results show that without a generalization model, the sample complexity of an RL algorithm inevitably depends on the cardinalities of the state space and the action space, which are intractably large in many practical problems. To avoid such undesirable dependence on the state and action space sizes, this paper proposes a new notion of eluder dimension for the policy space, which characterizes the intrinsic complexity of policy learning in an arbitrary Markov Decision Process (MDP). Using a simulator oracle, we prove a near-optimal sample complexity upper bound that depends only linearly on the eluder dimension. We further prove a similar regret bound in deterministic systems without the simulator.
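For background, the paper's policy-space notion presumably adapts the standard eluder dimension of a function class (Russo & Van Roy, 2013); the paper's exact definition may differ, but a minimal sketch of the standard version reads as follows. For a class $\mathcal{F} \subseteq \{f : \mathcal{X} \to \mathbb{R}\}$, an element $x \in \mathcal{X}$ is $\epsilon$-dependent on $x_1, \dots, x_n \in \mathcal{X}$ if every pair $f, \tilde{f} \in \mathcal{F}$ satisfying
$$\sqrt{\sum_{i=1}^{n} \big(f(x_i) - \tilde{f}(x_i)\big)^2} \;\le\; \epsilon \qquad \text{also satisfies} \qquad \big|f(x) - \tilde{f}(x)\big| \;\le\; \epsilon,$$
and $\epsilon$-independent of $x_1, \dots, x_n$ otherwise. The eluder dimension $\dim_E(\mathcal{F}, \epsilon)$ is then the length of the longest sequence in $\mathcal{X}$ such that, for some $\epsilon' \ge \epsilon$, every element is $\epsilon'$-independent of its predecessors. Intuitively, it bounds how long observations can keep "surprising" the learner, which is why sample complexity can scale with this dimension rather than with the sizes of the state and action spaces.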