Paper Title
Kernel-Based Reinforcement Learning: A Finite-Time Analysis
Paper Authors
Paper Abstract
We consider the exploration-exploitation dilemma in finite-horizon reinforcement learning problems whose state-action space is endowed with a metric. We introduce Kernel-UCBVI, a model-based optimistic algorithm that leverages the smoothness of the MDP and a non-parametric kernel estimator of the rewards and transitions to efficiently balance exploration and exploitation. For problems with $K$ episodes and horizon $H$, we provide a regret bound of $\widetilde{O}\left( H^3 K^{\frac{2d}{2d+1}}\right)$, where $d$ is the covering dimension of the joint state-action space. This is the first regret bound for kernel-based RL using smoothing kernels, an approach that requires very weak assumptions on the MDP and has previously been applied to a wide range of tasks. We empirically validate our approach in continuous MDPs with sparse rewards.
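The abstract refers to a non-parametric smoothing-kernel estimator of the rewards and transitions built from past observations in a metric state-action space. As a rough illustration only (not the paper's algorithm, which additionally uses optimistic exploration bonuses and a transition estimator), here is a minimal sketch of such a kernel-weighted reward estimate; the Gaussian kernel, Euclidean metric, bandwidth, and all function names below are assumptions made for the example.

```python
import numpy as np

def gaussian_kernel(dist, bandwidth):
    # Smoothing kernel applied to state-action distances (illustrative choice).
    return np.exp(-0.5 * (dist / bandwidth) ** 2)

def kernel_reward_estimate(query_sa, past_sa, past_rewards, bandwidth=0.1, reg=1e-8):
    """Kernel-weighted estimate of the mean reward at a query state-action pair.

    past_sa: (n, dim) array of previously visited state-action pairs.
    past_rewards: (n,) array of the rewards observed at those pairs.
    This is a hypothetical sketch; Kernel-UCBVI builds optimism on top of
    estimates of this kind rather than using them directly.
    """
    dists = np.linalg.norm(past_sa - query_sa, axis=1)  # metric on the state-action space
    weights = gaussian_kernel(dists, bandwidth)
    total = weights.sum() + reg                         # regularize to avoid division by zero
    return weights @ past_rewards / total

# Toy usage: 2-dimensional state-action pairs in [0, 1]^2 with noisy rewards.
rng = np.random.default_rng(0)
past_sa = rng.uniform(0.0, 1.0, size=(50, 2))
past_rewards = np.sin(3.0 * past_sa[:, 0]) + 0.1 * rng.standard_normal(50)
print(kernel_reward_estimate(np.array([0.5, 0.5]), past_sa, past_rewards))
```

The estimate at a query point is a distance-weighted average of nearby observed rewards, which is how smoothness of the MDP translates into generalization across the continuous state-action space.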