Paper Title
Kernel-Based Reinforcement Learning: A Finite-Time Analysis
Paper Authors
Paper Abstract
We consider the exploration-exploitation dilemma in finite-horizon reinforcement learning problems whose state-action space is endowed with a metric. We introduce Kernel-UCBVI, a model-based optimistic algorithm that leverages the smoothness of the MDP and a non-parametric kernel estimator of the rewards and transitions to efficiently balance exploration and exploitation. For problems with $K$ episodes and horizon $H$, we provide a regret bound of $\widetilde{O}\left( H^3 K^{\frac{2d}{2d+1}}\right)$, where $d$ is the covering dimension of the joint state-action space. This is the first regret bound for kernel-based RL using smoothing kernels, an approach that requires very weak assumptions on the MDP and has previously been applied to a wide range of tasks. We empirically validate our approach in continuous MDPs with sparse rewards.
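The abstract refers to a non-parametric smoothing-kernel estimator of the rewards and transitions built from past observations in a metric state-action space. As a rough illustration only (not the paper's algorithm, which additionally uses optimistic exploration bonuses and a transition estimator), here is a minimal sketch of such a kernel-weighted reward estimate; the Gaussian kernel, Euclidean metric, bandwidth, and all function names below are assumptions made for the example.

```python
import numpy as np

def gaussian_kernel(dist, bandwidth):
    # Smoothing kernel applied to state-action distances (illustrative choice).
    return np.exp(-0.5 * (dist / bandwidth) ** 2)

def kernel_reward_estimate(query_sa, past_sa, past_rewards, bandwidth=0.1, reg=1e-8):
    """Kernel-weighted estimate of the mean reward at a query state-action pair.

    past_sa: (n, dim) array of previously visited state-action pairs.
    past_rewards: (n,) array of the rewards observed at those pairs.
    This is a hypothetical sketch; Kernel-UCBVI builds optimism on top of
    estimates of this kind rather than using them directly.
    """
    dists = np.linalg.norm(past_sa - query_sa, axis=1)  # metric on the state-action space
    weights = gaussian_kernel(dists, bandwidth)
    total = weights.sum() + reg                         # regularize to avoid division by zero
    return weights @ past_rewards / total

# Toy usage: 2-dimensional state-action pairs in [0, 1]^2 with noisy rewards.
rng = np.random.default_rng(0)
past_sa = rng.uniform(0.0, 1.0, size=(50, 2))
past_rewards = np.sin(3.0 * past_sa[:, 0]) + 0.1 * rng.standard_normal(50)
print(kernel_reward_estimate(np.array([0.5, 0.5]), past_sa, past_rewards))
```

The estimate at a query point is a distance-weighted average of nearby observed rewards, which is how smoothness of the MDP translates into generalization across the continuous state-action space.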