Paper Title

Value Function Approximations via Kernel Embeddings for No-Regret Reinforcement Learning

Paper Authors

Sayak Ray Chowdhury, Rafael Oliveira

Paper Abstract

We consider the regret minimization problem in reinforcement learning (RL) in the episodic setting. In many real-world RL environments, the state and action spaces are continuous or very large. Existing approaches establish regret guarantees through either a low-dimensional representation of the stochastic transition model or an approximation of the $Q$-functions. However, the understanding of function approximation schemes for state-value functions largely remains missing. In this paper, we propose an online model-based RL algorithm, namely CME-RL, that learns representations of transition distributions as embeddings in a reproducing kernel Hilbert space while carefully balancing the exploitation-exploration tradeoff. We demonstrate the efficiency of our algorithm by proving a frequentist (worst-case) regret bound that is of order $\tilde{O}\big(H\gamma_N\sqrt{N}\big)$\footnote{$\tilde{O}(\cdot)$ hides only absolute constants and poly-logarithmic factors.}, where $H$ is the episode length, $N$ is the total number of time steps, and $\gamma_N$ is an information-theoretic quantity related to the effective dimension of the state-action feature space. Our method bypasses the need for estimating transition probabilities and applies to any domain on which kernels can be defined. It also brings new insights into the general theory of kernel methods for approximate inference and RL regret minimization.
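To illustrate the kind of computation the abstract refers to (value backups through a conditional mean embedding of the transition distribution, without estimating transition probabilities), the sketch below shows a generic kernel-ridge-regression CME estimator. It is not the paper's CME-RL algorithm: the RBF kernel, the regularizer `lam`, the helper names `rbf_kernel`/`cme_backup`, and the toy data are all assumptions made for this example.

```python
# Minimal sketch of a conditional mean embedding (CME) value backup.
# Assumed setup: transitions (s, a) -> s' observed as rows of X (state-action
# features) and Y (next states); V is any value function evaluated at samples.
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0):
    """RBF kernel matrix between the rows of A and the rows of B."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq_dists / lengthscale**2)

def cme_backup(X, Y, V, x_query, lam=0.1):
    """Estimate E[V(s') | (s, a) = x_query] directly from transition samples.

    The CME estimator yields weights beta(x) = (K + N*lam*I)^{-1} k(x), so the
    backup is a weighted sum of V at the observed next states; no transition
    probabilities are ever estimated.
    """
    N = X.shape[0]
    K = rbf_kernel(X, X)                      # Gram matrix on state-action pairs
    k_x = rbf_kernel(X, x_query[None, :])     # kernel vector at the query point
    beta = np.linalg.solve(K + N * lam * np.eye(N), k_x)  # (N, 1) CME weights
    return float(beta[:, 0] @ V(Y))

# Toy usage: 1-D states and actions, a placeholder linear value function.
rng = np.random.default_rng(0)
S = rng.normal(size=(50, 1))
A = rng.normal(size=(50, 1))
X = np.hstack([S, A])                             # state-action features
Y = S + 0.5 * A + 0.1 * rng.normal(size=(50, 1))  # observed next states
V = lambda s_next: s_next[:, 0]                   # hypothetical value function
print(cme_backup(X, Y, V, x_query=np.array([0.2, -0.1])))
```

The same weights `beta(x)` can back up any value function over the observed next states, which is why this style of estimator applies to any domain on which a kernel can be defined.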
