Paper Title
Nearly Minimax Optimal Reinforcement Learning for Linear Mixture Markov Decision Processes
Paper Authors
Paper Abstract
We study reinforcement learning (RL) with linear function approximation where the underlying transition probability kernel of the Markov decision process (MDP) is a linear mixture model (Jia et al., 2020; Ayoub et al., 2020; Zhou et al., 2020) and the learning agent has access to either an integration or a sampling oracle of the individual basis kernels. We propose a new Bernstein-type concentration inequality for self-normalized martingales for linear bandit problems with bounded noise. Based on the new inequality, we propose a new, computationally efficient algorithm with linear function approximation named $\text{UCRL-VTR}^{+}$ for the aforementioned linear mixture MDPs in the episodic undiscounted setting. We show that $\text{UCRL-VTR}^{+}$ attains an $\tilde O(dH\sqrt{T})$ regret, where $d$ is the dimension of the feature mapping, $H$ is the length of an episode, and $T$ is the number of interactions with the MDP. We also prove a matching lower bound $\Omega(dH\sqrt{T})$ for this setting, which shows that $\text{UCRL-VTR}^{+}$ is minimax optimal up to logarithmic factors. In addition, we propose the $\text{UCLK}^{+}$ algorithm for the same family of MDPs under discounting and show that it attains an $\tilde O(d\sqrt{T}/(1-\gamma)^{1.5})$ regret, where $\gamma \in [0,1)$ is the discount factor. Our upper bound matches the lower bound $\Omega(d\sqrt{T}/(1-\gamma)^{1.5})$ proved by Zhou et al. (2020) up to logarithmic factors, suggesting that $\text{UCLK}^{+}$ is nearly minimax optimal. To the best of our knowledge, these are the first computationally efficient, nearly minimax optimal algorithms for RL with linear function approximation.
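For context, the display below sketches the linear mixture MDP assumption and the regret criterion referenced in the abstract. It follows the standard formulation in the cited works (Ayoub et al., 2020; Zhou et al., 2020); the symbols $\phi$, $\theta^*$, $B$, and $K$ are the usual conventions for this setting rather than notation taken from this abstract.
$$
\mathbb{P}(s' \mid s, a) = \langle \phi(s' \mid s, a), \theta^* \rangle, \qquad \theta^* \in \mathbb{R}^d, \quad \|\theta^*\|_2 \le B,
$$
where $\phi(\cdot \mid \cdot, \cdot) \in \mathbb{R}^d$ is a known feature mapping built from the basis kernels. In the episodic setting with $K$ episodes of length $H$ (so $T = KH$), the regret against the optimal policy is
$$
\mathrm{Regret}(K) = \sum_{k=1}^{K} \Big[ V_1^{*}(s_1^{k}) - V_1^{\pi_k}(s_1^{k}) \Big],
$$
and the $\tilde O(dH\sqrt{T})$ and $\Omega(dH\sqrt{T})$ bounds above refer to this quantity.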