Paper Title
Sample Efficient Reinforcement Learning via Low-Rank Matrix Estimation
Paper Authors
Paper Abstract
We consider the question of learning the $Q$-function in a sample-efficient manner for reinforcement learning with continuous state and action spaces under a generative model. If the $Q$-function is Lipschitz continuous, then the minimal sample complexity for estimating an $ε$-optimal $Q$-function is known to scale as $Ω(\frac{1}{ε^{d_1+d_2+2}})$ per classical non-parametric learning theory, where $d_1$ and $d_2$ denote the dimensions of the state and action spaces respectively. The $Q$-function, when viewed as a kernel, induces a Hilbert-Schmidt operator and hence possesses a square-summable spectrum. This motivates us to consider a parametric class of $Q$-functions parameterized by their "rank" $r$, which contains all Lipschitz $Q$-functions as $r \to \infty$. As our key contribution, we develop a simple, iterative learning algorithm that finds an $ε$-optimal $Q$-function with sample complexity of $\widetilde{O}(\frac{1}{ε^{\max(d_1, d_2)+2}})$ when the optimal $Q$-function has low rank $r$ and the discount factor $γ$ is below a certain threshold. Thus, this provides an exponential improvement in sample complexity. To enable our result, we develop a novel Matrix Estimation algorithm that faithfully estimates an unknown low-rank matrix in the $\ell_\infty$ sense even in the presence of arbitrary bounded noise, which might be of interest in its own right. Empirical results on several stochastic control tasks confirm the efficacy of our "low-rank" algorithms.
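To make the high-level idea of the abstract concrete, below is a minimal, hedged sketch (not the authors' algorithm, whose Matrix Estimation step is not specified in the abstract): it illustrates the general pattern of iterating Bellman backups on a sparse set of sampled (state, action) pairs from a generative model and propagating them to the full $Q$-matrix via a low-rank estimate. The toy problem, the names `n_states`, `n_actions`, `rank`, `gamma`, and the use of truncated SVD as a stand-in for the paper's $\ell_\infty$-faithful matrix estimator are all illustrative assumptions.

```python
# Hedged sketch: low-rank Q-value iteration on a toy 1-D state / 1-D action problem.
# The SVD-based completion step is a stand-in, NOT the paper's Matrix Estimation algorithm.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, rank, gamma = 50, 50, 3, 0.5  # assumed toy sizes, rank, discount

# Toy generative model: deterministic reward, random next state.
states = np.linspace(-1.0, 1.0, n_states)
actions = np.linspace(-1.0, 1.0, n_actions)

def generative_model(s_idx, a_idx):
    reward = -abs(states[s_idx] - actions[a_idx])
    s_next = rng.integers(n_states)  # sampled next state
    return reward, s_next

def low_rank_complete(Q_obs, mask, r):
    """Stand-in matrix-estimation step: fill unobserved entries with the mean
    of observed ones, then project onto the best rank-r approximation."""
    Q_fill = Q_obs.copy()
    Q_fill[~mask] = Q_obs[mask].mean()
    U, S, Vt = np.linalg.svd(Q_fill, full_matrices=False)
    return (U[:, :r] * S[:r]) @ Vt[:r]

Q = np.zeros((n_states, n_actions))
for _ in range(50):
    # Sample Bellman backups only on a sparse random set of (s, a) pairs.
    mask = rng.random((n_states, n_actions)) < 0.2
    Q_obs = np.zeros_like(Q)
    for s_idx, a_idx in zip(*np.nonzero(mask)):
        r, s_next = generative_model(s_idx, a_idx)
        Q_obs[s_idx, a_idx] = r + gamma * Q[s_next].max()
    # Propagate the sparse observations to the full matrix via the low-rank estimate.
    Q = low_rank_complete(Q_obs, mask, rank)

print("greedy value at center state:", Q[n_states // 2].max())
```

The key design point the sketch tries to convey is that only a vanishing fraction of (state, action) pairs is queried per iteration, with the low-rank structure used to recover the remaining entries, which is how the sample-complexity dependence on $d_1 + d_2$ can drop to $\max(d_1, d_2)$.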