Paper Title
Can Temporal-Difference and Q-Learning Learn Representation? A Mean-Field Theory
Paper Authors
Paper Abstract
Temporal-difference and Q-learning play a key role in deep reinforcement learning, where they are empowered by expressive nonlinear function approximators such as neural networks. At the core of their empirical successes is the learned feature representation, which embeds rich observations, e.g., images and texts, into the latent space that encodes semantic structures. Meanwhile, the evolution of such a feature representation is crucial to the convergence of temporal-difference and Q-learning. In particular, temporal-difference learning converges when the function approximator is linear in a feature representation, which is fixed throughout learning, and possibly diverges otherwise. We aim to answer the following questions: When the function approximator is a neural network, how does the associated feature representation evolve? If it converges, does it converge to the optimal one? We prove that, utilizing an overparameterized two-layer neural network, temporal-difference and Q-learning globally minimize the mean-squared projected Bellman error at a sublinear rate. Moreover, the associated feature representation converges to the optimal one, generalizing the previous analysis of Cai et al. (2019) in the neural tangent kernel regime, where the associated feature representation stabilizes at the initial one. The key to our analysis is a mean-field perspective, which connects the evolution of a finite-dimensional parameter to its limiting counterpart over an infinite-dimensional Wasserstein space. Our analysis generalizes to soft Q-learning, which is further connected to policy gradient.
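To make the setting concrete, below is a minimal sketch (not from the paper) of semi-gradient TD(0) with an overparameterized two-layer network under a mean-field-style parameterization, run on a toy continuing task. The toy environment, the 1/m output scaling, the width-rescaled step size, and all names such as q_value and td_update are illustrative assumptions; the paper's guarantees concern the continuous-time, infinite-width limit over the Wasserstein space rather than this finite discretization.

    import numpy as np

    rng = np.random.default_rng(0)

    m, d = 1024, 4           # network width (number of "particles") and state dimension
    gamma, lr = 0.9, 0.05    # discount factor and step size

    # Two-layer network Q(s) = (1/m) * sum_i a_i * relu(w_i . s).
    # The mean-field view tracks the empirical distribution of the particles (a_i, w_i).
    W = rng.normal(size=(m, d))
    a = rng.choice([-1.0, 1.0], size=m)

    def q_value(s):
        h = np.maximum(W @ s, 0.0)   # hidden features relu(w_i . s)
        return float(a @ h) / m      # 1/m output scaling (mean-field convention)

    def td_update(s, r, s_next):
        """One semi-gradient TD(0) step: only Q(s), not the bootstrapped target, is differentiated."""
        global a, W
        h = np.maximum(W @ s, 0.0)
        delta = r + gamma * q_value(s_next) - q_value(s)   # TD error
        # Per-particle gradients of Q(s) are O(1/m); rescaling the step by the width m
        # lets each particle move at an O(1) rate, in the spirit of mean-field analyses.
        grad_a = h
        grad_W = np.outer(a * (W @ s > 0.0), s)
        a += lr * delta * grad_a
        W += lr * delta * grad_W

    # Toy task (illustrative): states are resampled i.i.d. each step and the reward is the
    # first coordinate, so the true value function is simply V(s) = s[0].
    s = rng.uniform(-1.0, 1.0, size=d)
    for _ in range(20000):
        s_next = rng.uniform(-1.0, 1.0, size=d)
        td_update(s, s[0], s_next)
        s = s_next

    test = np.array([0.5, 0.0, 0.0, 0.0])
    print("learned Q:", q_value(test), " target V:", test[0])

In the neural tangent kernel regime mentioned in the abstract, the induced features relu(w_i . s) would remain essentially fixed at their initialization; under the mean-field scaling sketched here, the particle distribution itself evolves, which is the representation-learning effect the paper analyzes.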