Paper Title
On Reward-Free Reinforcement Learning with Linear Function Approximation
Paper Authors
Paper Abstract
Reward-free reinforcement learning (RL) is a framework which is suitable for both the batch RL setting and the setting where there are many reward functions of interest. During the exploration phase, an agent collects samples without using a pre-specified reward function. After the exploration phase, a reward function is given, and the agent uses the samples collected during the exploration phase to compute a near-optimal policy. Jin et al. [2020] showed that in the tabular setting, the agent only needs to collect a polynomial number of samples (in terms of the number of states, the number of actions, and the planning horizon) for reward-free RL. However, in practice, the number of states and actions can be large, and thus function approximation schemes are required for generalization. In this work, we give both positive and negative results for reward-free RL with linear function approximation. We give an algorithm for reward-free RL in the linear Markov decision process setting, where both the transition and the reward admit linear representations. The sample complexity of our algorithm is polynomial in the feature dimension and the planning horizon, and is completely independent of the number of states and actions. We further give an exponential lower bound for reward-free RL in the setting where only the optimal $Q$-function admits a linear representation. Our results imply several interesting exponential separations on the sample complexity of reward-free RL.
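For reference, the linear Markov decision process condition mentioned in the abstract can be sketched in its standard form (the feature map $\phi$, dimension $d$, and per-step parameters $\mu_h$, $\theta_h$ below are notation assumed here for illustration, not taken from the abstract itself): for every step $h$ of the horizon $H$,
\[
P_h(s' \mid s, a) = \langle \phi(s, a), \mu_h(s') \rangle,
\qquad
r_h(s, a) = \langle \phi(s, a), \theta_h \rangle,
\]
where $\phi : \mathcal{S} \times \mathcal{A} \to \mathbb{R}^d$ is a known feature map, $\mu_h$ is a vector of $d$ (signed) measures over states, and $\theta_h \in \mathbb{R}^d$. Under a condition of this form, the sample complexity claimed in the abstract depends polynomially on $d$ and $H$ only, with no dependence on the number of states or actions.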