Paper Title

Computational-Statistical Gaps in Reinforcement Learning

Paper Authors

Daniel Kane, Sihan Liu, Shachar Lovett, Gaurav Mahajan

Paper Abstract

Reinforcement learning with function approximation has recently achieved tremendous results in applications with large state spaces. This empirical success has motivated a growing body of theoretical work proposing necessary and sufficient conditions under which efficient reinforcement learning is possible. From this line of work, a remarkably simple minimal sufficient condition has emerged for sample-efficient reinforcement learning: MDPs whose optimal value functions $V^*$ and $Q^*$ are linear in some known low-dimensional features. In this setting, recent works have designed sample-efficient algorithms that require a number of samples polynomial in the feature dimension and independent of the size of the state space. However, they leave finding computationally efficient algorithms as future work, and this is considered a major open problem in the community. In this work, we make progress on this open problem by presenting the first computational lower bound for RL with linear function approximation: unless NP = RP, no randomized polynomial-time algorithm exists for deterministic-transition MDPs with a constant number of actions and linear optimal value functions. To prove this, we show a reduction from Unique-Sat, in which we convert a CNF formula into an MDP with deterministic transitions, a constant number of actions, and low-dimensional linear optimal value functions. This result also exhibits the first computational-statistical gap in reinforcement learning with linear function approximation: the underlying statistical problem is information-theoretically solvable with a polynomial number of queries, yet no computationally efficient algorithm exists unless NP = RP. Finally, we also prove a quasi-polynomial time lower bound under the Randomized Exponential Time Hypothesis.
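
For concreteness, the linear realizability condition the abstract refers to can be stated as follows; the notation below ($\phi$, $\theta^*$, $d$) is a common formulation of this assumption rather than necessarily the paper's own. There exists a known feature map $\phi : \mathcal{S} \times \mathcal{A} \to \mathbb{R}^d$ with $d \ll |\mathcal{S}|$ and an unknown weight vector $\theta^* \in \mathbb{R}^d$ such that, for all states $s$ and actions $a$,
\[
Q^*(s, a) = \langle \phi(s, a), \theta^* \rangle, \qquad V^*(s) = \max_{a \in \mathcal{A}} Q^*(s, a).
\]

The sketch below is a minimal toy instance of this setting, not the paper's Unique-Sat construction: in a horizon-1 MDP, $Q^* = r$, so choosing the reward linear in a known feature map makes the assumption hold exactly. The feature map and weight vector here are hypothetical choices for illustration only.

```python
import numpy as np

# Toy instance of the linear-Q* setting (illustrative only; NOT the
# Unique-Sat reduction from the paper).
d = 3                                  # feature dimension
theta = np.array([1.0, -0.5, 2.0])     # weight vector, unknown to a learner

def phi(state: int, action: int) -> np.ndarray:
    """Known d-dimensional feature map phi(s, a) (hypothetical)."""
    return np.array([1.0, float(state % 2), float(action)])

def reward(state: int, action: int) -> float:
    # At horizon 1, Q*(s, a) = r(s, a), so a reward linear in phi makes
    # the linear realizability assumption hold exactly.
    return float(phi(state, action) @ theta)

def step(state: int, action: int) -> int:
    """Deterministic transition, matching the deterministic-MDP setting."""
    return 2 * state + action

def greedy_action(state: int, actions=(0, 1)) -> int:
    # A planner that knew theta acts optimally with one argmax per state.
    # The paper's point: *finding* a near-optimal policy from interaction
    # is computationally hard (unless NP = RP) despite this structure.
    return max(actions, key=lambda a: float(phi(state, a) @ theta))

s = 0
a = greedy_action(s)                # picks action 1: <phi(0,1), theta> = 3.0
print(a, reward(s, a), step(s, a))  # -> 1 3.0 1
```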
