Paper Title

Online Model Selection for Reinforcement Learning with Function Approximation

Authors

Lee, Jonathan N., Pacchiano, Aldo, Muthukumar, Vidya, Kong, Weihao, Brunskill, Emma

Abstract

Deep reinforcement learning has achieved impressive successes yet often requires a very large amount of interaction data. This result is perhaps unsurprising, as using complicated function approximation often requires more data to fit, and early theoretical results on linear Markov decision processes provide regret bounds that scale with the dimension of the linear approximation. Ideally, we would like to automatically identify the minimal dimension of the approximation that is sufficient to encode an optimal policy. Towards this end, we consider the problem of model selection in RL with function approximation, given a set of candidate RL algorithms with known regret guarantees. The learner's goal is to adapt to the complexity of the optimal algorithm without knowing it a priori. We present a meta-algorithm that successively rejects increasingly complex models using a simple statistical test. Given at least one candidate that satisfies realizability, we prove the meta-algorithm adapts to the optimal complexity with $\tilde{O}(L^{5/6} T^{2/3})$ regret compared to the optimal candidate's $\tilde{O}(\sqrt{T})$ regret, where $T$ is the number of episodes and $L$ is the number of algorithms. The dimension and horizon dependencies remain optimal with respect to the best candidate, and our meta-algorithmic approach is flexible to incorporate multiple candidate algorithms and models. Finally, we show that the meta-algorithm automatically admits significantly improved instance-dependent regret bounds that depend on the gaps between the maximal values attainable by the candidates.
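The core idea in the abstract, successively rejecting increasingly complex candidates via a statistical test, can be sketched in a few lines. This is only an illustrative toy, not the paper's actual procedure: the `Candidate` interface, the `probe_value` estimate, the test period of 100 episodes, and the `threshold` gap are all hypothetical stand-ins for the paper's candidate algorithms and test statistic.

```python
class Candidate:
    """Toy stand-in for a candidate RL algorithm of some complexity.

    In the paper each candidate comes with its own regret guarantee;
    here we reduce it to a fixed attainable value for illustration.
    """

    def __init__(self, value):
        self.value = value

    def run_episode(self):
        # Return the per-episode reward this candidate attains.
        return self.value

    def probe_value(self):
        # Optimistic estimate of the value this candidate could attain.
        return self.value


def meta_select(candidates, T, threshold=0.05):
    """Run the simplest candidate; reject it when a simple statistical
    test suggests a more complex candidate attains noticeably more value.

    `candidates` is ordered by increasing complexity. Returns the index
    of the candidate in use after T episodes.
    """
    i = 0
    rewards = []
    for t in range(T):
        rewards.append(candidates[i].run_episode())
        # Every 100 episodes, compare the running mean reward against the
        # best value any more complex candidate claims to attain.
        if t % 100 == 99 and i + 1 < len(candidates):
            mean = sum(rewards) / len(rewards)
            best_rest = max(c.probe_value() for c in candidates[i + 1:])
            if best_rest - mean > threshold:
                # Gap detected: reject model i, move to the next one.
                i += 1
                rewards = []
    return i
```

For example, with candidates attaining values 0.2 and 0.9, the test detects the 0.7 gap after the first test point and switches to the more complex model. The instance-dependent bounds mentioned in the abstract arise precisely because a large gap makes this rejection happen quickly.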
