Title

A Provably Efficient Sample Collection Strategy for Reinforcement Learning

Authors

Jean Tarbouriech, Matteo Pirotta, Michal Valko, Alessandro Lazaric

Abstract

One of the challenges in online reinforcement learning (RL) is that the agent needs to trade off the exploration of the environment and the exploitation of the samples to optimize its behavior. Whether we optimize for regret, sample complexity, state-space coverage or model estimation, we need to strike a different exploration-exploitation trade-off. In this paper, we propose to tackle the exploration-exploitation problem following a decoupled approach composed of: 1) An "objective-specific" algorithm that (adaptively) prescribes how many samples to collect at which states, as if it has access to a generative model (i.e., a simulator of the environment); 2) An "objective-agnostic" sample collection exploration strategy responsible for generating the prescribed samples as fast as possible. Building on recent methods for exploration in the stochastic shortest path problem, we first provide an algorithm that, given as input the number of samples $b(s,a)$ needed in each state-action pair, requires $\tilde{O}(B D + D^{3/2} S^2 A)$ time steps to collect the $B=\sum_{s,a} b(s,a)$ desired samples, in any unknown communicating MDP with $S$ states, $A$ actions and diameter $D$. Then we show how this general-purpose exploration algorithm can be paired with "objective-specific" strategies that prescribe the sample requirements to tackle a variety of settings -- e.g., model estimation, sparse reward discovery, goal-free cost-free exploration in communicating MDPs -- for which we obtain improved or novel sample complexity guarantees.
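To make the decoupling described in the abstract concrete, here is a minimal Python sketch, not the authors' implementation: the names `prescribe_samples`, `collect_samples`, and the `env` interface (`reset`, `step`, `num_actions`) are hypothetical, and the greedy collection loop merely stands in for the paper's stochastic-shortest-path-based exploration strategy.

```python
from collections import defaultdict

def prescribe_samples(S, A, objective):
    """Objective-specific step: decide how many samples b(s, a) to collect
    at each state-action pair, as if a generative model were available.
    For illustration this just returns a uniform per-pair budget."""
    budget_per_pair = objective.get("per_pair_budget", 1)
    return {(s, a): budget_per_pair for s in range(S) for a in range(A)}

def collect_samples(env, b):
    """Objective-agnostic step: interact with the unknown MDP until every
    requirement b(s, a) is met.  The paper's algorithm plans routes to
    under-sampled pairs via SSP-style exploration; this sketch simply acts
    greedily toward the most under-sampled action in the current state."""
    counts = defaultdict(int)
    s = env.reset()
    while any(counts[sa] < b[sa] for sa in b):
        # Pick the action with the largest remaining requirement in state s.
        a = max(range(env.num_actions), key=lambda a: b[(s, a)] - counts[(s, a)])
        s_next, reward = env.step(a)
        counts[(s, a)] += 1
        s = s_next
    return counts  # counts[(s, a)] >= b[(s, a)] for all pairs on return
```

The point of the interface is that `prescribe_samples` can be swapped for any objective-specific rule (model estimation, sparse reward discovery, reward-free exploration) while `collect_samples` stays unchanged.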
