Paper Title
Fast active learning for pure exploration in reinforcement learning
Paper Authors
Paper Abstract
Realistic environments often provide agents with very limited feedback. When the environment is initially unknown, feedback can be completely absent in the beginning, and the agents may first choose to devote all their effort to exploring efficiently. Exploration remains a challenge: it has been addressed with many hand-tuned heuristics of varying generality on one side, and a few theoretically backed exploration strategies on the other. Many of them are incarnated by intrinsic motivation and in particular exploration bonuses. A common rule of thumb for exploration bonuses is to use a $1/\sqrt{n}$ bonus added to the empirical estimates of the reward, where $n$ is the number of times this particular state (or state-action pair) has been visited. We show that, surprisingly, for the pure-exploration objective of reward-free exploration, bonuses that scale with $1/n$ bring faster learning rates, improving the known upper bounds with respect to the dependence on the horizon $H$. Furthermore, we show that with an improved analysis of the stopping time, we can improve by a factor of $H$ the sample complexity in the best-policy identification setting, which is another pure-exploration objective, where the environment provides rewards but the agent is not penalized for its behavior during the exploration phase.
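To make the two bonus scalings discussed in the abstract concrete, below is a minimal sketch (not the paper's actual algorithm) of count-based exploration bonuses added to empirical reward estimates in a tabular setting. The function name `exploration_bonus`, the `scale` parameter, and the random toy data are illustrative assumptions.

```python
# Minimal sketch (illustrative only, not the paper's algorithm):
# count-based exploration bonuses added to empirical reward estimates
# for a tabular MDP.
import numpy as np

def exploration_bonus(n_visits, scale=1.0, mode="sqrt"):
    """Count-based bonus for state-action pairs visited n_visits times.

    mode="sqrt"   -> the common 1/sqrt(n) rule of thumb;
    mode="linear" -> the 1/n scaling the abstract argues yields faster
                     rates for pure-exploration objectives.
    """
    n = np.maximum(n_visits, 1)  # avoid division by zero for unvisited pairs
    if mode == "sqrt":
        return scale / np.sqrt(n)
    return scale / n

# Hypothetical usage on toy data: optimistic reward estimates under each scaling.
n_states, n_actions = 5, 3
counts = np.random.randint(0, 50, size=(n_states, n_actions))
empirical_reward = np.random.rand(n_states, n_actions)

optimistic_sqrt = empirical_reward + exploration_bonus(counts, mode="sqrt")
optimistic_lin = empirical_reward + exploration_bonus(counts, mode="linear")
```

The only difference between the two variants is how fast the bonus shrinks with the visit count: the $1/n$ bonus vanishes much more quickly, which is the behavior the abstract connects to improved dependence on the horizon $H$.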