Paper Title


BeBold: Exploration Beyond the Boundary of Explored Regions

Authors

Tianjun Zhang, Huazhe Xu, Xiaolong Wang, Yi Wu, Kurt Keutzer, Joseph E. Gonzalez, Yuandong Tian

Abstract


Efficient exploration under sparse rewards remains a key challenge in deep reinforcement learning. To guide exploration, previous work makes extensive use of intrinsic reward (IR). There are many heuristics for IR, including visitation counts, curiosity, and state-difference. In this paper, we analyze the pros and cons of each method and propose the regulated difference of inverse visitation counts as a simple but effective criterion for IR. The criterion helps the agent explore Beyond the Boundary of explored regions and mitigates common issues in count-based methods, such as short-sightedness and detachment. The resulting method, BeBold, solves the 12 most challenging procedurally-generated tasks in MiniGrid with just 120M environment steps, without any curriculum learning. In comparison, the previous SoTA only solves 50% of the tasks. BeBold also achieves SoTA on multiple tasks in NetHack, a popular rogue-like game that contains more challenging procedurally-generated environments.
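The abstract's criterion, the regulated difference of inverse visitation counts, can be illustrated with a minimal tabular sketch. The reward for a transition s → s' is max(1/N(s') − 1/N(s), 0), and an episodic restriction grants it only on the first visit to s' within an episode. This is a simplified illustration, not the paper's implementation: the function name `intrinsic_reward`, the dictionary-based counts, and the demo states are all assumptions, and the full method approximates counts with random network distillation rather than a table.

```python
def intrinsic_reward(N, s, s_next, visited_this_episode):
    """Hedged sketch of BeBold-style IR (tabular assumption).

    N: dict mapping state -> lifelong visitation count (already updated).
    visited_this_episode: set of states rewarded so far this episode.
    """
    # Regulated difference of inverse visitation counts: only transitions
    # toward less-visited states (the frontier) yield positive reward.
    r = max(1.0 / N[s_next] - 1.0 / N[s], 0.0)
    # Episodic restriction: reward a state at most once per episode,
    # which mitigates short-sightedness and detachment.
    if s_next in visited_this_episode:
        return 0.0
    visited_this_episode.add(s_next)
    return r


# Demo: moving from a well-visited state to a rarely visited one pays off;
# moving back toward explored territory is regulated to zero.
N = {"A": 5, "B": 1}
visited = set()
print(intrinsic_reward(N, "A", "B", visited))  # 1/1 - 1/5 = 0.8
print(intrinsic_reward(N, "A", "B", visited))  # 0.0 (already rewarded this episode)
print(intrinsic_reward(N, "B", "A", visited))  # 0.0 (negative difference is clipped)
```

The asymmetric clipping is what pushes the agent Beyond the Boundary: reward accrues only when crossing from frequently visited states into rarely visited ones, not on the return trip.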
