Never Give Up: Learning Directed Exploration Strategies

Paper title

Never Give Up: Learning Directed Exploration Strategies

Authors

Adrià Puigdomènech Badia, Pablo Sprechmann, Alex Vitvitskyi, Daniel Guo, Bilal Piot, Steven Kapturowski, Olivier Tieleman, Martín Arjovsky, Alexander Pritzel, Andrew Bolt, Charles Blundell

Abstract

We propose a reinforcement learning agent to solve hard exploration games by learning a range of directed exploratory policies. We construct an episodic memory-based intrinsic reward using k-nearest neighbors over the agent's recent experience to train the directed exploratory policies, thereby encouraging the agent to repeatedly revisit all states in its environment. A self-supervised inverse dynamics model is used to train the embeddings of the nearest neighbor lookup, biasing the novelty signal towards what the agent can control. We employ the framework of Universal Value Function Approximators (UVFA) to simultaneously learn many directed exploration policies with the same neural network, with different trade-offs between exploration and exploitation. By using the same neural network for different degrees of exploration/exploitation, transfer is demonstrated from predominantly exploratory policies, yielding effective exploitative policies. The proposed method can be incorporated into modern distributed RL agents that collect large amounts of experience from many actors running in parallel on separate environment instances. Our method doubles the performance of the base agent in all hard exploration games in the Atari-57 suite while maintaining a very high score across the remaining games, obtaining a median human normalised score of 1344.0%. Notably, the proposed method is the first algorithm to achieve non-zero rewards (with a mean score of 8,400) in the game of Pitfall! without using demonstrations or hand-crafted features.
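
The episodic, k-nearest-neighbor novelty bonus described in the abstract can be illustrated with a small sketch. This is a simplified illustration, not the authors' implementation: the class name `EpisodicNovelty`, the helper `mixed_reward`, the inverse kernel, and the constants `k`, `eps`, `c`, and `beta` are all assumptions chosen for readability, and the embeddings are taken as given rather than learned by an inverse dynamics model.

```python
import numpy as np


class EpisodicNovelty:
    """Hypothetical helper: k-NN novelty bonus over one episode's embeddings."""

    def __init__(self, k=10, eps=1e-3, c=1.0):
        self.k = k        # number of nearest neighbours to compare against
        self.eps = eps    # kernel width (assumed value)
        self.c = c        # keeps the bonus bounded in (0, 1] (assumed value)
        self.memory = []  # embeddings stored so far in the current episode

    def reset(self):
        """Clear the memory at the start of a new episode."""
        self.memory = []

    def reward(self, embedding):
        """Return a bonus that shrinks as similar embeddings accumulate."""
        embedding = np.asarray(embedding, dtype=np.float64)
        similarity = 0.0
        if self.memory:
            mem = np.stack(self.memory)
            sq_dists = np.sum((mem - embedding) ** 2, axis=1)
            nearest = np.sort(sq_dists)[: self.k]
            # Inverse kernel: close to 1 for near-duplicates, close to 0 far away.
            similarity = float(np.sum(self.eps / (nearest + self.eps)))
        self.memory.append(embedding)
        return float(1.0 / np.sqrt(similarity + self.c))


def mixed_reward(extrinsic, intrinsic, beta):
    """Assumed trade-off: beta = 0 is pure exploitation, larger beta explores more."""
    return extrinsic + beta * intrinsic


novelty = EpisodicNovelty(k=5)
first = novelty.reward([0.0, 0.0])    # nothing in memory yet
repeat = novelty.reward([0.0, 0.0])   # same embedding seen again
far = novelty.reward([10.0, 10.0])    # embedding far from everything stored
```

In this sketch a repeated embedding earns a smaller bonus than a distant one, and conditioning a single network on several values of `beta` corresponds to the family of policies with different exploration/exploitation trade-offs that the abstract mentions.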
