Paper Title
Multi-Stage Episodic Control for Strategic Exploration in Text Games
Paper Authors
Paper Abstract
Text adventure games present unique challenges to reinforcement learning methods due to their combinatorially large action spaces and sparse rewards. The interplay of these two factors is particularly demanding because large action spaces require extensive exploration, while sparse rewards provide limited feedback. This work proposes to tackle the explore-vs-exploit dilemma using a multi-stage approach that explicitly disentangles these two strategies within each episode. Our algorithm, called eXploit-Then-eXplore (XTX), begins each episode using an exploitation policy that imitates a set of promising trajectories from the past, and then switches over to an exploration policy aimed at discovering novel actions that lead to unseen state spaces. This policy decomposition allows us to combine global decisions about which parts of the game space to return to with curiosity-based local exploration in that space, motivated by how a human may approach these games. Our method significantly outperforms prior approaches by 27% and 11% average normalized score over 12 games from the Jericho benchmark (Hausknecht et al., 2020) in both deterministic and stochastic settings, respectively. On the game of Zork1, in particular, XTX obtains a score of 103, more than a 2x improvement over prior methods, and pushes past several known bottlenecks in the game that have plagued previous state-of-the-art methods.
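The abstract's per-episode structure (exploit a promising past trajectory, then switch to curiosity-driven exploration) can be sketched as follows. This is a minimal, illustrative sketch only, not the authors' implementation: the environment class, function names, and the visit-count novelty heuristic are all assumptions standing in for XTX's learned policies.

```python
class ToyTextEnv:
    """Hypothetical stand-in for a text game: reward 1 for 'north' at step 3."""
    def reset(self):
        self.t = 0
        return "start"

    def valid_actions(self, state):
        return ["north", "south", "look"]

    def step(self, action):
        self.t += 1
        reward = 1 if (self.t == 3 and action == "north") else 0
        done = self.t >= 3
        return f"state{self.t}", reward, done


def exploit_action(trajectory, step):
    """Stage 1: imitate a stored promising trajectory while it lasts."""
    return trajectory[step]


def explore_action(valid_actions, visit_counts):
    """Stage 2: crude curiosity proxy; prefer the least-tried action."""
    return min(valid_actions, key=lambda a: visit_counts.get(a, 0))


def run_episode(env, promising_trajectory, visit_counts):
    """One episode: exploit the replayed trajectory, then explore."""
    state = env.reset()
    total_reward, step, done = 0, 0, False
    while not done:
        if step < len(promising_trajectory):
            action = exploit_action(promising_trajectory, step)
        else:
            action = explore_action(env.valid_actions(state), visit_counts)
        visit_counts[action] = visit_counts.get(action, 0) + 1
        state, reward, done = env.step(action)
        total_reward += reward
        step += 1
    return total_reward
```

In this toy run, replaying a two-step trajectory and then exploring lets the agent pick the rarely-tried `north` action exactly when it pays off, which mirrors the paper's intuition of returning to a promising frontier before exploring locally.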