Paper Title
Environment Generation for Zero-Shot Compositional Reinforcement Learning
Paper Authors
Paper Abstract
Many real-world problems are compositional: solving them requires completing interdependent sub-tasks, either in series or in parallel, that can be represented as a dependency graph. Deep reinforcement learning (RL) agents often struggle to learn such complex tasks due to the long time horizons and sparse rewards. To address this problem, we present Compositional Design of Environments (CoDE), which trains a Generator agent to automatically build a series of compositional tasks tailored to the RL agent's current skill level. This automatic curriculum not only enables the agent to learn more complex tasks than it otherwise could, but also selects tasks where the agent's performance is weak, enhancing its robustness and its ability to generalize zero-shot to unseen tasks at test time. We analyze why current environment generation techniques are insufficient for the problem of generating compositional tasks, and propose a new algorithm that addresses these issues. Our results assess learning and generalization across multiple compositional tasks, including the real-world problem of learning to navigate and interact with web pages. We learn to generate environments composed of multiple pages or rooms, and train RL agents capable of completing a wide range of complex tasks in those environments. We contribute two new benchmark frameworks for generating compositional tasks: compositional MiniGrid and gMiniWoB for web navigation. CoDE yields a 4x higher success rate than the strongest baseline, and demonstrates strong performance on real websites after learning from 3500 primitive tasks.
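Since the abstract describes compositional tasks as dependency graphs of sub-tasks and a Generator that tailors task difficulty to the learner's current skill, the minimal Python sketch below illustrates those two ideas in toy form. It is not the paper's CoDE algorithm: the class names (`SubTask`, `CompositionalTask`, `ToyGenerator`), the chained series composition, and the success-rate-based difficulty update are illustrative assumptions only.

```python
# Minimal illustrative sketch (not the paper's CoDE implementation):
# a compositional task is a dependency graph of sub-tasks, and a toy
# "generator" grows or shrinks task size based on the learner's
# recent success rate. All names and thresholds are hypothetical.

import random
from dataclasses import dataclass, field


@dataclass
class SubTask:
    name: str
    depends_on: list = field(default_factory=list)  # prerequisite sub-task names


@dataclass
class CompositionalTask:
    subtasks: dict  # name -> SubTask

    def execution_order(self):
        """Topologically sort sub-tasks so prerequisites come first."""
        order, seen = [], set()

        def visit(name):
            if name in seen:
                return
            seen.add(name)
            for dep in self.subtasks[name].depends_on:
                visit(dep)
            order.append(name)

        for name in self.subtasks:
            visit(name)
        return order


class ToyGenerator:
    """Proposes larger tasks when the learner succeeds, smaller ones when it fails."""

    def __init__(self, primitives, min_size=1):
        self.primitives = primitives
        self.size = min_size
        self.min_size = min_size

    def propose_task(self):
        chosen = random.sample(self.primitives, k=min(self.size, len(self.primitives)))
        # Chain sub-tasks so each depends on the previous one (series composition).
        subtasks, prev = {}, None
        for name in chosen:
            subtasks[name] = SubTask(name, depends_on=[prev] if prev else [])
            prev = name
        return CompositionalTask(subtasks)

    def update(self, success_rate):
        # Keep tasks near the learner's frontier: harder when it is doing well,
        # easier when it is failing badly.
        if success_rate > 0.8:
            self.size += 1
        elif success_rate < 0.3:
            self.size = max(self.min_size, self.size - 1)


if __name__ == "__main__":
    gen = ToyGenerator(["open_page", "fill_form", "click_submit", "navigate_back"])
    for step in range(5):
        task = gen.propose_task()
        # Stand-in for training/evaluating the RL agent on the generated task.
        success_rate = random.random()
        gen.update(success_rate)
        print(step, task.execution_order(), f"success={success_rate:.2f}", f"next_size={gen.size}")
```

In this sketch the random `success_rate` stands in for actually rolling out the RL agent on the generated environment; the point is only to show how a generator-driven curriculum can keep proposed compositional tasks close to the learner's current ability.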