Paper Title

COG: Connecting New Skills to Past Experience with Offline Reinforcement Learning

Authors

Avi Singh, Albert Yu, Jonathan Yang, Jesse Zhang, Aviral Kumar, Sergey Levine

Abstract

Reinforcement learning has been applied to a wide variety of robotics problems, but most of such applications involve collecting data from scratch for each new task. Since the amount of robot data we can collect for any single task is limited by time and cost considerations, the learned behavior is typically narrow: the policy can only execute the task in a handful of scenarios that it was trained on. What if there was a way to incorporate a large amount of prior data, either from previously solved tasks or from unsupervised or undirected environment interaction, to extend and generalize learned behaviors? While most prior work on extending robotic skills using pre-collected data focuses on building explicit hierarchies or skill decompositions, we show in this paper that we can reuse prior data to extend new skills simply through dynamic programming. We show that even when the prior data does not actually succeed at solving the new task, it can still be utilized for learning a better policy, by providing the agent with a broader understanding of the mechanics of its environment. We demonstrate the effectiveness of our approach by chaining together several behaviors seen in prior datasets for solving a new task, with our hardest experimental setting involving composing four robotic skills in a row: picking, placing, drawer opening, and grasping, where a +1/0 sparse reward is provided only on task completion. We train our policies in an end-to-end fashion, mapping high-dimensional image observations to low-level robot control commands, and present results in both simulated and real world domains. Additional materials and source code can be found on our project website: https://sites.google.com/view/cog-rl
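The abstract's central claim, that prior data can extend new skills "simply through dynamic programming", amounts to running an off-policy Bellman backup over the union of the prior dataset and the new-task dataset, so that the sparse +1/0 reward earned on the new task propagates backward through behaviors that appear only in the prior data. The sketch below illustrates this with an offline Q-learning update in the style of conservative Q-learning (CQL), the family of offline RL methods COG builds on. It is a minimal illustration under assumed conditions: the discrete action space, network sizes, hyperparameter values, and names such as offline_q_update and CQL_ALPHA are stand-ins for brevity, not the authors' implementation, which learns continuous control end-to-end from images.

```python
# Minimal sketch: offline Q-learning over a merged buffer of prior-task
# and new-task transitions. Shapes, names, and hyperparameters here are
# illustrative assumptions, not the paper's actual implementation.
import copy
import torch
import torch.nn.functional as F

OBS_DIM, N_ACTIONS = 32, 8    # stand-ins for image features / discretized actions
GAMMA, CQL_ALPHA = 0.99, 1.0  # discount and conservatism weight (assumed values)

def make_q_net():
    return torch.nn.Sequential(
        torch.nn.Linear(OBS_DIM, 256), torch.nn.ReLU(),
        torch.nn.Linear(256, N_ACTIONS))

q_net = make_q_net()
target_q = copy.deepcopy(q_net)  # slow-moving target for the Bellman backup
opt = torch.optim.Adam(q_net.parameters(), lr=3e-4)

def offline_q_update(obs, act, rew, next_obs, done):
    """One update on a batch sampled from prior data *and* new-task data.

    Prior-data transitions are relabeled with reward 0 for the new task;
    only transitions that complete the new task carry +1 (the sparse
    reward described in the abstract). Dynamic programming then
    propagates value backward through the prior data, chaining old
    behaviors into the new skill."""
    with torch.no_grad():
        target = rew + GAMMA * (1.0 - done) * target_q(next_obs).max(-1).values
    q_all = q_net(obs)
    q_taken = q_all.gather(-1, act.unsqueeze(-1)).squeeze(-1)
    bellman_loss = F.mse_loss(q_taken, target)
    # CQL-style regularizer: push down Q-values on out-of-dataset actions
    # so the learned policy stays within the support of the offline data.
    conservatism = (torch.logsumexp(q_all, dim=-1) - q_taken).mean()
    loss = bellman_loss + CQL_ALPHA * conservatism
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# Illustrative call with random stand-in data for one merged batch:
batch_size = 64
loss = offline_q_update(
    torch.randn(batch_size, OBS_DIM),
    torch.randint(0, N_ACTIONS, (batch_size,)),
    torch.zeros(batch_size),          # mostly 0 reward; +1 only on success
    torch.randn(batch_size, OBS_DIM),
    torch.zeros(batch_size))
```

In this sketch, value flows from new-task successes back through prior-data transitions only because both datasets are sampled from a single merged buffer; that one design choice is what lets dynamic programming stitch old skills (picking, placing, drawer opening, grasping) to the new task without any explicit hierarchy or skill decomposition.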
