Paper Title
Self-Supervised Policy Adaptation during Deployment
Paper Authors
Paper Abstract
In most real-world scenarios, a policy trained by reinforcement learning in one environment needs to be deployed in another, potentially quite different, environment. However, generalization across different environments is known to be hard. A natural solution would be to keep training after deployment in the new environment, but this cannot be done if the new environment offers no reward signal. Our work explores the use of self-supervision to allow the policy to continue training after deployment without using any rewards. While previous methods explicitly anticipate changes in the new environment, we assume no prior knowledge of those changes yet still obtain significant improvements. Empirical evaluations are performed on diverse simulation environments from the DeepMind Control Suite and ViZDoom, as well as on real robotic manipulation tasks in continuously changing environments, with observations taken from an uncalibrated camera. Our method improves generalization in 31 out of 36 environments across various tasks and outperforms domain randomization in a majority of environments.
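At a high level, the idea is to train a self-supervised auxiliary head alongside the policy and, at deployment time, keep updating the shared representation from that head's loss alone, since it requires no reward signal. Below is a minimal PyTorch sketch of such a reward-free adaptation step. The module names (Encoder, InverseDynamicsHead, adapt_step), the choice of inverse dynamics prediction as the self-supervised task, and all hyperparameters are illustrative assumptions, not the paper's exact architecture or recipe.

```python
# Sketch of reward-free test-time adaptation: a self-supervised head
# (here, inverse dynamics) drives encoder updates during deployment.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Shared image encoder used by both the policy and the auxiliary head."""
    def __init__(self, obs_channels=3, feat_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(obs_channels, 32, 3, stride=2), nn.ReLU(),
            nn.Conv2d(32, 32, 3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),  # input-size agnostic
            nn.Linear(32, feat_dim),
        )

    def forward(self, obs):
        return self.net(obs)

class InverseDynamicsHead(nn.Module):
    """Predicts the action taken between two consecutive observations."""
    def __init__(self, feat_dim=64, action_dim=6):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * feat_dim, 128), nn.ReLU(),
            nn.Linear(128, action_dim),
        )

    def forward(self, z_t, z_t1):
        return self.net(torch.cat([z_t, z_t1], dim=-1))

def adapt_step(encoder, ss_head, optimizer, obs_t, obs_t1, action_t):
    """One self-supervised update during deployment: no reward needed.
    The policy head (not shown) is left untouched; only the shared
    encoder and the auxiliary head receive gradients."""
    z_t, z_t1 = encoder(obs_t), encoder(obs_t1)
    loss = nn.functional.mse_loss(ss_head(z_t, z_t1), action_t)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Hypothetical usage with dummy data standing in for deployment streams.
encoder, head = Encoder(), InverseDynamicsHead()
opt = torch.optim.SGD(list(encoder.parameters()) + list(head.parameters()), lr=1e-3)
obs_t, obs_t1 = torch.randn(8, 3, 64, 64), torch.randn(8, 3, 64, 64)
action_t = torch.randn(8, 6)  # actions the deployed policy actually took
adapt_step(encoder, head, opt, obs_t, obs_t1, action_t)
```

One natural design choice, and the one sketched here, is to freeze the policy head and adapt only the shared encoder, so the action mapping learned with rewards is preserved while the representation tracks the changed observations.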