Paper Title

Understanding Hindsight Goal Relabeling from a Divergence Minimization Perspective

Authors

Lunjun Zhang, Bradly C. Stadie

Abstract

Hindsight goal relabeling has become a foundational technique in multi-goal reinforcement learning (RL). The essential idea is that any trajectory can be seen as a sub-optimal demonstration for reaching its final state. Intuitively, learning from those arbitrary demonstrations can be seen as a form of imitation learning (IL). However, the connection between hindsight goal relabeling and imitation learning is not well understood. In this paper, we propose a novel framework to understand hindsight goal relabeling from a divergence minimization perspective. Recasting the goal reaching problem in the IL framework not only allows us to derive several existing methods from first principles, but also provides us with the tools from IL to improve goal reaching algorithms. Experimentally, we find that under hindsight relabeling, Q-learning outperforms behavioral cloning (BC). Yet, a vanilla combination of both hurts performance. Concretely, we see that the BC loss only helps when selectively applied to actions that get the agent closer to the goal according to the Q-function. Our framework also explains the puzzling phenomenon wherein a reward of (-1, 0) results in significantly better performance than a (0, 1) reward for goal reaching.
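To make the two ideas in the abstract concrete — relabeling a trajectory with its final state as the goal, and applying a BC loss only to actions the Q-function judges favorably — here is a minimal PyTorch sketch. It is not the authors' implementation: the discrete action space, the network shapes, the `Agent` and `losses` helpers, and the mean-Q filter (a crude stand-in for "gets the agent closer to the goal according to the Q-function") are all assumptions made for illustration. The sparse (-1, 0) reward follows the convention mentioned in the abstract.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def mlp(in_dim, out_dim, hidden=256):
    # Small goal-conditioned network; sizes are illustrative.
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                         nn.Linear(hidden, out_dim))


class Agent:
    def __init__(self, obs_dim, goal_dim, n_actions):
        self.q = mlp(obs_dim + goal_dim, n_actions)   # Q(s, g, .)
        self.pi = mlp(obs_dim + goal_dim, n_actions)  # policy logits


def hindsight_relabel(trajectory):
    """Treat a trajectory as a demonstration of reaching its final state:
    relabel each transition's goal with that state and assign the sparse
    (-1, 0) reward (0 only when the goal is achieved)."""
    goal = trajectory[-1]["next_obs"]
    relabeled = []
    for tr in trajectory:
        reached = torch.equal(tr["next_obs"], goal)
        relabeled.append({**tr, "goal": goal,
                          "reward": 0.0 if reached else -1.0,
                          "done": float(reached)})
    return relabeled


def losses(agent, batch, gamma=0.98):
    sg = torch.cat([batch["obs"], batch["goal"]], dim=-1)
    nsg = torch.cat([batch["next_obs"], batch["goal"]], dim=-1)
    a = batch["action"]  # (B,) long

    # Standard Q-learning on relabeled transitions.
    q_sa = agent.q(sg).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        target = batch["reward"] + gamma * (1.0 - batch["done"]) * agent.q(nsg).max(1).values
    q_loss = F.mse_loss(q_sa, target)

    # Selective BC: imitate the relabeled action only where the Q-function
    # rates it at least as well as the average action in that state.
    with torch.no_grad():
        q_all = agent.q(sg)
        mask = (q_all.gather(1, a.unsqueeze(1)).squeeze(1) >= q_all.mean(1)).float()
    bc = F.cross_entropy(agent.pi(sg), a, reduction="none")
    bc_loss = (mask * bc).sum() / mask.sum().clamp(min=1.0)

    return q_loss, bc_loss
```

In this sketch the Q-learning loss and the filtered BC loss would be summed and optimized jointly; the filter is what separates it from the vanilla Q-learning + BC combination, which the abstract reports hurts performance.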
