Paper Title
Planning to Practice: Efficient Online Fine-Tuning by Composing Goals in Latent Space
Paper Authors
Paper Abstract
General-purpose robots require diverse repertoires of behaviors to complete challenging tasks in real-world unstructured environments. To address this issue, goal-conditioned reinforcement learning aims to acquire policies that can reach configurable goals for a wide range of tasks on command. However, such goal-conditioned policies are notoriously difficult and time-consuming to train from scratch. In this paper, we propose Planning to Practice (PTP), a method that makes it practical to train goal-conditioned policies for long-horizon tasks that require multiple distinct types of interactions to solve. Our approach is based on two key ideas. First, we decompose the goal-reaching problem hierarchically, with a high-level planner that sets intermediate subgoals using conditional subgoal generators in the latent space for a low-level model-free policy. Second, we propose a hybrid approach that first pre-trains both the conditional subgoal generator and the policy on previously collected data through offline reinforcement learning, and then fine-tunes the policy via online exploration. This fine-tuning process is itself facilitated by the planned subgoals, which break down the original target task into short-horizon goal-reaching tasks that are significantly easier to learn. We conduct experiments in both simulation and the real world, in which the policy is pre-trained on demonstrations of short primitive behaviors and fine-tuned for temporally extended tasks that are unseen in the offline data. Our experimental results show that PTP can generate feasible sequences of subgoals that enable the policy to efficiently solve the target tasks.
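To make the two-level structure described in the abstract concrete, here is a minimal sketch of how a conditional subgoal generator and a low-level goal-conditioned policy might fit together. This is not the authors' implementation: all names (`subgoal_generator`, `plan_subgoals`, `rollout`, `LATENT_DIM`) are hypothetical, a noisy latent-space interpolation stands in for the learned conditional generative model, and a toy contraction stands in for the learned model-free policy.

```python
"""Illustrative sketch of PTP-style planning and execution (assumed names)."""
import numpy as np

rng = np.random.default_rng(0)
LATENT_DIM = 8  # dimensionality of the assumed latent state space

def subgoal_generator(z, z_goal):
    # Stand-in for a learned conditional generative model: propose a
    # nearby latent subgoal that moves partway toward the final goal.
    return z + 0.25 * (z_goal - z) + 0.05 * rng.standard_normal(LATENT_DIM)

def plan_subgoals(z_start, z_goal, num_subgoals=3):
    # High-level planner: chain the generator so each subgoal is
    # conditioned on the previous one and the final goal, decomposing
    # one long-horizon task into short-horizon goal-reaching tasks.
    subgoals, z = [], z_start
    for _ in range(num_subgoals):
        z = subgoal_generator(z, z_goal)
        subgoals.append(z)
    return subgoals

def rollout(z, z_subgoal, steps=10):
    # Stand-in for the low-level goal-conditioned policy: pursue the
    # commanded subgoal for a fixed step budget. In the actual method,
    # online fine-tuning would update the policy from these attempts.
    for _ in range(steps):
        z = z + 0.3 * (z_subgoal - z)
    return z

z, z_goal = rng.standard_normal(LATENT_DIM), rng.standard_normal(LATENT_DIM)
for sg in plan_subgoals(z, z_goal) + [z_goal]:
    z = rollout(z, sg)
print("final latent distance to goal:", float(np.linalg.norm(z - z_goal)))
```

In the method the abstract describes, both the subgoal generator and the policy would first be pre-trained on offline data, and only the short-horizon pursuit of each planned subgoal would be fine-tuned online, which is what makes the exploration problem tractable.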