How Far I'll Go: Offline Goal-Conditioned Reinforcement Learning via $f$-Advantage Regression

Paper Title

How Far I'll Go: Offline Goal-Conditioned Reinforcement Learning via $f$-Advantage Regression

Authors

Yecheng Jason Ma, Jason Yan, Dinesh Jayaraman, Osbert Bastani

Abstract

Offline goal-conditioned reinforcement learning (GCRL) promises general-purpose skill learning in the form of reaching diverse goals from purely offline datasets. We propose $\textbf{Go}$al-conditioned $f$-$\textbf{A}$dvantage $\textbf{R}$egression (GoFAR), a novel regression-based offline GCRL algorithm derived from a state-occupancy matching perspective; the key intuition is that the goal-reaching task can be formulated as a state-occupancy matching problem between a dynamics-abiding imitator agent and an expert agent that directly teleports to the goal. In contrast to prior approaches, GoFAR does not require any hindsight relabeling and enjoys uninterleaved optimization for its value and policy networks. These distinct features give GoFAR much better offline performance and stability, as well as a statistical performance guarantee that is unattainable for prior methods. Furthermore, we demonstrate that GoFAR's training objectives can be repurposed to learn an agent-independent goal-conditioned planner from purely offline source-domain data, which enables zero-shot transfer to new target domains. Through extensive experiments, we validate GoFAR's effectiveness in various problem settings and tasks, significantly outperforming the prior state of the art. Notably, on a real robotic dexterous manipulation task, while no other method makes meaningful progress, GoFAR acquires complex manipulation behavior that successfully accomplishes diverse goals.
