Paper title

Bilinear value networks

Paper authors

Zhang-Wei Hong, Ge Yang, Pulkit Agrawal

Paper abstract

The dominant framework for off-policy multi-goal reinforcement learning involves estimating a goal-conditioned Q-value function. When learning to achieve multiple goals, data efficiency is intimately connected with the generalization of the Q-function to new goals. The de facto paradigm is to approximate Q(s, a, g) using monolithic neural networks. To improve the generalization of the Q-function, we propose a bilinear decomposition that represents the Q-value via a low-rank approximation in the form of a dot product between two vector fields. The first vector field, f(s, a), captures the environment's local dynamics at the state s; whereas the second component, ϕ(s, g), captures the global relationship between the current state and the goal. We show that our bilinear decomposition scheme substantially improves data efficiency and has superior transfer to out-of-distribution goals compared to prior methods. Empirical evidence is provided on the simulated Fetch robot task suite and dexterous manipulation with a Shadow hand.
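To make the decomposition concrete, the sketch below shows one way Q(s, a, g) could be factored into two branches, f(s, a) and ϕ(s, g), whose dot product gives the Q-value. This is a minimal PyTorch sketch, not the authors' implementation; the class name BilinearQNetwork, the MLP sizes, the embedding dimension, and the Fetch-like input dimensions in the usage example are all illustrative assumptions.

```python
# Minimal sketch (assumed, not the authors' code) of the bilinear Q-value
# decomposition Q(s, a, g) ≈ f(s, a) · ϕ(s, g) described in the abstract.
import torch
import torch.nn as nn


def mlp(in_dim, out_dim, hidden=256):
    # Small fully connected network; sizes are illustrative assumptions.
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, out_dim),
    )


class BilinearQNetwork(nn.Module):
    """Q(s, a, g) = f(s, a)^T ϕ(s, g): a rank-d factorization of the Q-value."""

    def __init__(self, state_dim, action_dim, goal_dim, embed_dim=16):
        super().__init__()
        # f(s, a): branch capturing local dynamics at state s.
        self.f = mlp(state_dim + action_dim, embed_dim)
        # ϕ(s, g): branch capturing the global state-goal relationship.
        self.phi = mlp(state_dim + goal_dim, embed_dim)

    def forward(self, state, action, goal):
        f_sa = self.f(torch.cat([state, action], dim=-1))    # (B, d)
        phi_sg = self.phi(torch.cat([state, goal], dim=-1))  # (B, d)
        # Dot product over the embedding dimension gives the scalar Q-value.
        return (f_sa * phi_sg).sum(dim=-1, keepdim=True)     # (B, 1)


# Usage with hypothetical Fetch-like dimensions.
q_net = BilinearQNetwork(state_dim=25, action_dim=4, goal_dim=3)
s, a, g = torch.randn(8, 25), torch.randn(8, 4), torch.randn(8, 3)
q_values = q_net(s, a, g)  # shape: (8, 1)
```

In this sketch the goal enters only through ϕ(s, g), so the f(s, a) features are shared across goals, which matches the intuition in the abstract that the decomposition helps the Q-function generalize to new goals.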
