Data-efficient visuomotor policy training using reinforcement learning and generative models

Paper title

Data-efficient visuomotor policy training using reinforcement learning and generative models

Authors

Ali Ghadirzadeh, Petra Poklukar, Ville Kyrki, Danica Kragic, Mårten Björkman

Abstract

We present a data-efficient framework for solving visuomotor sequential decision-making problems which exploits the combination of reinforcement learning (RL) and latent variable generative models. Our framework trains deep visuomotor policies by introducing an action latent variable such that the feed-forward policy search can be divided into three parts: (i) training a sub-policy that outputs a distribution over the action latent variable given a state of the system, (ii) unsupervised training of a generative model that outputs a sequence of motor actions conditioned on the latent action variable, and (iii) supervised training of the deep visuomotor policy in an end-to-end fashion. Our approach enables safe exploration and alleviates the data-inefficiency problem as it exploits prior knowledge about valid sequences of motor actions. Moreover, we provide a set of measures for evaluation of generative models such that we are able to predict the performance of the RL policy training prior to the actual training on a physical robot. We define two novel measures of disentanglement and local linearity for assessing the quality of latent representations, and complement them with existing measures for assessment of the learned distribution. We experimentally determine the characteristics of different generative models that have the most influence on performance of the final policy training on a robotic picking task.
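
The decomposition described in the abstract can be illustrated with a toy sketch: a sub-policy produces a distribution over an action latent variable, and a generative model decodes a sampled latent into a sequence of motor actions. Everything in the block is an assumption made for illustration and is not taken from the paper: the names `sub_policy`, `decode`, and `visuomotor_policy`, the latent size, the horizon, and the closed-form functions standing in for trained deep networks. The supervised end-to-end training step (iii) is not shown.

```python
import math
import random

LATENT_DIM = 2   # size of the action latent variable (assumed for illustration)
HORIZON = 5      # number of motor actions in one decoded sequence (assumed)


def sub_policy(state):
    """Hypothetical sub-policy: map a state to a Gaussian over the latent.

    A real sub-policy would be a trained network; this is a fixed formula.
    """
    mean = [math.tanh(sum(state)), math.tanh(state[0] - state[-1])]
    std = [0.1] * LATENT_DIM
    return mean, std


def decode(z):
    """Hypothetical stand-in for the decoder of a trained generative model.

    Maps a latent vector to a sequence of HORIZON scalar motor actions.
    """
    return [
        z[0] * math.sin(math.pi * (t + 1) / HORIZON) + z[1] * (t + 1) / HORIZON
        for t in range(HORIZON)
    ]


def visuomotor_policy(state, rng):
    """Compose the parts: state -> latent distribution -> action sequence."""
    mean, std = sub_policy(state)
    z = [rng.gauss(m, s) for m, s in zip(mean, std)]  # sample the latent
    return decode(z)


rng = random.Random(0)  # seeded so the sketch is repeatable
actions = visuomotor_policy([0.2, -0.1, 0.4], rng)
print(len(actions))
```

In this sketch exploration happens in the low-dimensional latent space rather than directly over motor commands, so every sampled latent is decoded into a full action sequence by the same generative model. That is one way to read the abstract's claim about exploiting prior knowledge of valid action sequences.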
