Paper Title

How does the structure embedded in learning policy affect learning quadruped locomotion?

Paper Authors

Kuangen Zhang, Jongwoo Lee, Zhimin Hou, Clarence W. de Silva, Chenglong Fu, Neville Hogan

Paper Abstract

Reinforcement learning (RL) is a popular data-driven method that has demonstrated great success in robotics. Previous work usually focuses on learning an end-to-end (direct) policy that outputs joint torques directly. While the direct policy seems convenient, the resulting performance may not meet expectations. To improve it, more sophisticated reward functions or more structured policies can be used. This paper focuses on the latter, because a structured policy is more intuitive and can inherit insights from earlier model-based controllers. Unsurprisingly, structure, such as a better choice of action space and constraints on the motion trajectory, may benefit the training process and the final performance of the policy at the cost of generality, but the quantitative effect is still unclear. To analyze the effect of structure quantitatively, this paper investigates three policies with different levels of structure for learning quadruped locomotion: a direct policy, a structured policy, and a highly structured policy. The structured policy is trained to learn a task-space impedance controller, while the highly structured policy learns a controller tailored for trot running, adopted from previous work. To evaluate the trained policies, we design a simulation experiment that tracks different desired velocities under force disturbances. Simulation results show that the structured policy and the highly structured policy require 1/3 and 3/4 fewer training steps, respectively, than the direct policy to achieve a similar cumulative reward, and they appear more robust and efficient than the direct policy. We highlight that the structure embedded in a policy significantly affects the overall performance when learning a complicated task that involves complex dynamics, such as legged locomotion.
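
The abstract states that the structured policy learns a task-space impedance controller, but gives no implementation details. Below is a minimal sketch of how such an action interface is commonly realized, assuming the policy outputs a desired foot position per leg and the controller maps it to joint torques. All function names, shapes, and gain values here are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def task_space_impedance_torque(J, x, x_dot, x_des, Kp, Kd):
    # Virtual spring-damper force at the foot: f = Kp (x_des - x) - Kd x_dot
    f = Kp @ (x_des - x) - Kd @ x_dot
    # Map the task-space force to joint torques through the leg Jacobian
    return J.T @ f

# Hypothetical use for one leg: the policy outputs a desired foot position
# (and, optionally, the gains); this runs at every control step.
J = np.array([[0.2, 0.1], [0.0, 0.3], [0.1, 0.2]])  # (3, 2) foot Jacobian (placeholder values)
x, x_dot = np.zeros(3), np.zeros(3)                  # current foot position and velocity
x_des = np.array([0.05, 0.0, -0.30])                 # policy action: desired foot position
Kp = np.diag([300.0, 300.0, 300.0])                  # stiffness (assumed values)
Kd = np.diag([10.0, 10.0, 10.0])                     # damping (assumed values)
tau = task_space_impedance_torque(J, x, x_dot, x_des, Kp, Kd)  # (2,) joint torques
```

One plausible reason such an action space helps, consistent with the abstract's reported reduction in training steps: the spring-damper structure keeps the resulting torques bounded and physically meaningful, so the policy searches over foot positions rather than raw torques.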
