Paper Title

Forward and inverse reinforcement learning sharing network weights and hyperparameters

Paper Authors

Eiji Uchibe, Kenji Doya

Paper Abstract

This paper proposes a model-free imitation learning method named Entropy-Regularized Imitation Learning (ERIL) that minimizes the reverse Kullback-Leibler (KL) divergence. ERIL combines forward and inverse reinforcement learning (RL) under the framework of an entropy-regularized Markov decision process. An inverse RL step computes the log-ratio between two distributions by evaluating two binary discriminators. The first discriminator distinguishes the states generated by the forward RL step from the expert's states. The second discriminator, which is structured by the theory of entropy regularization, distinguishes the state-action-next-state tuples generated by the learner from those of the expert. One notable feature is that the second discriminator shares hyperparameters with the forward RL step, and these can be used to control the discriminator's ability. The forward RL step minimizes the reverse KL divergence estimated by the inverse RL step. We show that minimizing the reverse KL divergence is equivalent to finding an optimal policy. Our experimental results on MuJoCo-simulated environments and vision-based reaching tasks with a robotic arm show that ERIL is more sample-efficient than the baseline methods. We apply the method to human behaviors in a pole-balancing task and describe how the estimated reward functions show how each subject achieves her goal.
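
To make the two-discriminator structure described in the abstract concrete, the sketch below is an illustrative PyTorch reading, not the authors' implementation: the names `ERILDiscriminators`, `d_state`, `d_tuple`, and `discriminator_loss`, the network sizes, and the simple MLP architectures are all assumptions. It only shows the inverse-RL part: two binary classifiers whose logits approximate the log density ratios that would serve as the reward for the forward RL step.

```python
# Minimal conceptual sketch of ERIL's two-discriminator inverse-RL step
# (hypothetical names and architecture; not the authors' code).
import torch
import torch.nn as nn


def mlp(in_dim, out_dim, hidden=64):
    """Small fully connected network used for both discriminators."""
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                         nn.Linear(hidden, out_dim))


class ERILDiscriminators(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        # First discriminator: learner states vs. expert states.
        self.d_state = mlp(state_dim, 1)
        # Second discriminator: learner (s, a, s') tuples vs. expert tuples.
        # In the paper its form is derived from entropy regularization and it
        # shares hyperparameters with the forward RL step; that structure is
        # omitted in this sketch.
        self.d_tuple = mlp(2 * state_dim + action_dim, 1)

    def logits(self, s, a, s_next):
        """Sigmoid logits for both discriminators; at optimum each logit
        approximates a log density ratio log(expert / learner)."""
        return (self.d_state(s),
                self.d_tuple(torch.cat([s, a, s_next], dim=-1)))


def discriminator_loss(expert_logits, learner_logits):
    """Binary cross-entropy with expert samples labeled 1, learner samples 0."""
    bce = nn.functional.binary_cross_entropy_with_logits
    return (bce(expert_logits, torch.ones_like(expert_logits)) +
            bce(learner_logits, torch.zeros_like(learner_logits)))
```

In this reading, the forward RL step would treat the estimated log-ratios as a reward and update the policy with an entropy-regularized RL algorithm, which is how minimizing the reverse KL divergence drives the learner toward the expert's behavior.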
