Paper Title
Curriculum in Gradient-Based Meta-Reinforcement Learning
Paper Authors
Paper Abstract
Gradient-based meta-learners such as Model-Agnostic Meta-Learning (MAML) have shown strong few-shot performance in supervised and reinforcement learning settings. However, specifically in the case of meta-reinforcement learning (meta-RL), we can show that gradient-based meta-learners are sensitive to task distributions. With the wrong curriculum, agents suffer the effects of meta-overfitting, shallow adaptation, and adaptation instability. In this work, we begin by highlighting intriguing failure cases of gradient-based meta-RL and show that task distributions can wildly affect algorithmic outputs, stability, and performance. To address this problem, we leverage insights from recent literature on domain randomization and propose meta Active Domain Randomization (meta-ADR), which learns a curriculum of tasks for gradient-based meta-RL in a similar manner as ADR does for sim2real transfer. We show that this approach induces more stable policies on a variety of simulated locomotion and navigation tasks. We assess in- and out-of-distribution generalization and find that the learned task distributions, even in an unstructured task space, greatly improve the adaptation performance of MAML. Finally, we motivate the need for better benchmarking in meta-RL that prioritizes generalization over single-task adaptation performance.
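To make the curriculum idea in the abstract concrete, the following is a minimal, illustrative sketch of a task-curriculum loop in Python. It is not the paper's algorithm: the 1-D goal-angle task space, the softmax-bandit task proposer, and the `adaptation_gap` placeholder are all assumptions introduced here, standing in for the learned ADR-style proposer and the MAML inner-loop adaptation described above.

```python
# Minimal sketch of a meta-ADR-style task curriculum (illustrative only).
# Assumptions (not from the paper): tasks are 1-D goal angles for a 2D
# navigation problem, the proposer is a simple softmax bandit over
# discretized task bins, and `adaptation_gap` is a hypothetical stand-in
# for running MAML's inner loop and measuring post- minus pre-adaptation return.

import numpy as np

rng = np.random.default_rng(0)

N_BINS = 16                  # discretization of the goal-angle task space
scores = np.zeros(N_BINS)    # running estimate of how useful each bin is to propose
TEMP, LR = 1.0, 0.1          # softmax temperature and score learning rate


def sample_task():
    """Sample a goal angle, preferring bins where adaptation has been weakest."""
    probs = np.exp(scores / TEMP)
    probs /= probs.sum()
    b = rng.choice(N_BINS, p=probs)
    angle = (b + rng.random()) * 2 * np.pi / N_BINS
    return b, angle


def adaptation_gap(goal_angle):
    """Hypothetical placeholder: would run MAML's inner-loop adaptation on this
    task and return (post-adaptation return - pre-adaptation return). Faked here
    with a noisy function of the angle so the sketch runs end to end."""
    return np.cos(goal_angle) + 0.1 * rng.standard_normal()


for step in range(1000):
    b, angle = sample_task()
    gap = adaptation_gap(angle)
    # Heuristic curriculum signal: favor bins where the meta-learner still
    # adapts poorly (small gap), so hard tasks keep being proposed. This
    # stands in for the learned curriculum reward used by meta-ADR.
    scores[b] += LR * (-gap - scores[b])

probs = np.exp(scores / TEMP)
print("Task-bin sampling preferences:", np.round(probs / probs.sum(), 3))
```

In this toy version, the bandit scores play the role that the learned task proposer plays in meta-ADR: over time, sampling concentrates on regions of task space where adaptation is still weak, rather than drawing tasks uniformly from a fixed distribution.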