Paper Title
Characterizing Policy Divergence for Personalized Meta-Reinforcement Learning
Paper Authors
Paper Abstract
Despite ample motivation from costly exploration and limited trajectory data, rapidly adapting to new environments with few-shot reinforcement learning (RL) remains a challenging task, especially in personalized settings. Here, we consider the problem of recommending optimal policies to a set of multiple entities, each with potentially different characteristics, such that individual entities may parameterize distinct environments with unique transition dynamics. Inspired by the existing meta-learning literature, we extend previous work by focusing on the notion that, in personalized settings, certain environments are more similar to each other than others, and propose a model-free meta-learning algorithm that prioritizes past experiences by relevance during gradient-based adaptation. Our algorithm characterizes past policy divergence using methods from inverse reinforcement learning, and we illustrate how such metrics can effectively distinguish past policy parameters by the environment in which they were deployed, leading to more effective fast adaptation at test time. To study personalization more effectively, we introduce a navigation testbed that specifically incorporates environment diversity across training episodes, and we demonstrate that our approach outperforms meta-learning alternatives on few-shot reinforcement learning in personalized settings.
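The abstract describes relevance-weighted, gradient-based adaptation at a high level only. The sketch below is an illustrative reading of that idea, not the paper's actual algorithm: it substitutes a simple KL divergence between action distributions for the paper's IRL-based divergence characterization, and all names (`policy_fn`, `grad_fn`, `temperature`) are assumed placeholders for task-specific components.

```python
import numpy as np

def policy_divergence(theta_a, theta_b, states, policy_fn):
    """Hypothetical divergence score: mean KL between the action distributions
    induced by two policy parameter vectors over a batch of probe states.
    (Stand-in for the paper's IRL-based divergence measure.)"""
    p = policy_fn(theta_a, states)  # (n_states, n_actions) action probabilities
    q = policy_fn(theta_b, states)
    return np.mean(np.sum(p * (np.log(p + 1e-8) - np.log(q + 1e-8)), axis=1))

def relevance_weighted_adaptation(theta_init, past_thetas, states, policy_fn,
                                  grad_fn, lr=0.1, temperature=1.0):
    """One few-shot adaptation step that prioritizes past policies by
    similarity (low divergence) to the current task's policy estimate."""
    # Score each stored policy by its divergence from the current parameters.
    divs = np.array([policy_divergence(theta_init, th, states, policy_fn)
                     for th in past_thetas])
    # Convert divergences to relevance weights: closer policies weigh more.
    weights = np.exp(-divs / temperature)
    weights /= weights.sum()
    # Form a relevance-weighted prior from past policy parameters.
    theta_prior = sum(w * th for w, th in zip(weights, past_thetas))
    # Standard gradient-based adaptation step from the weighted prior,
    # where grad_fn returns a policy-gradient estimate for the new task.
    return theta_prior + lr * grad_fn(theta_prior)
```

In this reading, the divergence metric serves only to rank stored policy parameters by how closely their induced behavior matches the new entity's environment before a conventional gradient step is taken; the paper's precise weighting and adaptation scheme may differ.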