Paper Title
User-Oriented Robust Reinforcement Learning
Paper Authors
Abstract
Recently, improving the robustness of policies across different environments has attracted increasing attention in the reinforcement learning (RL) community. Existing robust RL methods mostly aim to achieve max-min robustness by optimizing the policy's performance in the worst-case environment. In practice, however, users of an RL policy may have different preferences over its performance across environments, and the max-min objective is oftentimes too conservative to satisfy those preferences. Therefore, in this paper, we integrate user preferences into robust RL and propose a novel User-Oriented Robust RL (UOR-RL) framework. Specifically, we define a new User-Oriented Robustness (UOR) metric for RL, which assigns different weights to environments according to user preferences and generalizes the max-min robustness metric. To optimize the UOR metric, we develop two UOR-RL training algorithms, for the scenarios with and without an a priori known environment distribution, respectively. Theoretically, we prove that our UOR-RL training algorithms converge to near-optimal policies even with inaccurate knowledge, or no knowledge at all, of the environment distribution. Furthermore, we carry out extensive experimental evaluations on 4 MuJoCo tasks. The results demonstrate that UOR-RL is comparable to state-of-the-art baselines under the average and worst-case performance metrics and, more importantly, establishes new state-of-the-art performance under the UOR metric.
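To make the relationship between the UOR metric and existing robustness metrics concrete, the following is a minimal sketch of one plausible preference-weighted evaluation metric. The function name `uor_score` and the rank-based weighting scheme are illustrative assumptions, not the paper's exact definition; the point is only that a preference vector interpolates between worst-case and average performance.

```python
def uor_score(env_returns, preference_weights):
    """Hedged sketch of a user-oriented robustness style metric.

    Environments are ranked from worst to best performance and each
    rank is weighted by the user's (normalized) preference vector.
    This is an assumed instantiation for illustration only.
    """
    assert len(env_returns) == len(preference_weights)
    ranked = sorted(env_returns)  # worst-performing environment first
    total = sum(preference_weights)
    return sum(w * r for w, r in zip(preference_weights, ranked)) / total

returns = [1.0, 5.0, 3.0]  # per-environment returns of one policy

# All weight on the worst rank recovers max-min (worst-case) robustness.
worst_case = uor_score(returns, [1.0, 0.0, 0.0])  # -> 1.0
# A uniform preference recovers the average-performance metric.
average = uor_score(returns, [1.0, 1.0, 1.0])     # -> 3.0
```

Under this reading, a user who cares mostly (but not exclusively) about bad environments could choose, e.g., weights `[0.7, 0.2, 0.1]`, obtaining a score between the worst-case and average values.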