Paper Title
Reward Tweaking: Maximizing the Total Reward While Planning for Short Horizons
Paper Authors
Paper Abstract
In reinforcement learning, the discount factor $γ$ controls the agent's effective planning horizon. Traditionally, this parameter was considered part of the MDP; however, as deep reinforcement learning algorithms tend to become unstable when the effective planning horizon is long, recent works refer to $γ$ as a hyper-parameter -- thus changing the underlying MDP and potentially leading the agent towards sub-optimal behavior on the original task. In this work, we introduce \emph{reward tweaking}. Reward tweaking learns a surrogate reward function $\tilde r$ for the discounted setting that induces optimal behavior on the original finite-horizon total reward task. Theoretically, we show that there exists a surrogate reward that leads to optimality in the original task and discuss the robustness of our approach. Additionally, we perform experiments in high-dimensional continuous control tasks and show that reward tweaking guides the agent towards better long-horizon returns although it plans for short horizons.
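
As a minimal formalization sketch of the stated objective (the horizon $T$, states $s_t$, and actions $a_t$ are assumed notation, not taken from the paper body): reward tweaking seeks a surrogate reward $\tilde r$ whose $\gamma$-discounted optimal policy also maximizes the original undiscounted finite-horizon return under $r$,
$$
\pi^{*}_{\tilde r,\gamma} \in \arg\max_{\pi} \; \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{T-1} \gamma^{t}\, \tilde r(s_t, a_t)\right]
\quad \text{with} \quad
\pi^{*}_{\tilde r,\gamma} \in \arg\max_{\pi} \; \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{T-1} r(s_t, a_t)\right].
$$
In words, the agent still plans with a short effective horizon (small $\gamma$), but the learned $\tilde r$ reshapes the discounted problem so that its optimum matches the long-horizon total-reward task.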