Paper Title
Reward Tweaking: Maximizing the Total Reward While Planning for Short Horizons
Paper Authors
Paper Abstract
In reinforcement learning, the discount factor $γ$ controls the agent's effective planning horizon. Traditionally, this parameter was considered part of the MDP; however, as deep reinforcement learning algorithms tend to become unstable when the effective planning horizon is long, recent works refer to $γ$ as a hyper-parameter -- thus changing the underlying MDP and potentially leading the agent towards sub-optimal behavior on the original task. In this work, we introduce \emph{reward tweaking}. Reward tweaking learns a surrogate reward function $\tilde r$ for the discounted setting that induces optimal behavior on the original finite-horizon total reward task. Theoretically, we show that there exists a surrogate reward that leads to optimality in the original task and discuss the robustness of our approach. Additionally, we perform experiments in high-dimensional continuous control tasks and show that reward tweaking guides the agent towards better long-horizon returns although it plans for short horizons.
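
As a minimal formalization sketch of the stated objective (the horizon $T$, states $s_t$, and actions $a_t$ are assumed notation, not taken from the paper body): reward tweaking seeks a surrogate reward $\tilde r$ whose $\gamma$-discounted optimal policy also maximizes the original undiscounted finite-horizon return under $r$,
$$
\pi^{*}_{\tilde r,\gamma} \in \arg\max_{\pi} \; \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{T-1} \gamma^{t}\, \tilde r(s_t, a_t)\right]
\quad \text{with} \quad
\pi^{*}_{\tilde r,\gamma} \in \arg\max_{\pi} \; \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{T-1} r(s_t, a_t)\right].
$$
In words, the agent still plans with a short effective horizon (small $\gamma$), but the learned $\tilde r$ reshapes the discounted problem so that its optimum matches the long-horizon total-reward task.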