Paper Title

Inverse Policy Evaluation for Value-based Sequential Decision-making

Paper Authors

Chan, Alan, de Asis, Kris, Sutton, Richard S.

Paper Abstract

Value-based methods for reinforcement learning lack generally applicable ways to derive behavior from a value function. Many approaches involve approximate value iteration (e.g., $Q$-learning), and acting greedily with respect to the estimates with an arbitrary degree of entropy to ensure that the state-space is sufficiently explored. Behavior based on explicit greedification assumes that the values reflect those of \textit{some} policy, over which the greedy policy will be an improvement. However, value-iteration can produce value functions that do not correspond to \textit{any} policy. This is especially relevant in the function-approximation regime, when the true value function can't be perfectly represented. In this work, we explore the use of \textit{inverse policy evaluation}, the process of solving for a likely policy given a value function, for deriving behavior from a value function. We provide theoretical and empirical results to show that inverse policy evaluation, combined with an approximate value iteration algorithm, is a feasible method for value-based control.
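
As a rough illustration (not the paper's exact formulation), the condition described in the abstract can be read as follows. Assume a tabular MDP with dynamics $p$, reward $r$, discount $\gamma$, and a value estimate $\hat{v}$ (notation assumed here, not taken from this listing). With the induced action values $\hat{q}(s,a) = \sum_{s'} p(s' \mid s, a)\,\bigl[\, r(s,a,s') + \gamma\, \hat{v}(s') \,\bigr]$, inverse policy evaluation seeks a policy $\pi$ satisfying
$$\sum_{a} \pi(a \mid s)\, \hat{q}(s,a) = \hat{v}(s) \quad \text{for all } s, \qquad \pi(\cdot \mid s) \in \Delta(\mathcal{A}),$$
whereas explicit greedification takes $\pi(s) \in \arg\max_a \hat{q}(s,a)$ and relies on $\hat{v}$ being the value of \textit{some} policy, which approximate value iteration does not guarantee.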
