Paper Title
Inverse Policy Evaluation for Value-based Sequential Decision-making
Paper Authors
Paper Abstract
Value-based methods for reinforcement learning lack generally applicable ways to derive behavior from a value function. Many approaches involve approximate value iteration (e.g., $Q$-learning) and act greedily with respect to the estimates, with an arbitrary degree of entropy to ensure that the state space is sufficiently explored. Behavior based on explicit greedification assumes that the values reflect those of \textit{some} policy, over which the greedy policy will be an improvement. However, value iteration can produce value functions that do not correspond to \textit{any} policy. This is especially relevant in the function-approximation regime, where the true value function cannot be perfectly represented. In this work, we explore the use of \textit{inverse policy evaluation}, the process of solving for a likely policy given a value function, as a means of deriving behavior from a value function. We provide theoretical and empirical results showing that inverse policy evaluation, combined with an approximate value iteration algorithm, is a feasible method for value-based control.
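To make the distinction concrete, the defining condition can be sketched in standard MDP notation (with transition dynamics $p$ and discount factor $\gamma$; these symbols are assumed here rather than taken from the abstract). Given a value estimate $v$, define the one-step lookahead values

$$ q(s,a) \doteq \sum_{s', r} p(s', r \mid s, a)\,\bigl[\, r + \gamma\, v(s') \,\bigr]. $$

Greedification selects $\pi_{\text{greedy}}(s) \in \arg\max_a q(s,a)$, regardless of whether $v$ is the value of any policy. Inverse policy evaluation instead seeks a stochastic policy $\pi$ that is Bellman-consistent with $v$, i.e., one satisfying

$$ \sum_a \pi(a \mid s)\, q(s, a) = v(s) \quad \text{for every state } s, $$

so that $v$ is a fixed point of the Bellman evaluation operator for $\pi$. This sketch states only the consistency condition implied by the abstract's definition of inverse policy evaluation; the specific procedure for solving for such a policy is the subject of the paper and is not reproduced here.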