Ranking Policy Decisions

Paper Title

Ranking Policy Decisions

Authors

Hadrien Pouget, Hana Chockler, Youcheng Sun, Daniel Kroening

Abstract

Policies trained via Reinforcement Learning (RL) are often needlessly complex, making them difficult to analyse and interpret. In a run with $n$ time steps, a policy will make $n$ decisions on actions to take; we conjecture that only a small subset of these decisions delivers value over selecting a simple default action. Given a trained policy, we propose a novel black-box method based on statistical fault localisation that ranks the states of the environment according to the importance of decisions made in those states. We argue that among other things, the ranked list of states can help explain and understand the policy. As the ranking method is statistical, a direct evaluation of its quality is hard. As a proxy for quality, we use the ranking to create new, simpler policies from the original ones by pruning decisions identified as unimportant (that is, replacing them by default actions) and measuring the impact on performance. Our experiments on a diverse set of standard benchmarks demonstrate that pruned policies can perform on a level comparable to the original policies. Conversely, we show that naive approaches for ranking policy decisions, e.g., ranking based on the frequency of visiting a state, do not result in high-performing pruned policies.
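The ranking-and-pruning idea described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the abstract: the use of the Ochiai measure (one common statistical fault localisation score), the representation of a run as a pair of "states where the default action was substituted" and a pass/fail flag, and the hypothetical helper names `ochiai_ranking` and `prune`. A state scores high when substituting the default action in that state tends to coincide with failed runs.

```python
import math
from collections import Counter


def ochiai_ranking(runs):
    """Rank states by how strongly defaulting in them correlates with failure.

    `runs` is a list of (defaulted_states, passed) pairs (assumed format):
    defaulted_states is the set of states where the default action replaced
    the policy's action, and passed says whether the run met the reward goal.
    """
    total_failed = sum(1 for _, passed in runs if not passed)
    failed_with = Counter()  # failed runs in which the state was defaulted
    passed_with = Counter()  # passed runs in which the state was defaulted
    for defaulted, passed in runs:
        for state in defaulted:
            (passed_with if passed else failed_with)[state] += 1

    states = set(failed_with) | set(passed_with)
    scores = {}
    for state in states:
        ef, ep = failed_with[state], passed_with[state]
        denom = math.sqrt(total_failed * (ef + ep))
        scores[state] = ef / denom if denom else 0.0

    # Highest score first; ties broken by name to keep the order deterministic.
    ranking = sorted(states, key=lambda s: (-scores[s], str(s)))
    return ranking, scores


def prune(policy, ranking, keep, default_action):
    """Follow `policy` only in the top-`keep` states; default elsewhere."""
    important = set(ranking[:keep])
    return lambda state: policy(state) if state in important else default_action


# Toy data (made up): defaulting in "a" always coincides with failure.
runs = [
    ({"a"}, False),
    ({"b"}, True),
    ({"a", "b"}, False),
    ({"c"}, True),
    (set(), True),
]
ranking, scores = ochiai_ranking(runs)
pruned = prune(lambda state: "go", ranking, keep=1, default_action="stay")
```

On this toy data the ranking places "a" first, so the pruned policy keeps the original decision only in "a" and falls back to the default action everywhere else. Comparing the reward of `pruned` with that of the original policy is the kind of proxy evaluation the abstract describes.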
