Paper Title

Low Variance Off-policy Evaluation with State-based Importance Sampling

Paper Authors

Bossens, David M., Thomas, Philip S.

Paper Abstract

In many domains, the exploration process of reinforcement learning will be too costly as it requires trying out suboptimal policies, resulting in a need for off-policy evaluation, in which a target policy is evaluated based on data collected from a known behaviour policy. In this context, importance sampling estimators provide estimates for the expected return by weighting the trajectory based on the probability ratio of the target policy and the behaviour policy. Unfortunately, such estimators have a high variance and therefore a large mean squared error. This paper proposes state-based importance sampling estimators which reduce the variance by dropping certain states from the computation of the importance weight. To illustrate their applicability, we demonstrate state-based variants of ordinary importance sampling, weighted importance sampling, per-decision importance sampling, incremental importance sampling, doubly robust off-policy evaluation, and stationary density ratio estimation. Experiments in four domains show that state-based methods consistently yield reduced variance and improved accuracy compared to their traditional counterparts.

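To make the estimators described above concrete, here is a minimal Python sketch of ordinary importance sampling and of a state-based variant that skips user-chosen states when forming the importance weight. This is not code from the paper: the tabular policy representation, the function names, and the `dropped_states` argument are illustrative assumptions, and the paper's actual criterion for deciding which states may be dropped is not reproduced here.

```python
import numpy as np

def importance_weight(trajectory, target_policy, behaviour_policy,
                      dropped_states=frozenset()):
    """Product of per-step likelihood ratios pi(a|s) / b(a|s).

    Steps whose state is in `dropped_states` are skipped, mimicking the
    state-based variants described in the abstract; with an empty set this
    is the ordinary (full-trajectory) importance weight.
    """
    weight = 1.0
    for state, action in trajectory:   # trajectory = [(s_0, a_0), (s_1, a_1), ...]
        if state in dropped_states:
            continue                   # drop this state's ratio from the weight
        weight *= target_policy[state][action] / behaviour_policy[state][action]
    return weight

def is_estimate(trajectories, returns, target_policy, behaviour_policy,
                dropped_states=frozenset()):
    """Importance-sampling estimate of the target policy's expected return:
    the average of (importance weight * observed return) over trajectories."""
    weights = np.array([
        importance_weight(t, target_policy, behaviour_policy, dropped_states)
        for t in trajectories
    ])
    return float(np.mean(weights * np.asarray(returns)))

# Toy usage with two states and two actions; policies are {state: {action: prob}}.
behaviour = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.5, 1: 0.5}}
target    = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.5, 1: 0.5}}
trajectories = [[(0, 0), (1, 1)], [(0, 1), (1, 0)]]
returns = [1.0, 0.0]
print(is_estimate(trajectories, returns, target, behaviour))                      # ordinary IS
print(is_estimate(trajectories, returns, target, behaviour, dropped_states={1}))  # state-based IS
```

In this toy example the likelihood ratio in state 1 is identically 1, so dropping that state leaves the estimate unchanged while shortening the product of ratios; in general, identifying which states can be dropped while keeping the estimator accurate and lowering its variance is the contribution of the state-based estimators listed in the abstract.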