Optimal Off-Policy Evaluation from Multiple Logging Policies

Paper title

Optimal Off-Policy Evaluation from Multiple Logging Policies

Authors

Nathan Kallus, Yuta Saito, Masatoshi Uehara

Abstract

We study off-policy evaluation (OPE) from multiple logging policies, each generating a dataset of fixed size, i.e., stratified sampling. Previous work noted that in this setting the ordering of the variances of different importance sampling estimators is instance-dependent, which raises a dilemma as to which importance sampling weights to use. In this paper, we resolve this dilemma by finding the OPE estimator for multiple loggers with minimum variance for any instance, i.e., the efficient one. In particular, we establish the efficiency bound under stratified sampling and propose an estimator achieving this bound when given consistent $q$-estimates. To guard against misspecification of $q$-functions, we also provide a way to choose the control variate in a hypothesis class to minimize variance. Extensive experiments demonstrate the benefits of our methods, which efficiently leverage stratified off-policy data from multiple loggers.
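To make the setting concrete, the following is a minimal sketch of stratified off-policy data and three common estimator forms. It is an illustration under stated assumptions, not necessarily the estimator proposed in the paper: it assumes a tabular contextual bandit with known logging policies, uniformly drawn contexts, and an oracle $q$-estimate. All names (`simulate_stratified_logs`, `ope_estimates`, the policy tables, and the sample sizes) are hypothetical. The "balanced" weights use the size-weighted mixture of the logging policies, and the doubly robust form adds a $q$-based control variate.

```python
import numpy as np


def simulate_stratified_logs(logging_policies, sizes, mean_reward, rng, noise_sd=1.0):
    """Draw exactly sizes[k] samples from logger k (one fixed-size stratum each)."""
    n_contexts, n_actions = mean_reward.shape
    xs, acts, rews, strata = [], [], [], []
    for k, (pi_k, n_k) in enumerate(zip(logging_policies, sizes)):
        x = rng.integers(n_contexts, size=n_k)            # contexts drawn uniformly (assumption)
        cdf = np.cumsum(pi_k[x], axis=1)                   # per-sample action CDF
        a = (rng.random(n_k)[:, None] > cdf).sum(axis=1)   # inverse-CDF action sampling
        a = np.minimum(a, n_actions - 1)                   # guard against float round-off
        r = mean_reward[x, a] + noise_sd * rng.normal(size=n_k)
        xs.append(x)
        acts.append(a)
        rews.append(r)
        strata.append(np.full(n_k, k))
    return tuple(np.concatenate(v) for v in (xs, acts, rews, strata))


def ope_estimates(x, a, r, stratum, logging_policies, sizes, pi_e, q_hat):
    """Naive IS, balanced IS, and balanced doubly robust value estimates."""
    n = len(r)
    pis = np.stack(logging_policies)                # shape (K, contexts, actions)
    rho = np.asarray(sizes) / n                     # stratum proportions n_k / n
    pi_star = np.tensordot(rho, pis, axes=1)        # size-weighted mixture logging policy
    w_naive = pi_e[x, a] / pis[stratum, x, a]       # weight by the logger that drew the sample
    w_bal = pi_e[x, a] / pi_star[x, a]              # weight by the mixture policy
    direct = (pi_e[x] * q_hat[x]).sum(axis=1)       # sum_a pi_e(a|x) * q_hat(x, a)
    return {
        "naive_is": float(np.mean(w_naive * r)),
        "balanced_is": float(np.mean(w_bal * r)),
        "balanced_dr": float(np.mean(w_bal * (r - q_hat[x, a]) + direct)),
    }


# Hypothetical problem instance: 2 contexts, 3 actions, 2 loggers.
rng = np.random.default_rng(0)
mean_reward = np.array([[1.0, 0.0, 0.5], [0.2, 0.8, 0.4]])
logger_1 = np.array([[0.8, 0.1, 0.1], [0.1, 0.8, 0.1]])
logger_2 = np.array([[0.2, 0.3, 0.5], [0.4, 0.3, 0.3]])
pi_e = np.array([[0.6, 0.2, 0.2], [0.2, 0.6, 0.2]])
loggers, sizes = [logger_1, logger_2], [15000, 5000]

x, a, r, stratum = simulate_stratified_logs(loggers, sizes, mean_reward, rng)
q_hat = mean_reward  # assumption: an oracle q-estimate, used only for illustration
estimates = ope_estimates(x, a, r, stratum, loggers, sizes, pi_e, q_hat)
true_value = float((pi_e * mean_reward).sum(axis=1).mean())
print(true_value, estimates)
```

In this sketch all three estimators target the same policy value; they differ only in variance, and which importance sampling variant has lower variance depends on the instance, which is the dilemma the abstract describes.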
