Paper Title
Accountable Off-Policy Evaluation With Kernel Bellman Statistics
Authors
Abstract
We consider off-policy evaluation (OPE), which evaluates the performance of a new policy from observed data collected from previous experiments, without requiring the execution of the new policy. This finds important applications in areas with high execution cost or safety concerns, such as medical diagnosis, recommendation systems and robotics. In practice, due to the limited information from off-policy data, it is highly desirable to construct rigorous confidence intervals, not just point estimation, for the policy performance. In this work, we propose a new variational framework which reduces the problem of calculating tight confidence bounds in OPE into an optimization problem on a feasible set that catches the true state-action value function with high probability. The feasible set is constructed by leveraging statistical properties of a recently proposed kernel Bellman loss (Feng et al., 2019). We design an efficient computational approach for calculating our bounds, and extend it to perform post-hoc diagnosis and correction for existing estimators. Empirical results show that our method yields tight confidence intervals in different settings.
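To make the abstract's central object a little more concrete, here is a minimal sketch of an empirical kernel Bellman statistic in the V-statistic form associated with the kernel Bellman loss. It is an illustration under stated assumptions, not the authors' implementation: the Gaussian RBF kernel, the bandwidth, and every function and argument name (`rbf_kernel`, `bellman_residuals`, `kernel_bellman_v_statistic`, `features`, `q_next`) are hypothetical choices made for this example. The abstract does not give the concentration bound that defines the feasible set, so that step is not shown.

```python
import numpy as np


def rbf_kernel(x, y, bandwidth=1.0):
    """Gaussian RBF kernel matrix between the rows of x and the rows of y."""
    sq_dists = (
        np.sum(x**2, axis=1)[:, None]
        + np.sum(y**2, axis=1)[None, :]
        - 2.0 * x @ y.T
    )
    return np.exp(-sq_dists / (2.0 * bandwidth**2))


def bellman_residuals(q_sa, rewards, q_next, gamma):
    """Empirical Bellman residual Q(s, a) - (r + gamma * E_{a' ~ pi} Q(s', a')).

    q_next is assumed to already hold the expectation of Q(s', a') under the
    target policy for each transition (an assumption of this sketch).
    """
    return q_sa - (rewards + gamma * q_next)


def kernel_bellman_v_statistic(features, residuals, bandwidth=1.0):
    """V-statistic (1 / n^2) * sum_ij R_i K(x_i, x_j) R_j over n transitions.

    features: (n, d) array representing the state-action pairs x_i.
    residuals: (n,) array of empirical Bellman residuals R_i.
    """
    n = residuals.shape[0]
    gram = rbf_kernel(features, features, bandwidth)
    return float(residuals @ gram @ residuals) / n**2


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d = 50, 3
    features = rng.normal(size=(n, d))   # synthetic state-action features
    q_sa = rng.normal(size=n)            # candidate Q(s_i, a_i)
    rewards = rng.normal(size=n)
    q_next = rng.normal(size=n)          # assumed E_{a' ~ pi} Q(s'_i, a')
    res = bellman_residuals(q_sa, rewards, q_next, gamma=0.9)
    print(kernel_bellman_v_statistic(features, res))
```

Because the kernel is positive definite, the statistic is non-negative, and it is exactly zero when every empirical residual is zero. A feasible set of the kind the abstract describes would keep the candidate Q functions whose statistic falls below a data-dependent threshold. The threshold itself comes from the paper and is left unspecified here.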