Paper Title

Off-policy Evaluation in Infinite-Horizon Reinforcement Learning with Latent Confounders

Paper Authors

Andrew Bennett, Nathan Kallus, Lihong Li, Ali Mousavi

Paper Abstract


Off-policy evaluation (OPE) in reinforcement learning is an important problem in settings where experimentation is limited, such as education and healthcare. But, in these very same settings, observed actions are often confounded by unobserved variables making OPE even more difficult. We study an OPE problem in an infinite-horizon, ergodic Markov decision process with unobserved confounders, where states and actions can act as proxies for the unobserved confounders. We show how, given only a latent variable model for states and actions, policy value can be identified from off-policy data. Our method involves two stages. In the first, we show how to use proxies to estimate stationary distribution ratios, extending recent work on breaking the curse of horizon to the confounded setting. In the second, we show optimal balancing can be combined with such learned ratios to obtain policy value while avoiding direct modeling of reward functions. We establish theoretical guarantees of consistency, and benchmark our method empirically.
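For context, the first stage described in the abstract builds on ratio-based OPE for the unconfounded, ergodic case, where the long-run average reward of the evaluation policy can be written as a stationary-distribution-weighted average of observed rewards. The notation below (d^pi for the stationary state-action distribution and w for the density ratio) is introduced here only to illustrate that idea and is not taken from the abstract:

\rho^{\pi_e} \;=\; \mathbb{E}_{(s,a,r)\sim d^{\pi_b}}\big[\, w(s,a)\, r \,\big],
\qquad
w(s,a) \;=\; \frac{d^{\pi_e}(s,a)}{d^{\pi_b}(s,a)},

where d^{\pi} denotes the stationary state-action distribution induced by policy \pi. Per the abstract, the paper's contribution is to identify an analogue of w from state-action proxies when actions also depend on unobserved confounders (stage one), and then to combine such learned ratios with optimal balancing to obtain the policy value without directly modeling the reward function (stage two).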
