Paper Title

Non-Stationary Off-Policy Optimization

Authors

Hong, Joey, Kveton, Branislav, Zaheer, Manzil, Chow, Yinlam, Ahmed, Amr

Abstract

Off-policy learning is a framework for evaluating and optimizing policies without deploying them, from data collected by another policy. Real-world environments are typically non-stationary and the offline learned policies should adapt to these changes. To address this challenge, we study the novel problem of off-policy optimization in piecewise-stationary contextual bandits. Our proposed solution has two phases. In the offline learning phase, we partition logged data into categorical latent states and learn a near-optimal sub-policy for each state. In the online deployment phase, we adaptively switch between the learned sub-policies based on their performance. This approach is practical and analyzable, and we provide guarantees on both the quality of off-policy optimization and the regret during online deployment. To show the effectiveness of our approach, we compare it to state-of-the-art baselines on both synthetic and real-world datasets. Our approach outperforms methods that act only on observed context.
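The abstract describes a two-phase procedure: an offline phase that partitions logged data into latent states and learns a sub-policy per state, and an online phase that adaptively switches between those sub-policies. The sketch below is a minimal illustration of that structure only, not the paper's algorithm: it substitutes a crude k-means clustering for the latent-state model, inverse-propensity-scored value estimates for the per-state sub-policy learning, and a sliding-window reward average for the adaptive switching rule. All names and parameters here (offline_phase, online_phase, env_step, window) are hypothetical.

```python
import numpy as np


def offline_phase(contexts, actions, rewards, propensities, n_states, n_actions, seed=0):
    """Offline phase (sketch): partition logged rounds into latent states with a
    crude k-means on contexts (a stand-in for the paper's latent-state model),
    then pick one action per state via inverse-propensity-scored value estimates."""
    rng = np.random.default_rng(seed)
    contexts = np.asarray(contexts, dtype=float)
    actions = np.asarray(actions)
    rewards = np.asarray(rewards, dtype=float)
    propensities = np.asarray(propensities, dtype=float)
    centers = contexts[rng.choice(len(contexts), size=n_states, replace=False)]
    for _ in range(20):  # fixed number of Lloyd iterations
        dists = ((contexts[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        for s in range(n_states):
            if np.any(labels == s):
                centers[s] = contexts[labels == s].mean(axis=0)
    sub_policies = np.zeros(n_states, dtype=int)
    for s in range(n_states):
        mask = labels == s
        values = np.zeros(n_actions)
        for a in range(n_actions):
            sel = mask & (actions == a)
            if sel.any():  # IPS estimate of action a's value in latent state s
                values[a] = np.sum(rewards[sel] / propensities[sel]) / max(mask.sum(), 1)
        sub_policies[s] = int(values.argmax())
    return sub_policies


def online_phase(sub_policies, env_step, horizon, window=200):
    """Online phase (sketch): treat each learned sub-policy as an arm and switch
    adaptively based on a sliding-window average of its observed rewards."""
    histories = [[] for _ in sub_policies]
    total_reward = 0.0
    for t in range(horizon):
        # unexplored sub-policies get priority; otherwise play the best recent one
        means = [np.mean(h[-window:]) if h else np.inf for h in histories]
        i = int(np.argmax(means))
        r = env_step(t, sub_policies[i])  # reward of playing sub-policy i's action
        histories[i].append(r)
        total_reward += r
    return total_reward
```

The paper's online switching rule comes with regret guarantees in the piecewise-stationary setting; the sliding-window average above is only the simplest adaptive stand-in for that component.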
