Paper Title
Distributionally Robust Batch Contextual Bandits
Paper Authors
Paper Abstract
Policy learning using historical observational data is an important problem with widespread applications. Examples include selecting offers, prices, or advertisements to send to customers, as well as selecting which medication to prescribe to a patient. However, the existing literature rests on the crucial assumption that the future environment where the learned policy will be deployed is the same as the past environment that generated the data -- an assumption that is often false or too coarse an approximation. In this paper, we lift this assumption and aim to learn a distributionally robust policy from incomplete observational data. We first present a policy evaluation procedure that allows us to assess how well a policy does under the worst-case environment shift. We then establish a central limit theorem-type guarantee for this proposed policy evaluation scheme. Leveraging this evaluation scheme, we further propose a novel learning algorithm that learns a policy robust to adversarial perturbations and unknown covariate shifts, with a performance guarantee based on the theory of uniform convergence. Finally, we empirically test the effectiveness of the proposed algorithm on synthetic datasets and demonstrate that it provides robustness that standard policy learning algorithms lack. We conclude the paper with a comprehensive application of our methods in the context of a real-world voting dataset.
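To make the worst-case evaluation step concrete, below is a minimal sketch of robust policy evaluation from logged bandit data. It assumes a KL-divergence uncertainty set around the data-generating distribution and inverse-propensity weighting; the function name `robust_policy_value`, its arguments, and the choice of a KL ball are illustrative assumptions for this sketch, not necessarily the paper's exact estimator.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def robust_policy_value(rewards, actions, pi_actions, propensities, delta):
    """Worst-case value of a policy over all distributions within a
    KL ball of radius delta around the data-generating distribution.

    Uses the standard KL-DRO duality:
        inf_{Q: KL(Q||P) <= delta} E_Q[Y]
          = sup_{alpha > 0} { -alpha * log E_P[exp(-Y/alpha)] - alpha*delta }

    rewards:      observed rewards Y_i (assumed bounded)
    actions:      logged actions A_i
    pi_actions:   actions the candidate policy would take at X_i
    propensities: logging probabilities e(A_i | X_i)
    """
    # Inverse-propensity weights keep only samples where the logged
    # action agrees with the policy's action.
    match = (actions == pi_actions).astype(float)
    w = match / propensities

    # Shift rewards so the smallest matched reward is 0; this keeps
    # exp(-y/alpha) from underflowing for small alpha and is undone below.
    y0 = rewards[w > 0].min()
    y = rewards - y0

    def neg_dual(alpha):
        # Negative dual objective, minimized over the scalar alpha.
        moment = np.average(np.exp(-y / alpha), weights=w)
        return -(-alpha * np.log(moment) - alpha * delta + y0)

    res = minimize_scalar(neg_dual, bounds=(1e-6, 1e3), method="bounded")
    return -res.fun

# Example: 500 logged rounds, 3 arms, uniform logging policy.
rng = np.random.default_rng(0)
n, k = 500, 3
actions = rng.integers(k, size=n)
rewards = rng.binomial(1, 0.3 + 0.2 * actions).astype(float)
pi = np.full(n, 2)  # candidate policy: always play arm 2
value = robust_policy_value(rewards, actions, pi, np.full(n, 1.0 / k), delta=0.1)
```

The appeal of the dual form is that the inner infimum over an infinite-dimensional set of distributions collapses to a one-dimensional maximization over alpha, so the worst-case value is as cheap to estimate as a standard importance-weighted average.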