Paper Title
Off-policy Bandits with Deficient Support
Paper Authors
Paper Abstract
Learning effective contextual-bandit policies from past actions of a deployed system is highly desirable in many settings (e.g. voice assistants, recommendation, search), since it enables the reuse of large amounts of log data. State-of-the-art methods for such off-policy learning, however, are based on inverse propensity score (IPS) weighting. A key theoretical requirement of IPS weighting is that the policy that logged the data has "full support", which typically translates into requiring non-zero probability for any action in any context. Unfortunately, many real-world systems produce support-deficient data, especially when the action space is large, and we show how existing methods can fail catastrophically. To overcome this gap between theory and applications, we identify three approaches that provide various guarantees for IPS-based learning despite the inherent limitations of support-deficient data: restricting the action space, reward extrapolation, and restricting the policy space. We systematically analyze the statistical and computational properties of these three approaches, and we empirically evaluate their effectiveness. In addition to providing the first systematic analysis of support-deficiency in contextual-bandit learning, we conclude with recommendations that provide practical guidance.
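For reference, a standard formulation of the IPS estimator and the full-support condition discussed in the abstract is sketched below; the notation is ours and not taken from the paper. Given logged tuples $(x_i, a_i, r_i)$ with actions drawn from the logging policy $\pi_0$, the value of a target policy $\pi$ is estimated as

\[
\hat{V}_{\mathrm{IPS}}(\pi) \;=\; \frac{1}{n}\sum_{i=1}^{n} \frac{\pi(a_i \mid x_i)}{\pi_0(a_i \mid x_i)}\, r_i,
\qquad a_i \sim \pi_0(\cdot \mid x_i).
\]

The full-support requirement is that $\pi_0(a \mid x) > 0$ for every action $a$ the target policy $\pi$ may select in context $x$; when some actions have zero logging probability (deficient support), the IPS estimate is no longer unbiased for an unrestricted $\pi$, which is the failure mode the three approaches in the paper are designed to address.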