Paper Title
Safe Exploration for Optimizing Contextual Bandits
Paper Authors
Paper Abstract
Contextual bandit problems are a natural fit for many information retrieval tasks, such as learning to rank, text classification, recommendation, etc. However, existing learning methods for contextual bandit problems have one of two drawbacks: they either do not explore the space of all possible document rankings (i.e., actions) and, thus, may miss the optimal ranking, or they present suboptimal rankings to a user and, thus, may harm the user experience. We introduce a new learning method for contextual bandit problems, Safe Exploration Algorithm (SEA), which overcomes the above drawbacks. SEA starts by using a baseline (or production) ranking system (i.e., policy), which does not harm the user experience and, thus, is safe to execute, but has suboptimal performance and, thus, needs to be improved. Then SEA uses counterfactual learning to learn a new policy based on the behavior of the baseline policy. SEA also uses high-confidence off-policy evaluation to estimate the performance of the newly learned policy. Once the performance of the newly learned policy is at least as good as the performance of the baseline policy, SEA starts using the new policy to execute new actions, allowing it to actively explore favorable regions of the action space. This way, SEA never performs worse than the baseline policy and, thus, does not harm the user experience, while still exploring the action space and, thus, being able to find an optimal policy. Our experiments using text classification and document retrieval confirm the above by comparing SEA (and a boundless variant called BSEA) to online and offline learning methods for contextual bandit problems.
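The abstract outlines SEA's control loop: act with the safe baseline policy while learning a new policy counterfactually from the logged interactions, estimate the new policy's value with a high-confidence off-policy estimator, and switch to the new policy only once its lower confidence bound matches the baseline's performance. Below is a minimal, self-contained sketch of that loop. The synthetic Bernoulli-reward environment, the linear softmax policies, the inverse-propensity-weighted gradient step, and the normal-approximation confidence bound are illustrative stand-ins chosen for brevity; they are not the learners or estimators used in the paper.

```python
# Illustrative sketch of a safe-exploration loop in the spirit of SEA
# (hypothetical environment and estimators, not the authors' implementation).
import numpy as np

rng = np.random.default_rng(0)
n_actions, n_features = 5, 10
true_w = rng.normal(size=(n_actions, n_features))  # hidden reward model (environment)

def reward(x, a):
    """Draw a Bernoulli reward whose mean depends on the context-action score."""
    p = 1.0 / (1.0 + np.exp(-true_w[a] @ x))
    return float(rng.random() < p)

def softmax_policy(w, x):
    """Action distribution of a linear softmax policy with weights w."""
    s = w @ x
    s = s - s.max()
    p = np.exp(s)
    return p / p.sum()

baseline_w = rng.normal(scale=0.1, size=(n_actions, n_features))  # safe "production" policy
learned_w = np.zeros((n_actions, n_features))                     # policy learned counterfactually
logs = []                                                         # (context, action, propensity, reward)
use_learned = False

for t in range(1, 20001):
    x = rng.normal(size=n_features)
    acting_w = learned_w if use_learned else baseline_w
    probs = softmax_policy(acting_w, x)
    a = rng.choice(n_actions, p=probs)
    r = reward(x, a)
    logs.append((x, a, probs[a], r))

    # Counterfactual learning: an inverse-propensity-weighted policy-gradient step
    # on the logged interaction, so the new policy improves without being executed.
    p_new = softmax_policy(learned_w, x)
    ips_weight = r / probs[a]
    for b in range(n_actions):
        indicator = 1.0 if b == a else 0.0
        learned_w[b] += 0.05 * ips_weight * p_new[a] * (indicator - p_new[b]) * x

    # High-confidence off-policy evaluation: periodically compare a lower confidence
    # bound on the learned policy's IPS estimate against the baseline's observed reward.
    if not use_learned and t % 2000 == 0:
        ips = np.array([rc * softmax_policy(learned_w, xc)[ac] / pc
                        for xc, ac, pc, rc in logs])
        baseline_value = np.mean([rc for _, _, _, rc in logs])
        lower_bound = ips.mean() - 1.96 * ips.std(ddof=1) / np.sqrt(len(ips))
        if lower_bound >= baseline_value:
            use_learned = True  # deemed safe: explore with the new policy from now on
            print(f"step {t}: switched to the learned policy (lower bound {lower_bound:.3f})")
```

The key safety property the sketch mirrors is that the learned policy is never executed while its estimated lower bound is below the baseline's value, so user-facing performance never drops below the baseline; once the bound is cleared, the learned policy takes over and can explore regions of the action space the baseline never reaches.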