Paper Title


Contextual Bandit with Missing Rewards

Authors

Djallel Bouneffouf, Sohini Upadhyay, Yasaman Khazaeni

Abstract


We consider a novel variant of the contextual bandit problem (i.e., the multi-armed bandit with side information, or context, available to a decision-maker) where the reward associated with each context-based decision may not always be observed ("missing rewards"). This new problem is motivated by certain online settings, including clinical trials and ad recommendation applications. In order to address the missing-rewards setting, we propose to combine the standard contextual bandit approach with an unsupervised learning mechanism such as clustering. Unlike standard contextual bandit methods, by leveraging clustering to estimate missing rewards, we are able to learn from each incoming event, even those with missing rewards. Promising empirical results are obtained on several real-life datasets.
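The abstract does not spell out a concrete algorithm, but the core idea can be made concrete with a minimal sketch. The sketch below pairs LinUCB (a standard contextual bandit baseline) with an online k-means-style clustering of contexts: when a reward is missing, it is imputed with the mean observed reward of the context's nearest cluster, so the bandit still updates on every event. The class name, hyperparameters, and the online clustering update are illustrative assumptions, not the paper's exact method.

```python
import numpy as np

class LinUCBWithClusterImputation:
    """Hypothetical sketch: LinUCB whose missing rewards are imputed with
    the mean observed reward of the context's nearest cluster. Names,
    hyperparameters, and the online k-means step are assumptions, not
    the paper's exact algorithm."""

    def __init__(self, n_arms, dim, alpha=1.0, n_clusters=5, seed=0):
        self.alpha = alpha
        self.A = [np.eye(dim) for _ in range(n_arms)]    # per-arm design matrices
        self.b = [np.zeros(dim) for _ in range(n_arms)]  # per-arm reward vectors
        rng = np.random.default_rng(seed)
        self.centroids = rng.normal(size=(n_clusters, dim))
        self.cluster_n = np.zeros(n_clusters)   # contexts assigned per cluster
        self.reward_sum = np.zeros(n_clusters)  # sum of observed rewards per cluster
        self.reward_n = np.zeros(n_clusters)    # observed-reward count per cluster

    def select_arm(self, x):
        # UCB score per arm: theta^T x + alpha * sqrt(x^T A^{-1} x).
        scores = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b
            scores.append(theta @ x + self.alpha * np.sqrt(x @ A_inv @ x))
        return int(np.argmax(scores))

    def update(self, arm, x, reward=None):
        # Online k-means step: move the nearest centroid toward this context.
        k = int(np.argmin(np.linalg.norm(self.centroids - x, axis=1)))
        self.cluster_n[k] += 1
        self.centroids[k] += (x - self.centroids[k]) / self.cluster_n[k]
        if reward is None:
            # Missing reward: impute the cluster's mean observed reward
            # (fall back to 0.0 if this cluster has no observed rewards yet).
            reward = (self.reward_sum[k] / self.reward_n[k]
                      if self.reward_n[k] > 0 else 0.0)
        else:
            self.reward_sum[k] += reward
            self.reward_n[k] += 1
        # Standard LinUCB update with the observed or imputed reward.
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x
```

With this interface the agent calls `update` on every event: observed rewards refine both the bandit and the cluster statistics, while events with missing rewards still contribute an imputed signal, matching the abstract's claim of learning from each incoming event.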
