遗憾的是在随机上下文的决斗匪徒中最小化

论文标题

遗憾的是在随机上下文的决斗匪徒中最小化

Regret Minimization in Stochastic Contextual Dueling Bandits

论文作者

Saha, Aadirupa, Gopalan, Aditya

论文摘要

我们认为在上下文设置中，随机$ k $ armed的决斗匪徒的问题，在每个回合中，学习者都会有一个$ k $项目的上下文集，每个项目都由$ d $ d $维的特征矢量表示，学习者的目标是确定每个上下文集的最佳组合。但是，与经典的上下文匪徒设置不同，我们的框架仅允许学习者从（嘈杂的）帕里式偏好（嘈杂的）偏好（召集）中获得项目反馈 - 被闻名为决策匪徒，这是各种在线决策方案的实际兴趣，例如。推荐系统，信息检索，锦标赛排名，在此更容易获取项目的相对强度而不是其绝对得分。但是，据我们所知，这项工作是第一个考虑对潜在无限决策空间的上下文决斗匪徒的遗憾最小化问题的问题，并给出了可证明的最佳算法以及匹配的下限分析。我们为设置提供了两种算法，并以各自的遗憾保证$ \ tilde o（d \ sqrt {t}）$和$ \ tilde o（\ sqrt {dt \ log k}）$。随后，我们还表明$ω（\ sqrt {dt}）$实际上是此问题的基本性能限制，这意味着我们的第二算法的最佳性。但是，对我们的第一种算法的分析相对简单，并且通常证明它的表现优于前者。最后，我们通过合适的实验证实了所有理论结果。

We consider the problem of stochastic $K$-armed dueling bandit in the contextual setting, where at each round the learner is presented with a context set of $K$ items, each represented by a $d$-dimensional feature vector, and the goal of the learner is to identify the best arm of each context sets. However, unlike the classical contextual bandit setup, our framework only allows the learner to receive item feedback in terms of their (noisy) pariwise preferences--famously studied as dueling bandits which is practical interests in various online decision making scenarios, e.g. recommender systems, information retrieval, tournament ranking, where it is easier to elicit the relative strength of the items instead of their absolute scores. However, to the best of our knowledge this work is the first to consider the problem of regret minimization of contextual dueling bandits for potentially infinite decision spaces and gives provably optimal algorithms along with a matching lower bound analysis. We present two algorithms for the setup with respective regret guarantees $\tilde O(d\sqrt{T})$ and $\tilde O(\sqrt{dT \log K})$. Subsequently we also show that $Ω(\sqrt {dT})$ is actually the fundamental performance limit for this problem, implying the optimality of our second algorithm. However the analysis of our first algorithm is comparatively simpler, and it is often shown to outperform the former empirically. Finally, we corroborate all the theoretical results with suitable experiments.

下载PDF全文

下载文献需遵守相关版权规定

论文标题