Paper Title

Stochastic Contextual Dueling Bandits under Linear Stochastic Transitivity Models

Authors

Bengs, Viktor, Saha, Aadirupa, Hüllermeier, Eyke

Abstract

We consider the regret minimization task in a dueling bandits problem with context information. In every round of the sequential decision problem, the learner makes a context-dependent selection of two choice alternatives (arms) to be compared with each other and receives feedback in the form of noisy preference information. We assume that the feedback process is determined by a linear stochastic transitivity model with contextualized utilities (CoLST), and the learner's task is to include the best arm (with highest latent context-dependent utility) in the duel. We propose a computationally efficient algorithm, $\texttt{CoLSTIM}$, which makes its choice based on imitating the feedback process using perturbed context-dependent utility estimates of the underlying CoLST model. If each arm is associated with a $d$-dimensional feature vector, we show that $\texttt{CoLSTIM}$ achieves a regret of order $\tilde O( \sqrt{dT})$ after $T$ learning rounds. Additionally, we establish the optimality of $\texttt{CoLSTIM}$ by showing a lower bound for the weak regret that refines the existing average regret analysis. Our experiments demonstrate its superiority over state-of-the-art algorithms for special cases of CoLST models.
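The abstract's core idea, selecting a duel by imitating the feedback process with perturbed context-dependent utility estimates, can be sketched as follows. This is a minimal illustration, not the paper's exact rule: the function name `colstim_select`, the Gumbel perturbation (which corresponds to a Bradley-Terry/Plackett-Luce-type CoLST model), and the pairing of the perturbed winner with the strongest unperturbed challenger are all assumptions made for the sketch.

```python
import numpy as np

def colstim_select(theta_hat, features, rng):
    """One round of CoLSTIM-style arm selection (hypothetical sketch).

    theta_hat: current estimate of the utility parameter, shape (d,)
    features:  context feature vectors of the K arms, shape (K, d)
    rng:       numpy random Generator
    Returns the indices of the two arms to be dueled.
    """
    # Estimated context-dependent utilities of all arms.
    utilities = features @ theta_hat
    # Perturb the estimates, imitating the stochastic feedback process;
    # Gumbel noise is one choice consistent with a BTL-type CoLST model.
    perturbed = utilities + rng.gumbel(size=len(utilities))
    first = int(np.argmax(perturbed))
    # Pair the perturbed winner with the best remaining arm under the
    # unperturbed estimates (one plausible challenger rule, assumed here).
    rest = np.delete(np.arange(len(utilities)), first)
    second = int(rest[np.argmax(utilities[rest])])
    return first, second

# Usage: 5 arms with 2-dimensional features.
rng = np.random.default_rng(0)
theta_hat = np.array([1.0, -0.5])
features = rng.normal(size=(5, 2))
pair = colstim_select(theta_hat, features, rng)
```

A regret-minimizing learner would additionally update `theta_hat` from the observed preference feedback each round, e.g. by regularized maximum-likelihood estimation under the assumed CoLST model.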
