Paper Title
DART: aDaptive Accept RejecT for non-linear top-K subset identification
Paper Authors
Paper Abstract
We consider the bandit problem of selecting $K$ out of $N$ arms at each time step. The reward can be a non-linear function of the rewards of the selected individual arms. The direct use of a multi-armed bandit algorithm requires choosing among $\binom{N}{K}$ options, making the action space large. To simplify the problem, existing works on combinatorial bandits typically assume that the feedback is a linear function of the individual rewards. In this paper, we prove a lower bound for top-$K$ subset selection under bandit feedback with possibly correlated rewards. We present a novel algorithm for the combinatorial setting that does not use individual arm feedback or require linearity of the reward function. Additionally, our algorithm handles correlated individual arm rewards. Our algorithm, aDaptive Accept RejecT (DART), sequentially identifies good arms and eliminates bad arms based on confidence bounds. DART is computationally efficient and uses storage linear in $N$. Further, DART achieves a regret bound of $\tilde{\mathcal{O}}(K\sqrt{KNT})$ over a time horizon $T$, which matches the lower bound for bandit feedback up to a factor of $\sqrt{\log(2NT)}$. When applied to cross-selling optimization and to maximizing the mean of individual rewards, the proposed algorithm surpasses state-of-the-art algorithms. We also show that DART significantly outperforms existing methods in both linear and non-linear joint reward environments.
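To make the accept/reject idea concrete, below is a minimal Python sketch of a confidence-bound-based loop in the spirit described by the abstract: arms whose estimates are confidently above every remaining competitor are accepted, and arms confidently below the provisional top-$K$ group are rejected. This is an illustrative sketch under stated assumptions, not the paper's exact DART procedure; the `joint_reward` oracle, the Hoeffding-style radius, and the per-arm attribution of the subset-level reward are all assumptions made for the example.

```python
import numpy as np

def accept_reject_sketch(joint_reward, N, K, T, delta=0.01, rng=None):
    """Illustrative accept/reject loop (a sketch, not the paper's exact DART).

    joint_reward: hypothetical callable taking an array of K arm indices and
                  returning a scalar subset-level (bandit) reward in [0, 1].
    Returns a list of K selected arm indices.
    """
    rng = np.random.default_rng() if rng is None else rng
    active = list(range(N))          # arms not yet accepted or rejected
    accepted = []
    est = np.zeros(N)                # per-arm empirical score (assumed attribution)
    cnt = np.zeros(N)                # number of times each arm was played

    for _ in range(T):
        if len(accepted) >= K or not active:
            break
        # Play accepted arms plus a random fill of undecided arms.
        need = K - len(accepted)
        fill = rng.choice(active, size=min(need, len(active)), replace=False)
        subset = np.array(accepted + list(fill))
        r = joint_reward(subset)
        # Attribute the joint reward to the undecided arms that were included.
        for a in fill:
            cnt[a] += 1
            est[a] += (r - est[a]) / cnt[a]
        # Hoeffding-style confidence radius (assumed form for this sketch).
        rad = np.sqrt(np.log(2 * N * T / delta) / np.maximum(cnt, 1))
        # Compare confidence intervals around the K-th-best boundary.
        order = sorted(active, key=lambda a: est[a], reverse=True)
        boundary = K - len(accepted)
        if len(order) > boundary:
            top, rest = order[:boundary], order[boundary:]
            thresh_lo = min(est[a] - rad[a] for a in top)
            thresh_hi = max(est[a] + rad[a] for a in rest)
            # Accept arms clearly above every remaining competitor.
            for a in top:
                if cnt[a] > 0 and est[a] - rad[a] > thresh_hi:
                    accepted.append(a)
                    active.remove(a)
            # Reject arms clearly below the provisional top group.
            for a in rest:
                if cnt[a] > 0 and est[a] + rad[a] < thresh_lo:
                    active.remove(a)

    # Pad with the best remaining estimates if fewer than K arms were accepted.
    while len(accepted) < K:
        best = max(active, key=lambda a: est[a])
        accepted.append(best)
        active.remove(best)
    return accepted
```

The storage here is linear in $N$ (two length-$N$ arrays), consistent with the abstract's claim about DART's memory usage; the specific attribution and threshold rules above are simplifications chosen only to illustrate the accept/reject mechanism.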