结构匪徒的最佳学习

论文标题

结构匪徒的最佳学习

Optimal Learning for Structured Bandits

论文作者

Van Parys, Bart P. G., Golrezaei, Negin

论文摘要

我们研究结构化的多军匪徒，这是在存在结构信息的情况下在不确定性下在线决策的问题。在这个问题中，尽管随着时间的流逝，决策者仍需要发现最佳的行动方案。决策者知道有关奖励分布的某些凸结构信息；也就是说，决策者知道武器的奖励分布属于凸形紧凑型集合。在存在这样的结构信息的情况下，他们希望通过利用这些信息来最大程度地减少遗憾，而遗憾是其对基准政策的绩效差异，该政策提前知道了最好的行动。在没有结构信息的情况下，众所周知，经典的上置信度结合（UCB）和Thomson抽样算法最少。正如最近指出的那样，这两个算法都不能利用实际上通常可用的结构信息。我们提出了一种新颖的学习算法，我们称之为“ dusa”，其遗憾与信息理论的遗憾下降相匹配，以达到不变的因素，并且可以处理广泛的结构信息。我们的算法DUSA解决了对经验奖励分布的遗憾下限的双重对应物，并遵循其建议的游戏。我们表明，这个想法导致了第一个可行的计算可行的学习政策，对各种结构信息的渐近最小遗憾，包括众所周知的结构化匪徒，例如线性，Lipschitz和凸Bistits，以及由于缺乏统一和灵活的框架而没有在文献中研究的新型结构强盗。

We study structured multi-armed bandits, which is the problem of online decision-making under uncertainty in the presence of structural information. In this problem, the decision-maker needs to discover the best course of action despite observing only uncertain rewards over time. The decision-maker is aware of certain convex structural information regarding the reward distributions; that is, the decision-maker knows the reward distributions of the arms belong to a convex compact set. In the presence such structural information, they then would like to minimize their regret by exploiting this information, where the regret is its performance difference against a benchmark policy that knows the best action ahead of time. In the absence of structural information, the classical upper confidence bound (UCB) and Thomson sampling algorithms are well known to suffer minimal regret. As recently pointed out, neither algorithms are, however, capable of exploiting structural information that is commonly available in practice. We propose a novel learning algorithm that we call "DUSA" whose regret matches the information-theoretic regret lower bound up to a constant factor and can handle a wide range of structural information. Our algorithm DUSA solves a dual counterpart of the regret lower bound at the empirical reward distribution and follows its suggested play. We show that this idea leads to the first computationally viable learning policy with asymptotic minimal regret for various structural information, including well-known structured bandits such as linear, Lipschitz, and convex bandits, and novel structured bandits that have not been studied in the literature due to the lack of a unified and flexible framework.

下载PDF全文

下载文献需遵守相关版权规定

论文标题