论文标题

贝叶斯算法,用于分散的随机土匪

Bayesian Algorithms for Decentralized Stochastic Bandits

论文作者

Lalitha, Anusha, Goldsmith, Andrea

论文摘要

我们研究了一个分散的合作合作多代理的多军强盗问题,其中$ k $臂和通过网络连接的$ n $代理。在我们的模型中,每个机构的奖励分配都是相同的,并且随着时间的流逝,奖励是独立的。在每回合中,代理商选择一只手臂来玩耍,然后向邻居发送消息。目标是最大程度地减少整个网络上平均的累积后悔。我们提出了一个去中心化的贝叶斯多军强盗框架,该框架将单代机构贝叶斯强盗算法扩展到分散的设置。具体而言,我们研究了可以与现有的贝叶斯算法结合使用的信息同化算法,并且使用此信息,我们提出了一种分散的汤普森采样算法和分散的贝叶斯-UCB算法。我们分析了Bernoulli奖励下的分散的Thompson抽样算法,并在累积的遗憾中建立了依赖问题的上限。我们表明,在时间范围内,与最佳集中式代理的常数在时间范围内对量表进行对数产生,并访问了整个网络上的所有观测值。我们的分析还表征了网络结构的累积遗憾。通过广泛的数值研究,我们表明,与受到上置信度结合算法启发的最先进的算法相比,汤普森采样和贝叶斯-UCB的累积后悔较小。我们在八卦协议下以及随着时间变化的网络下实施了建议的分散汤普森采样,每个通信链接都有固定的失败概率。

We study a decentralized cooperative multi-agent multi-armed bandit problem with $K$ arms and $N$ agents connected over a network. In our model, each arm's reward distribution is same for all agents, and rewards are drawn independently across agents and over time steps. In each round, agents choose an arm to play and subsequently send a message to their neighbors. The goal is to minimize cumulative regret averaged over the entire network. We propose a decentralized Bayesian multi-armed bandit framework that extends single-agent Bayesian bandit algorithms to the decentralized setting. Specifically, we study an information assimilation algorithm that can be combined with existing Bayesian algorithms, and using this, we propose a decentralized Thompson Sampling algorithm and decentralized Bayes-UCB algorithm. We analyze the decentralized Thompson Sampling algorithm under Bernoulli rewards and establish a problem-dependent upper bound on the cumulative regret. We show that regret incurred scales logarithmically over the time horizon with constants that match those of an optimal centralized agent with access to all observations across the network. Our analysis also characterizes the cumulative regret in terms of the network structure. Through extensive numerical studies, we show that our extensions of Thompson Sampling and Bayes-UCB incur lesser cumulative regret than the state-of-art algorithms inspired by the Upper Confidence Bound algorithm. We implement our proposed decentralized Thompson Sampling under gossip protocol, and over time-varying networks, where each communication link has a fixed probability of failure.

扫码加入交流群

加入微信交流群

微信交流群二维码

扫码加入学术交流群,获取更多资源