Paper Title

Distributed Bandits with Heterogeneous Agents

Paper Authors

Lin Yang, Yu-zhen Janice Chen, Mohammad Hajiesmaili, John C. S. Lui, Don Towsley

Paper Abstract

This paper tackles a multi-agent bandit setting in which $M$ agents cooperate to solve the same instance of a $K$-armed stochastic bandit problem. The agents are \textit{heterogeneous}: each agent has access only to a local subset of arms, and the agents are asynchronous, with different gaps between decision-making rounds. The goal of each agent is to find its local optimal arm, and agents can cooperate by sharing their observations with others. While cooperation between agents improves learning performance, it comes with the additional complexity of communication between agents. For this heterogeneous multi-agent setting, we propose two learning algorithms, \ucbo and \AAE. We prove that both algorithms achieve order-optimal regret, namely $O\left(\sum_{i:\tilde{\Delta}_i>0} \log T/\tilde{\Delta}_i\right)$, where $\tilde{\Delta}_i$ is the minimum suboptimality gap between the reward mean of arm $i$ and any local optimal arm. In addition, by carefully selecting the information that is valuable to share, \AAE achieves a low communication complexity of $O(\log T)$. Finally, numerical experiments verify the efficiency of both algorithms.
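
To make the setting concrete, below is a minimal, illustrative simulation of cooperative UCB with heterogeneous arm access. It is not the paper's \ucbo or \AAE algorithm: the arm means, the local arm subsets, and the synchronous round structure are assumptions made only for this sketch, and agents here share every observation, whereas \AAE communicates selectively.

```python
import numpy as np

# Illustrative simulation (not the paper's \ucbo or \AAE): M heterogeneous
# agents, each restricted to a local subset of K arms, pool all observations
# and pick their best accessible arm by a UCB index on the shared statistics.
# Arm means, local subsets, and the synchronous rounds are assumptions.
rng = np.random.default_rng(0)
K, M, T = 6, 3, 5000
mu = rng.uniform(0.2, 0.8, size=K)              # true Bernoulli means (made up)
local_sets = [np.array([0, 1, 2]),              # agent 0 can only pull arms 0-2
              np.array([2, 3, 4]),              # agent 1 can only pull arms 2-4
              np.array([3, 4, 5])]              # agent 2 can only pull arms 3-5

counts = np.zeros(K)                            # shared pull counts
sums = np.zeros(K)                              # shared reward sums
regret = np.zeros(M)                            # per-agent cumulative regret

for t in range(1, T + 1):
    for agent, arms in enumerate(local_sets):
        means = sums / np.maximum(counts, 1)
        bonus = np.sqrt(2.0 * np.log(M * t) / np.maximum(counts, 1))
        index = np.where(counts > 0, means + bonus, np.inf)  # force exploration
        arm = arms[np.argmax(index[arms])]      # best accessible arm by UCB
        reward = float(rng.random() < mu[arm])  # Bernoulli reward draw
        counts[arm] += 1                        # "broadcast" the observation
        sums[arm] += reward
        regret[agent] += mu[arms].max() - mu[arm]   # gap to local optimal arm

print("per-agent cumulative regret:", np.round(regret, 1))
```

In this sketch each agent measures regret against the best arm in its own local subset, which mirrors the paper's suboptimality gaps $\tilde{\Delta}_i$ being defined with respect to local optimal arms rather than a single global best arm.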
