Policy Evaluation and Seeking for Multi-Agent Reinforcement Learning via Best Response

Paper Title

Policy Evaluation and Seeking for Multi-Agent Reinforcement Learning via Best Response

Authors

Rui Yan, Xiaoming Duan, Zongying Shi, Yisheng Zhong, Jason R. Marden, Francesco Bullo

Abstract

This paper introduces two metrics (cycle-based and memory-based), grounded in a dynamical game-theoretic solution concept called sink equilibrium, for the evaluation, ranking, and computation of policies in multi-agent learning. We adopt strict best response dynamics (SBRD) to model selfish behaviors at a meta-level for multi-agent reinforcement learning. Our approach can deal with dynamical cyclical behaviors (unlike approaches based on Nash equilibria and Elo ratings), and is more compatible with single-agent reinforcement learning than alpha-rank, which relies on weakly better responses. We first consider settings where the difference between the largest and second-largest underlying metrics has a known lower bound. With this knowledge, we propose a class of perturbed SBRD with the following property: only policies with maximum metric are observed with nonzero probability for a broad class of stochastic games with finite memory. We then consider settings where the lower bound for the difference is unknown. For this setting, we propose a class of perturbed SBRD such that the metrics of the policies observed with nonzero probability differ from the optimal by no more than any given tolerance. The proposed perturbed SBRD addresses opponent-induced non-stationarity by fixing the strategies of the other agents while one agent learns, and uses empirical game-theoretic analysis to estimate payoffs for each strategy profile obtained due to the perturbation.
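
To make the sink-equilibrium idea concrete, the following is a minimal sketch on a two-player matrix game, not the paper's method. It assumes that a strict best response move means a single player switching to a best response that strictly improves that player's payoff, and it treats a sink equilibrium as a strongly connected set of strategy profiles with no outgoing moves. The function names (`strict_best_response_graph`, `reachable`, `sink_equilibria`) and the example payoff matrices are hypothetical, chosen only for illustration; the paper's cycle-based and memory-based metrics and its perturbed dynamics are not implemented here.

```python
from itertools import product


def strict_best_response_graph(A, B):
    """Build the strict best response graph of a bimatrix game.

    A[i][j] and B[i][j] are the payoffs of the row and column player at
    profile (i, j). An edge leaves (i, j) only when one player can
    strictly improve by switching to a best response (assumed definition).
    """
    n, m = len(A), len(A[0])
    edges = {}
    for i, j in product(range(n), range(m)):
        successors = set()
        best_row = max(A[k][j] for k in range(n))
        if A[i][j] < best_row:  # row player strictly improves
            successors.update((k, j) for k in range(n) if A[k][j] == best_row)
        best_col = max(B[i][l] for l in range(m))
        if B[i][j] < best_col:  # column player strictly improves
            successors.update((i, l) for l in range(m) if B[i][l] == best_col)
        edges[(i, j)] = successors
    return edges


def reachable(edges, start):
    """Return every profile reachable from start, including start itself."""
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for nxt in edges[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def sink_equilibria(A, B):
    """Return the sink strongly connected components as frozensets."""
    edges = strict_best_response_graph(A, B)
    reach = {v: reachable(edges, v) for v in edges}
    sinks = set()
    for v, r in reach.items():
        # v lies in a sink component iff everything it reaches can reach it back.
        if all(v in reach[u] for u in r):
            sinks.add(frozenset(r))
    return sinks


# Matching pennies: no pure Nash equilibrium, best responses cycle.
pennies_A = [[1, -1], [-1, 1]]
pennies_B = [[-1, 1], [1, -1]]

# Prisoner's dilemma (0 = cooperate, 1 = defect).
dilemma_A = [[3, 0], [5, 1]]
dilemma_B = [[3, 5], [0, 1]]

pennies_sinks = sink_equilibria(pennies_A, pennies_B)
dilemma_sinks = sink_equilibria(dilemma_A, dilemma_B)
```

In this sketch, matching pennies has a single sink that contains all four profiles, which is the kind of cyclical behavior a pure Nash equilibrium cannot describe, while the prisoner's dilemma has a single sink containing only mutual defection, which coincides with its pure Nash equilibrium.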
