Paper title
Learning in two-player games between transparent opponents
Paper authors
Paper abstract
We consider a scenario in which two reinforcement learning agents repeatedly play a matrix game against each other and update their parameters after each round. The agents' decision-making is transparent to each other, which allows each agent to predict how their opponent will play against them. To prevent an infinite regress of both agents recursively predicting each other indefinitely, each agent is required to give an opponent-independent response with some probability at least epsilon. Transparency also allows each agent to anticipate and shape the other agent's gradient step, i.e. to move to regions of parameter space in which the opponent's gradient points in a direction favourable to them. We study the resulting dynamics experimentally, using two algorithms from previous literature (LOLA and SOS) for opponent-aware learning. We find that the combination of mutually transparent decision-making and opponent-aware learning robustly leads to mutual cooperation in a single-shot prisoner's dilemma. In a game of chicken, in which both agents try to manoeuvre their opponent towards their preferred equilibrium, converging to a mutually beneficial outcome turns out to be much harder, and opponent-aware learning can even lead to worst-case outcomes for both agents. This highlights the need to develop opponent-aware learning algorithms that achieve acceptable outcomes in social dilemmas involving an equilibrium selection problem.
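To make the transparency mechanism concrete, here is a minimal Python sketch under illustrative assumptions, not the paper's exact formulation: with probability at least epsilon an agent plays an opponent-independent ("grounded") action, and otherwise it simulates its transparent opponent's decision procedure and responds to the predicted action. The names Agent, grounded_action, and respond, as well as the mirror strategies, are hypothetical; the point is that every level of the mutual prediction terminates with probability at least epsilon, so the recursion halts with probability 1 and induces a well-defined action distribution.

import random

EPS = 0.1          # minimum probability of an opponent-independent response
C, D = "C", "D"    # cooperate / defect

class Agent:
    def __init__(self, grounded_action, respond):
        self.grounded_action = grounded_action  # opponent-independent fallback
        self.respond = respond                  # reply to a predicted opponent action

def act(agent, opponent):
    """Sample `agent`'s action under mutually transparent decision-making."""
    if random.random() < EPS:           # grounded, opponent-independent case
        return agent.grounded_action
    prediction = act(opponent, agent)   # recursively simulate the opponent
    return agent.respond(prediction)    # respond to the predicted action

# Two imitators with different grounded actions: each copies whatever
# action it predicts its opponent will take.
mirror = lambda action: action
agent1 = Agent(grounded_action=C, respond=mirror)
agent2 = Agent(grounded_action=D, respond=mirror)

# Monte Carlo estimate of P(agent1 cooperates). Unrolling the recursion gives
# p1 = EPS + (1 - EPS) * p2 and p2 = (1 - EPS) * p1, so p1 = EPS / (1 - (1 - EPS)^2),
# i.e. p1 ≈ 0.526 for EPS = 0.1.
samples = [act(agent1, agent2) for _ in range(100_000)]
print(sum(a == C for a in samples) / len(samples))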