Title
Tackling Morpion Solitaire with AlphaZero-like Ranked Reward Reinforcement Learning
Authors
Abstract
Morpion Solitaire is a popular single-player game, played with paper and pencil. Because of its large state space (on the order of the game of Go), traditional search algorithms such as MCTS have not been able to find good solutions. A later algorithm, Nested Rollout Policy Adaptation, was able to find a new record of 82 steps, albeit with large computational resources. To the best of our knowledge, no further progress has been reported since that record was set, about a decade ago. In this paper we take the recent impressive performance of the deep self-play reinforcement learning approaches of AlphaGo/AlphaZero as inspiration to design a searcher for Morpion Solitaire. A challenge of Morpion Solitaire is the sparsity of its reward: there are few win/loss signals. Instead, we use an approach known as ranked reward to create a self-play reinforcement learning framework for Morpion Solitaire. This enables us to find medium-quality solutions with reasonable computational effort. Our record is a 67-step solution, which is very close to the human best (68), with no adaptation to the problem other than the use of ranked reward. We list many avenues for further improvement.
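The ranked-reward mechanism mentioned above can be sketched as follows. This is a minimal, illustrative implementation of the general ranked-reward idea (compare each episode's score against a running percentile of recent self-play scores to produce a binary win/loss signal); the class name, the `alpha` parameter, and the tie handling are assumptions for illustration, not taken from the paper's actual code.

```python
from collections import deque

class RankedReward:
    """Sketch of ranked reward: turn a single-player score into a win/loss
    signal by ranking it against recent self-play episodes (hypothetical
    parameter names; ties are handled simplistically here)."""

    def __init__(self, alpha=0.75, buffer_size=256):
        self.alpha = alpha                       # percentile threshold
        self.scores = deque(maxlen=buffer_size)  # recent episode scores

    def threshold(self):
        # score at the alpha-percentile of the recent score buffer
        ordered = sorted(self.scores)
        idx = min(int(self.alpha * len(ordered)), len(ordered) - 1)
        return ordered[idx]

    def reward(self, score):
        # +1 if this episode beats the recent alpha-percentile, else -1;
        # this gives the scored game a dense binary training signal
        if not self.scores:
            r = 1  # no history yet: treat the first episode as a win
        else:
            r = 1 if score > self.threshold() else -1
        self.scores.append(score)
        return r
```

A searcher trained this way always chases its own recent performance: as the score distribution improves, the percentile threshold rises, so the binary signal stays informative even though the game itself never emits a win or a loss.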