Paper Title
RLCFR: Minimize Counterfactual Regret by Deep Reinforcement Learning
Paper Authors
Paper Abstract
Counterfactual regret minimization (CFR) is a popular method for decision-making problems in two-player zero-sum games with imperfect information. Unlike existing studies that mostly focus on solving larger-scale problems or accelerating solution efficiency, we propose a framework, RLCFR, which aims to improve the generalization ability of the CFR method. In RLCFR, the game strategy is solved by CFR within a reinforcement learning framework, and the dynamic procedure of iterative, interactive strategy updating is modeled as a Markov decision process (MDP). Our method then learns a policy that selects an appropriate regret-updating rule at each step of the iteration. In addition, a stepwise reward function is formulated to learn this action policy; the reward is proportional to how well the iterated strategy performs at each step. Extensive experimental results on various games show that the generalization ability of our method is significantly improved compared with existing state-of-the-art methods.
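To make the MDP framing in the abstract concrete, below is a minimal Python sketch of how an iterative CFR procedure could be wrapped as an environment in which the agent's action chooses a regret-updating rule and the stepwise reward tracks how much the strategy improves per iteration. This is an illustrative sketch only: the solver interface (`reset`, `iterate`, `exploitability`, `state_features`), the set of update rules, and the convergence threshold are assumed placeholders, not the paper's actual implementation.

```python
class RLCFREnv:
    """Illustrative MDP wrapper around an iterative CFR solver.

    At each step, the agent picks one regret-updating rule (e.g., vanilla CFR,
    CFR+, Linear CFR), the solver runs a single iteration with that rule, and
    the reward is the resulting drop in exploitability, i.e., proportional to
    how much the iterated strategy improved at this step.
    All solver methods below are hypothetical placeholders.
    """

    def __init__(self, solver, update_rules, tol=1e-3):
        self.solver = solver              # assumed iterative CFR solver object
        self.update_rules = update_rules  # candidate regret-updating variants
        self.tol = tol                    # assumed convergence threshold
        self.prev_gap = None

    def reset(self):
        self.solver.reset()
        self.prev_gap = self.solver.exploitability()
        # State could encode iteration count, regret statistics, etc.
        return self.solver.state_features()

    def step(self, action):
        rule = self.update_rules[action]   # agent selects a regret-update rule
        self.solver.iterate(rule)          # one CFR iteration with that rule
        gap = self.solver.exploitability()
        reward = self.prev_gap - gap       # stepwise reward: improvement this step
        self.prev_gap = gap
        done = gap < self.tol
        return self.solver.state_features(), reward, done
```

Under this framing, any off-the-shelf reinforcement learning algorithm could, in principle, be trained on `RLCFREnv` to learn which regret-updating rule to apply at each iteration; the particular RL algorithm and state encoding used by RLCFR are not specified in the abstract.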