Paper Title
Accelerating and Improving AlphaZero Using Population Based Training
Paper Authors
Paper Abstract
AlphaZero has been very successful in many games. Unfortunately, it still consumes a huge amount of computing resources, the majority of which is spent in self-play. Hyperparameter tuning exacerbates the training cost, since each hyperparameter configuration requires its own training run, during which it generates its own self-play records. As a result, multiple runs are usually needed to cover different hyperparameter configurations. This paper proposes using population based training (PBT) to tune hyperparameters dynamically and improve strength during training. Another significant advantage is that this method requires only a single run, while incurring a small additional time cost: following the AlphaZero training algorithm, the time spent generating self-play records remains unchanged, and only the optimization time increases. In our experiments on 9x9 Go, the PBT method achieves a higher win rate than the baselines, each of which uses its own hyperparameter configuration and is trained individually. For 19x19 Go, with PBT, we are able to obtain improvements in playing strength. Specifically, the PBT agent obtains up to a 74% win rate against ELF OpenGo, an open-source state-of-the-art AlphaZero program, using a neural network of comparable capacity. This compares to a saturated non-PBT agent, which achieves a 47% win rate against ELF OpenGo under the same circumstances.
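As background for the abstract's claim that PBT adjusts hyperparameters dynamically within a single run, the following is a minimal sketch of the standard PBT exploit/explore step. The class name, the particular hyperparameters (learning rate, value loss weight), the bottom fraction, and the perturbation factors are illustrative assumptions, not the paper's exact configuration.

```python
import copy
import random
from dataclasses import dataclass, field

@dataclass
class Member:
    """One member of the PBT population (hypothetical structure)."""
    hyperparams: dict                               # e.g. {"lr": 0.01, "value_loss_weight": 1.0}
    weights: dict = field(default_factory=dict)     # stand-in for the network weights
    win_rate: float = 0.0                           # fitness from evaluation games

def exploit_and_explore(population, bottom_frac=0.25, perturb_factors=(0.8, 1.2)):
    """After an evaluation round, the weakest members copy the weights and
    hyperparameters of the strongest members (exploit), then perturb the
    copied hyperparameters (explore)."""
    ranked = sorted(population, key=lambda m: m.win_rate, reverse=True)
    cutoff = max(1, int(len(ranked) * bottom_frac))
    top, bottom = ranked[:cutoff], ranked[-cutoff:]
    for loser in bottom:
        winner = random.choice(top)
        loser.weights = copy.deepcopy(winner.weights)           # exploit: inherit weights
        loser.hyperparams = {
            key: value * random.choice(perturb_factors)         # explore: perturb hyperparameters
            for key, value in winner.hyperparams.items()
        }
    return population
```

In the AlphaZero setting described in the abstract, the population members would share the pool of self-play records, so only the optimization step is multiplied across the population; this is why the additional wall-clock cost of PBT stays small relative to self-play generation.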