Paper Title
Hierarchical Reinforcement Learning in StarCraft II with Human Expertise in Subgoals Selection
Paper Authors
Paper Abstract
This work is inspired by recent advances in hierarchical reinforcement learning (HRL) (Barto and Mahadevan 2003; Hengst 2010), and by improvements in learning efficiency from heuristic-based subgoal selection, experience replay (Lin 1993; Andrychowicz et al. 2017), and task-based curriculum learning (Bengio et al. 2009; Zaremba and Sutskever 2014). We propose a new method that integrates HRL, experience replay, and effective subgoal selection through an implicit curriculum design based on human expertise, to support sample-efficient learning and enhance the interpretability of the agent's behavior. Human expertise remains indispensable in many areas such as medicine (Buch, Ahmed, and Maruthappu 2018) and law (Cath 2018), where interpretability, explainability, and transparency are crucial in the decision-making process for ethical and legal reasons. Our method simplifies the complex task sets required to achieve the overall objectives by decomposing them into subgoals at different levels of abstraction. Incorporating relevant subjective knowledge also significantly reduces the computational resources spent on exploration in RL, especially in high-speed, changing, and complex environments where the transition dynamics cannot be effectively learned and modelled in a short time. Experimental results in two StarCraft II (SC2) (Vinyals et al. 2017) minigames demonstrate that our method achieves better sample efficiency than flat and end-to-end RL methods, and provides an effective way of explaining the agent's performance.
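To make the abstract's idea of combining human-specified subgoals, an implicit curriculum, and experience replay concrete, the following minimal Python sketch illustrates one plausible arrangement. It is an assumption-laden illustration, not the paper's SC2 implementation: the subgoal names, the curriculum ordering, the reward values, and the hindsight-style relabeling scheme are all hypothetical choices made for demonstration.

```python
# Illustrative sketch only: human-specified subgoals used as an implicit curriculum,
# with a replay buffer that relabels transitions by the subgoal actually achieved
# (hindsight-style). All names and reward values are assumptions for demonstration.
import random
from collections import deque, namedtuple

Transition = namedtuple("Transition", "state action next_state subgoal reward done")

# Human-specified subgoals ordered as an implicit curriculum (hypothetical names).
SUBGOALS = ["collect_minerals", "build_barracks", "train_marines", "defeat_enemies"]


class ReplayBuffer:
    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, transition):
        self.buffer.append(transition)

    def relabel_with_achieved(self, transition, achieved_subgoal):
        """Hindsight-style relabeling: store the transition again as if the achieved
        subgoal had been the intended one, so it carries a positive reward."""
        self.buffer.append(transition._replace(subgoal=achieved_subgoal, reward=1.0))

    def sample(self, batch_size):
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))


def select_subgoal(progress):
    """High-level policy stub: follow the human-designed curriculum order,
    advancing to the next subgoal once the current one is considered mastered."""
    index = min(progress, len(SUBGOALS) - 1)
    return SUBGOALS[index]


if __name__ == "__main__":
    buffer = ReplayBuffer()
    subgoal = select_subgoal(progress=0)
    # Dummy transition standing in for one environment step under the current subgoal.
    t = Transition(state="s0", action="a0", next_state="s1",
                   subgoal=subgoal, reward=0.0, done=False)
    buffer.add(t)
    # Suppose the low-level policy incidentally achieved a different subgoal.
    buffer.relabel_with_achieved(t, achieved_subgoal="build_barracks")
    print(buffer.sample(2))
```

In this sketch the human expertise enters only through the fixed subgoal list and its ordering; the relabeling step is what lets sparse successes on easier subgoals be reused as learning signal, which is the sample-efficiency mechanism the abstract alludes to.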