Paper Title
Reinforcement Learning in a Physics-Inspired Semi-Markov Environment
Paper Authors
Paper Abstract
Reinforcement learning (RL) has been demonstrated to have great potential in many applications of scientific discovery and design. Recent work includes, for example, the design of new structures and compositions of molecules for therapeutic drugs. Much of the existing work related to the application of RL to scientific domains, however, assumes that the available state representation obeys the Markov property. For reasons associated with time, cost, sensor accuracy, and gaps in scientific knowledge, many scientific design and discovery problems do not satisfy the Markov property. Thus, something other than a Markov decision process (MDP) should be used to plan and find the optimal policy. In this paper, we present a physics-inspired semi-Markov RL environment, namely the phase change environment. In addition, we evaluate the performance of value-based RL algorithms for both MDPs and partially observable MDPs (POMDPs) on the proposed environment. Our results demonstrate that deep recurrent Q-networks (DRQNs) significantly outperform deep Q-networks (DQNs), and that DRQNs benefit from training with hindsight experience replay. Implications for the use of semi-Markovian RL and POMDPs for scientific laboratories are also discussed.
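The abstract does not spell out the dynamics of the phase change environment, but the kind of non-Markovian observation it alludes to can be sketched with a toy example: during a melting plateau, the observed temperature stays constant while hidden latent heat accumulates, so the current observation alone does not determine the next state. The sketch below is a minimal, self-contained illustration under that assumption; the class name `ToyPhaseChangeEnv` and all constants and dynamics are hypothetical and are not the environment used in the paper.

```python
import random


class ToyPhaseChangeEnv:
    """Illustrative sketch (not the paper's implementation): a partially
    observable heating task. The agent heats or cools a sample and only
    observes a noisy temperature reading. During the melting plateau the
    temperature stays fixed while hidden latent heat accumulates, so the
    observation alone does not satisfy the Markov property."""

    MELT_TEMP = 0.0      # temperature of the solid -> liquid plateau (assumed)
    LATENT_HEAT = 5.0    # hidden energy needed to complete melting (assumed)
    TARGET_TEMP = 3.0    # goal temperature in the liquid phase (assumed)

    def __init__(self, seed=None):
        self.rng = random.Random(seed)
        self.reset()

    def reset(self):
        self.temp = -5.0 + self.rng.uniform(-0.5, 0.5)  # observable temperature
        self.latent = 0.0                                # hidden latent-heat buffer
        self.phase = "solid"                             # hidden phase label
        self.t = 0
        return self._obs()

    def _obs(self):
        # Only a noisy temperature is exposed; phase and latent heat stay hidden.
        return self.temp + self.rng.gauss(0.0, 0.05)

    def step(self, action):
        # action: 0 = cool (-1 unit of heat), 1 = hold, 2 = heat (+1 unit)
        heat = {0: -1.0, 1: 0.0, 2: 1.0}[action]
        if self.phase == "solid":
            if self.temp < self.MELT_TEMP:
                # Below the melting point the temperature responds directly to heat.
                self.temp = min(self.MELT_TEMP, self.temp + heat)
            elif heat > 0.0:
                # At the plateau: energy fills the hidden latent-heat buffer while
                # the observed temperature does not change (non-Markovian observation).
                self.latent += heat
                if self.latent >= self.LATENT_HEAT:
                    self.phase = "liquid"
            else:
                self.temp += heat  # cooling away from the plateau
        else:
            self.temp += heat      # liquid phase: temperature moves freely
        self.t += 1
        done = self.t >= 50
        reward = 1.0 if (self.phase == "liquid"
                         and abs(self.temp - self.TARGET_TEMP) < 0.5) else 0.0
        return self._obs(), reward, done, {"phase": self.phase}


if __name__ == "__main__":
    env = ToyPhaseChangeEnv(seed=0)
    obs, total = env.reset(), 0.0
    for _ in range(50):
        obs, r, done, info = env.step(2)  # naive policy: always heat
        total += r
        if done:
            break
    print(f"return={total:.1f}, final phase={info['phase']}")
```

In a sketch like this, an agent whose value function conditions only on the current temperature cannot tell how much latent heat has already accumulated at the plateau, whereas an agent that integrates the observation history (e.g., a recurrent Q-network) can. This is consistent with, though not evidence for, the abstract's reported result that DRQNs outperform DQNs in the proposed environment.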