Paper Title

Human AI interaction loop training: New approach for interactive reinforcement learning

Paper Authors

Navidi, Neda

Paper Abstract

Reinforcement Learning (RL) in various decision-making tasks of machine learning provides effective results, with an agent learning from a stand-alone reward function. However, it presents unique challenges with large environment state and action spaces, as well as in the determination of rewards. This complexity, arising from the high dimensionality and continuousness of the environments considered herein, calls for a large number of learning trials to learn about the environment through Reinforcement Learning. Imitation Learning (IL) offers a promising solution to those challenges by using a teacher. In IL, the learning process can take advantage of human-sourced assistance and/or control over the agent and environment. A human teacher and an agent learner are considered in this study. The teacher takes part in the agent's training toward dealing with the environment, tackling a specific objective, and achieving a predefined goal. Within that paradigm, however, existing IL approaches have the drawback of expecting extensive demonstration information in long-horizon problems. This paper proposes a novel approach combining IL with different types of RL methods, namely state-action-reward-state-action (SARSA) and asynchronous advantage actor-critic (A3C) agents, to overcome the problems of both stand-alone systems. It addresses how to effectively leverage teacher feedback, whether direct binary or indirect detailed feedback, for the agent learner to learn sequential decision-making policies. The results of this study on various OpenAI Gym environments show that this algorithmic method can be incorporated in different combinations, significantly decreasing both human effort and the tedious exploration process.
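
The kind of combination the abstract describes, blending a human teacher's signal into a value-based RL update, can be illustrated with a minimal SARSA sketch on an OpenAI Gym toy task. This is not the paper's actual algorithm: the environment choice (FrozenLake-v1), the `teacher_feedback` stub, and the blending weight `beta` are illustrative assumptions, and the API calls assume gym >= 0.26.

```python
# Minimal sketch: SARSA where a teacher's binary feedback is blended into the reward.
# Environment name, feedback stub, and hyperparameters are illustrative assumptions,
# not details taken from the paper.
import numpy as np
import gym

env = gym.make("FrozenLake-v1")                     # assumed discrete toy environment
n_states, n_actions = env.observation_space.n, env.action_space.n
Q = np.zeros((n_states, n_actions))

alpha, gamma, epsilon = 0.1, 0.99, 0.1
beta = 0.5                                          # assumed weight on teacher feedback

def teacher_feedback(state, action):
    """Hypothetical stand-in for a human teacher: +1 (approve) or -1 (disapprove)."""
    return 0.0                                      # placeholder; a real system would query the human

def epsilon_greedy(state):
    if np.random.rand() < epsilon:
        return env.action_space.sample()
    return int(np.argmax(Q[state]))

for episode in range(500):
    state, _ = env.reset()
    action = epsilon_greedy(state)
    done = False
    while not done:
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        next_action = epsilon_greedy(next_state)
        # Blend the environment reward with the teacher's binary signal.
        shaped = reward + beta * teacher_feedback(state, action)
        # Standard SARSA update applied to the shaped reward.
        Q[state, action] += alpha * (shaped + gamma * Q[next_state, next_action] - Q[state, action])
        state, action = next_state, next_action
```

A detailed (non-binary) teacher signal could replace the stub with, for example, a suggested action whose agreement with the agent's choice determines the shaping term; the update rule itself stays unchanged.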
