Paper Title
Reinforcement Learning
Paper Authors
Paper Abstract
Reinforcement learning (RL) is a general framework for adaptive control, which has proven to be efficient in many domains, e.g., board games, video games, or autonomous vehicles. In such problems, an agent faces a sequential decision-making problem where, at every time step, it observes its state, performs an action, receives a reward, and moves to a new state. An RL agent learns a good policy (or controller) by trial and error, based on observations and numeric reward feedback on the previously performed action. In this chapter, we present the basic framework of RL and recall the two main families of approaches that have been developed to learn a good policy. The first one, value-based, consists in estimating the value of an optimal policy, from which a policy can be recovered, while the other, called policy search, works directly in a policy space. Actor-critic methods can be seen as a policy search technique in which a learned policy value guides the policy improvement. In addition, we give an overview of some extensions of the standard RL framework, notably when risk-averse behavior needs to be taken into account or when rewards are not available or not known.
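As an illustration of the trial-and-error loop and the value-based family described in the abstract, here is a minimal sketch of tabular Q-learning on a toy chain environment. The environment, its size, and all hyperparameters (ALPHA, GAMMA, EPSILON, the episode count) are illustrative assumptions for this sketch, not details taken from the paper.

import random

# Toy chain MDP: states 0..4, state 4 is terminal and gives reward 1.
# This environment is a made-up example, not from the paper.
N_STATES = 5
ACTIONS = [0, 1]  # 0 = move left, 1 = move right
ALPHA, GAMMA, EPSILON = 0.1, 0.95, 0.1  # assumed hyperparameters

def step(state, action):
    """Observe state, perform action, receive reward, move to a new state."""
    next_state = max(0, state - 1) if action == 0 else min(N_STATES - 1, state + 1)
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    done = next_state == N_STATES - 1
    return next_state, reward, done

# Value estimates Q[s][a] of an optimal policy, learned by trial and error.
Q = [[0.0 for _ in ACTIONS] for _ in range(N_STATES)]

for episode in range(500):
    state, done = 0, False
    while not done:
        # Epsilon-greedy exploration: mostly exploit the current estimates.
        if random.random() < EPSILON:
            action = random.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: Q[state][a])
        next_state, reward, done = step(state, action)
        # Q-learning update: move the estimate toward the bootstrapped target.
        target = reward + GAMMA * max(Q[next_state])
        Q[state][action] += ALPHA * (target - Q[state][action])
        state = next_state

# A policy is recovered from the learned values by acting greedily;
# here it should select "right" (1) in every non-terminal state.
print([max(ACTIONS, key=lambda a: Q[s][a]) for s in range(N_STATES)])

Recovering the greedy policy from Q at the end is exactly the sense in which, as the abstract puts it, a policy "can be recovered" from the value of an optimal policy; a policy-search or actor-critic method would instead parameterize and improve the policy directly.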