Paper Title

Reinforcement Learning with Trajectory Feedback

Authors

Yonathan Efroni, Nadav Merlis, Shie Mannor

Abstract

The standard feedback model of reinforcement learning requires revealing the reward of every visited state-action pair. However, in practice, it is often the case that such frequent feedback is not available. In this work, we take a first step towards relaxing this assumption and require a weaker form of feedback, which we refer to as \emph{trajectory feedback}. Instead of observing the reward obtained after every action, we assume we only receive a score that represents the quality of the whole trajectory observed by the agent, namely, the sum of all rewards obtained over this trajectory. We extend reinforcement learning algorithms to this setting, based on least-squares estimation of the unknown reward, for both the known and unknown transition model cases, and study the performance of these algorithms by analyzing their regret. For cases where the transition model is unknown, we offer a hybrid optimistic-Thompson Sampling approach that results in a tractable algorithm.
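As a rough illustration of the idea (not the paper's exact algorithm), the reward-estimation step can be viewed as ordinary least squares: each trajectory contributes one linear equation whose unknowns are the per-(state, action) mean rewards, whose coefficients are the visit counts along that trajectory, and whose observation is the trajectory score. The sketch below uses a small synthetic tabular setting with randomly generated trajectories and hypothetical dimensions; all names and sizes are illustrative assumptions.

```python
# Minimal sketch (illustrative, not the paper's algorithm): recover per-(state, action)
# mean rewards from trajectory-sum feedback via least squares.
import numpy as np

rng = np.random.default_rng(0)

n_states, n_actions, horizon, n_trajectories = 5, 3, 10, 200  # hypothetical sizes
d = n_states * n_actions                      # number of unknown mean rewards
true_r = rng.uniform(0.0, 1.0, size=d)        # hypothetical ground-truth rewards

X = np.zeros((n_trajectories, d))             # row k holds (s, a) visit counts of trajectory k
y = np.zeros(n_trajectories)                  # observed trajectory scores
for k in range(n_trajectories):
    # Random states/actions stand in for trajectories generated by a real policy and MDP.
    states = rng.integers(n_states, size=horizon)
    actions = rng.integers(n_actions, size=horizon)
    idx = states * n_actions + actions
    np.add.at(X[k], idx, 1.0)                 # visit counts form the design matrix
    # Trajectory feedback: only the (noisy) sum of rewards over the trajectory is revealed.
    y[k] = X[k] @ true_r + rng.normal(scale=0.1)

# Least-squares estimate of the unknown per-(state, action) rewards.
r_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print("max abs estimation error:", np.max(np.abs(r_hat - true_r)))
```

In the paper's setting the visit counts would come from the agent's own trajectories rather than random sampling, and the uncertainty of this estimate is what the optimistic and Thompson-Sampling-style exploration components act on.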
