Paper Title

Reinforcement Learning with Feedback Graphs

Authors

Christoph Dann, Yishay Mansour, Mehryar Mohri, Ayush Sekhari, Karthik Sridharan

Abstract

We study episodic reinforcement learning in Markov decision processes when the agent receives additional feedback per step in the form of several transition observations. Such additional observations are available in a range of tasks through extended sensors or prior knowledge about the environment (e.g., when certain actions yield similar outcomes). We formalize this setting using a feedback graph over state-action pairs and show that model-based algorithms can leverage the additional feedback for more sample-efficient learning. We give a regret bound that, ignoring logarithmic factors and lower-order terms, depends only on the size of the maximum acyclic subgraph of the feedback graph, in contrast with a polynomial dependency on the number of states and actions in the absence of a feedback graph. Finally, we highlight challenges when leveraging a small dominating set of the feedback graph as compared to the bandit setting and propose a new algorithm that can use knowledge of such a dominating set for more sample-efficient learning of a near-optimal policy.
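The following is a minimal, illustrative Python sketch (the class and method names are hypothetical, not from the paper) of the mechanism the abstract describes: executing one state-action pair also reveals sampled transitions for its neighbors in the feedback graph, so a model-based learner accumulates transition counts faster than from its own transitions alone.

from collections import defaultdict

# Illustrative sketch only; not the paper's algorithm.
class FeedbackGraphModel:
    def __init__(self, feedback_graph):
        # feedback_graph: dict mapping (s, a) -> iterable of (s', a') pairs
        # whose transitions are also observed whenever (s, a) is executed.
        self.feedback_graph = feedback_graph
        # Transition counts: (s, a) -> {next_state: count}.
        self.counts = defaultdict(lambda: defaultdict(int))

    def update(self, s, a, observed):
        # observed: dict mapping each revealed pair, including (s, a)
        # itself, to the next state sampled for it at this step.
        for pair in [(s, a), *self.feedback_graph.get((s, a), ())]:
            if pair in observed:
                self.counts[pair][observed[pair]] += 1

    def estimated_transitions(self, s, a):
        # Empirical transition distribution for (s, a), which a model-based
        # planner could use (e.g., value iteration on the empirical MDP).
        total = sum(self.counts[(s, a)].values())
        return {ns: c / total for ns, c in self.counts[(s, a)].items()} if total else {}

# Example: playing (0, 'left') also reveals a transition for (0, 'right').
model = FeedbackGraphModel({(0, 'left'): [(0, 'right')]})
model.update(0, 'left', {(0, 'left'): 1, (0, 'right'): 2})
print(model.estimated_transitions(0, 'right'))  # {2: 1.0} without ever playing that pair

Intuitively, a denser feedback graph means each step informs many state-action pairs at once, which is how the abstract's regret bound can depend on the size of the maximum acyclic subgraph of the graph rather than on the full number of states and actions.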
