Paper Title
Sample-Efficient Reinforcement Learning in the Presence of Exogenous Information
Paper Authors
Paper Abstract
In real-world reinforcement learning applications the learner's observation space is ubiquitously high-dimensional with both relevant and irrelevant information about the task at hand. Learning from high-dimensional observations has been the subject of extensive investigation in supervised learning and statistics (e.g., via sparsity), but analogous issues in reinforcement learning are not well understood, even in finite state/action (tabular) domains. We introduce a new problem setting for reinforcement learning, the Exogenous Markov Decision Process (ExoMDP), in which the state space admits an (unknown) factorization into a small controllable (or, endogenous) component and a large irrelevant (or, exogenous) component; the exogenous component is independent of the learner's actions, but evolves in an arbitrary, temporally correlated fashion. We provide a new algorithm, ExoRL, which learns a near-optimal policy with sample complexity polynomial in the size of the endogenous component and nearly independent of the size of the exogenous component, thereby offering a doubly-exponential improvement over off-the-shelf algorithms. Our results highlight for the first time that sample-efficient reinforcement learning is possible in the presence of exogenous information, and provide a simple, user-friendly benchmark for investigation going forward.
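As a rough illustration of the setting (a minimal sketch inferred from the abstract, not the paper's formal definition; the symbols s, x, T_en, T_ex, and R below are assumed notation), the ExoMDP structure can be summarized by a transition and reward factorization of the form

\[
T\bigl((s', x') \mid (s, x), a\bigr) \;=\; T_{\mathrm{en}}(s' \mid s, a)\, T_{\mathrm{ex}}(x' \mid x),
\qquad
R\bigl((s, x), a\bigr) \;=\; R(s, a),
\]

so that the learner's actions influence only the small endogenous component s, while the large exogenous component x follows its own action-independent (and possibly temporally correlated) dynamics and does not enter the reward.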