Title
Dynamic allocation of limited memory resources in reinforcement learning
Authors
Abstract
Biological brains are inherently limited in their capacity to process and store information, but are nevertheless capable of solving complex tasks with apparent ease. Intelligent behavior is related to these limitations, since resource constraints drive the need to generalize and assign importance differentially to features in the environment or memories of past experiences. Recently, there have been parallel efforts in reinforcement learning and neuroscience to understand strategies adopted by artificial and biological agents to circumvent limitations in information storage. However, the two threads have been largely separate. In this article, we propose a dynamical framework to maximize expected reward under constraints of limited resources, which we implement with a cost function that penalizes precise representations of action-values in memory, each of which may vary in its precision. We derive from first principles an algorithm, Dynamic Resource Allocator (DRA), which we apply to two standard tasks in reinforcement learning and a model-based planning task, and find that it allocates more resources to items in memory that have a higher impact on cumulative rewards. Moreover, DRA learns faster when starting with a higher resource budget than what it eventually allocates for performing well on tasks, which may explain why frontal cortical areas in biological brains appear more engaged in early stages of learning before settling to lower asymptotic levels of activity. Our work provides a normative solution to the problem of learning how to allocate costly resources to a collection of uncertain memories in a manner that is capable of adapting to changes in the environment.
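To make the abstract's objective concrete, below is a minimal illustrative sketch, not the paper's implementation. It assumes action-values are stored in memory as Gaussians whose standard deviations act as the allocated resource, choices are made by sampling from memory and acting greedily, and a cost term penalizes precise (low-variance) representations. The two-context bandit, the base width `sigma_base`, the cost weight `lam`, and the simplified log-ratio cost are all assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical two-context bandit (illustrative only; not one of the paper's tasks).
# Each context has two actions with true values [1.0, 0.0]; context A is
# visited 90% of the time, context B only 10%.
true_q = {"A": np.array([1.0, 0.0]), "B": np.array([1.0, 0.0])}
p_visit = {"A": 0.9, "B": 0.1}

sigma_base = 5.0   # width of the "free" base memory distribution (assumed)
lam = 0.05         # price per nat of extra precision (assumed)

def evaluate(sigmas, n_trials=100_000):
    """Average reward when actions are chosen by sampling noisy action-values
    from memory and acting greedily, minus a cost on each item's precision."""
    reward = 0.0
    for ctx, q in true_q.items():
        noisy = q + sigmas[ctx] * rng.standard_normal((n_trials, 2))
        reward += p_visit[ctx] * q[noisy.argmax(axis=1)].mean()
    # Simplified precision cost: log(sigma_base / sigma) per memory item
    # (the entropy-reduction term of a Gaussian KL; an assumption made here).
    cost = lam * sum(np.log(sigma_base / s).sum() for s in sigmas.values())
    return reward - cost

# Two allocations with the same total precision budget (equal cost):
uniform = {"A": np.array([1.0, 1.0]), "B": np.array([1.0, 1.0])}
focused = {"A": np.array([0.5, 0.5]), "B": np.array([2.0, 2.0])}

print("uniform:", evaluate(uniform))
print("focused:", evaluate(focused))
```

With an equal total precision budget, the allocation that sharpens the memories of the frequently visited context yields the higher objective, which mirrors the abstract's claim that DRA allocates more resources to memory items with a higher impact on cumulative reward.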