Paper Title

Using Reinforcement Learning to Herd a Robotic Swarm to a Target Distribution

Authors

Zahi M. Kakish, Karthik Elamvazhuthi, Spring Berman

Abstract

In this paper, we present a reinforcement learning approach to designing a control policy for a "leader" agent that herds a swarm of "follower" agents, via repulsive interactions, as quickly as possible to a target probability distribution over a strongly connected graph. The leader control policy is a function of the swarm distribution, which evolves over time according to a mean-field model in the form of an ordinary difference equation. The dependence of the policy on agent populations at each graph vertex, rather than on individual agent activity, simplifies the observations required by the leader and enables the control strategy to scale with the number of agents. Two Temporal-Difference learning algorithms, SARSA and Q-Learning, are used to generate the leader control policy based on the follower agent distribution and the leader's location on the graph. A simulation environment corresponding to a grid graph with 4 vertices was used to train and validate the control policies for follower agent populations ranging from 10 to 100. Finally, the control policies trained on 100 simulated agents were used to successfully redistribute a physical swarm of 10 small robots to a target distribution among 4 spatial regions.
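
To make the abstract's setup concrete, below is a minimal, hypothetical Python sketch of one of the ingredients it describes: tabular Q-learning of a leader policy over (leader vertex, discretized follower distribution) states on a 4-vertex grid graph, where the follower populations evolve under an assumed mean-field difference equation in which a fixed fraction of the followers at the leader's vertex is repelled uniformly to its neighbors. The graph, target distribution, repulsion rate `BETA`, discretization, and reward are all illustrative assumptions, not the paper's actual model.

```python
import numpy as np
from collections import defaultdict

# 2x2 grid graph (4 vertices, strongly connected): adjacency list
NEIGHBORS = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
TARGET = np.array([0.1, 0.2, 0.3, 0.4])  # hypothetical target distribution
BETA = 0.5   # assumed repulsion rate: fraction of followers pushed off the leader's vertex per step
BINS = 5     # discretization level per population fraction for the tabular state

def step(x, leader, action):
    """One step of an assumed mean-field difference equation: a fraction BETA
    of the followers at the leader's vertex moves uniformly to its neighbors,
    then the leader moves to the chosen adjacent vertex."""
    x = x.copy()
    out = BETA * x[leader]
    x[leader] -= out
    for v in NEIGHBORS[leader]:
        x[v] += out / len(NEIGHBORS[leader])
    return x, action

def encode(x, leader):
    # Discretize the swarm distribution so the state space stays finite
    return (leader, tuple(np.minimum((x * BINS).astype(int), BINS - 1)))

def reward(x):
    return -np.abs(x - TARGET).sum()  # negative L1 distance to the target

# Tabular Q-learning over (leader position, discretized swarm distribution)
Q = defaultdict(float)
alpha, gamma, eps = 0.1, 0.95, 0.1
rng = np.random.default_rng(0)

for episode in range(5000):
    x, leader = np.full(4, 0.25), 0  # start from a uniform follower distribution
    for t in range(50):
        s = encode(x, leader)
        acts = NEIGHBORS[leader]  # leader may move to any adjacent vertex
        a = rng.choice(acts) if rng.random() < eps else max(acts, key=lambda u: Q[(s, u)])
        x, leader = step(x, leader, a)
        s2 = encode(x, leader)
        best = max(Q[(s2, u)] for u in NEIGHBORS[leader])
        Q[(s, a)] += alpha * (reward(x) + gamma * best - Q[(s, a)])
```

A greedy rollout of the learned Q-table would then serve as the leader control policy. The paper's SARSA variant would differ only in the update target, using the Q-value of the action actually taken at the next state rather than the maximum over next actions.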
