Paper Title

Look Closer: Bridging Egocentric and Third-Person Views with Transformers for Robotic Manipulation

Paper Authors

Rishabh Jangir, Nicklas Hansen, Sambaran Ghosal, Mohit Jain, Xiaolong Wang

Paper Abstract

Learning to solve precision-based manipulation tasks from visual feedback using Reinforcement Learning (RL) could drastically reduce the engineering efforts required by traditional robot systems. However, performing fine-grained motor control from visual inputs alone is challenging, especially with a static third-person camera as often used in previous work. We propose a setting for robotic manipulation in which the agent receives visual feedback from both a third-person camera and an egocentric camera mounted on the robot's wrist. While the third-person camera is static, the egocentric camera enables the robot to actively control its vision to aid in precise manipulation. To fuse visual information from both cameras effectively, we additionally propose to use Transformers with a cross-view attention mechanism that models spatial attention from one view to another (and vice-versa), and use the learned features as input to an RL policy. Our method improves learning over strong single-view and multi-view baselines, and successfully transfers to a set of challenging manipulation tasks on a real robot with uncalibrated cameras, no access to state information, and a high degree of task variability. In a hammer manipulation task, our method succeeds in 75% of trials versus 38% and 13% for multi-view and single-view baselines, respectively.
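
The abstract describes fusing the two camera streams with Transformers whose cross-view attention lets each view attend to spatial features of the other. Below is a minimal sketch of what such a module could look like, assuming a PyTorch implementation; the class name CrossViewAttention, the use of bidirectional nn.MultiheadAttention layers, and all dimensions are illustrative assumptions, not the authors' code.

```python
# Minimal sketch of cross-view attention between two camera feature maps
# (third-person and egocentric), assuming flattened patch features from a
# convolutional or ViT-style encoder. Names and sizes are hypothetical.
import torch
import torch.nn as nn


class CrossViewAttention(nn.Module):
    """Attend from one view's spatial features to the other's, and vice versa."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        # One attention module per direction: third-person -> egocentric and back.
        self.attn_3p_to_ego = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.attn_ego_to_3p = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_3p = nn.LayerNorm(dim)
        self.norm_ego = nn.LayerNorm(dim)

    def forward(self, feats_3p: torch.Tensor, feats_ego: torch.Tensor) -> torch.Tensor:
        # feats_*: (batch, num_patches, dim) flattened spatial feature maps.
        # Queries come from one view, keys/values from the other, so each view
        # is refined using spatial context from its counterpart.
        fused_3p, _ = self.attn_3p_to_ego(feats_3p, feats_ego, feats_ego)
        fused_ego, _ = self.attn_ego_to_3p(feats_ego, feats_3p, feats_3p)
        feats_3p = self.norm_3p(feats_3p + fused_3p)    # residual + norm
        feats_ego = self.norm_ego(feats_ego + fused_ego)
        # Pool each view and concatenate; this vector would feed the RL policy.
        return torch.cat([feats_3p.mean(dim=1), feats_ego.mean(dim=1)], dim=-1)


# Example: 7x7 feature maps (49 patches) with 128-dim embeddings per camera.
attn = CrossViewAttention(dim=128)
third_person = torch.randn(2, 49, 128)
egocentric = torch.randn(2, 49, 128)
policy_input = attn(third_person, egocentric)  # shape (2, 256)
```

The key design point, per the abstract, is that attention is computed across views rather than within each view independently, so spatial cues from the static third-person camera can ground the actively controlled wrist view (and vice versa) before the fused features reach the policy.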
