Paper Title
When Transformer Meets Robotic Grasping: Exploits Context for Efficient Grasp Detection
Paper Authors
Paper Abstract
In this paper, we present a transformer-based architecture, namely TF-Grasp, for robotic grasp detection. The developed TF-Grasp framework has two elaborate designs that make it well suited for visual grasping tasks. The first key design is that we adopt local window attention to capture the local contextual information and detailed features of graspable objects. We then apply cross-window attention to model the long-term dependencies between distant pixels. Object knowledge, environmental configurations, and the relationships between different visual entities are aggregated for subsequent grasp detection. The second key design is that we build a hierarchical encoder-decoder architecture with skip connections, delivering shallow features from the encoder to the decoder to enable multi-scale feature fusion. Owing to its powerful attention mechanism, TF-Grasp can simultaneously capture local information (e.g., the contours of objects) and model long-term connections, such as the relationships between distinct visual concepts in clutter. Extensive computational experiments demonstrate that TF-Grasp outperforms state-of-the-art convolutional grasping models, attaining higher accuracies of 97.99% and 94.6% on the Cornell and Jacquard grasping datasets, respectively. Real-world experiments using a 7-DoF Franka Emika Panda robot also demonstrate its capability of grasping unseen objects in a variety of scenarios. The code and pre-trained models will be available at https://github.com/WangShaoSUN/grasp-transformer.
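To make the two ideas in the abstract concrete, below is a minimal PyTorch sketch of (1) self-attention restricted to local windows of a feature map and (2) a small encoder-decoder with a skip connection and dense output heads. Every module name, tensor size, and the choice of three grasp maps (quality, angle, width) are illustrative assumptions, not the authors' TF-Grasp implementation; refer to the repository linked above for the real code.

# Minimal sketch (assumptions only, not the official TF-Grasp code).
import torch
import torch.nn as nn


class WindowAttention(nn.Module):
    """Multi-head self-attention computed inside non-overlapping local windows."""

    def __init__(self, dim, window=4, heads=4):
        super().__init__()
        self.window = window
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                           # x: (B, C, H, W)
        b, c, h, w = x.shape
        s = self.window
        # partition the feature map into (H/s * W/s) windows of s*s tokens each
        t = x.reshape(b, c, h // s, s, w // s, s).permute(0, 2, 4, 3, 5, 1)
        t = t.reshape(-1, s * s, c)                 # (B * num_windows, s*s, C)
        t, _ = self.attn(t, t, t)                   # attention only among tokens of one window
        t = t.reshape(b, h // s, w // s, s, s, c).permute(0, 5, 1, 3, 2, 4)
        return t.reshape(b, c, h, w)


class TinyGraspNet(nn.Module):
    """Toy encoder-decoder with a skip connection and window attention at the bottleneck."""

    def __init__(self, dim=32):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(1, dim, 3, padding=1), nn.ReLU())
        self.enc2 = nn.Sequential(nn.Conv2d(dim, dim * 2, 3, stride=2, padding=1), nn.ReLU())
        self.attn = WindowAttention(dim * 2, window=4)
        self.up = nn.ConvTranspose2d(dim * 2, dim, 2, stride=2)
        self.dec = nn.Sequential(nn.Conv2d(dim * 2, dim, 3, padding=1), nn.ReLU())
        # three dense heads commonly used in planar grasp detection: quality, angle, width
        self.heads = nn.Conv2d(dim, 3, 1)

    def forward(self, x):
        s1 = self.enc1(x)                           # shallow, high-resolution features
        s2 = self.attn(self.enc2(s1))               # local-window attention on deeper features
        d = self.up(s2)
        d = self.dec(torch.cat([d, s1], dim=1))     # skip connection: fuse shallow and deep
        return self.heads(d)                        # (B, 3, H, W) grasp maps


if __name__ == "__main__":
    maps = TinyGraspNet()(torch.randn(1, 1, 64, 64))
    print(maps.shape)                               # torch.Size([1, 3, 64, 64])

The skip connection mirrors the abstract's point about multi-scale fusion: shallow encoder features carrying fine object contours are concatenated with upsampled deep features before prediction, while the window attention models context within each local neighborhood.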