Paper title
Efficient Gesture Recognition for the Assistance of Visually Impaired People using Multi-Head Neural Networks
Authors
Abstract
This paper proposes an interactive system for mobile devices, controlled by hand gestures, that is aimed at helping people with visual impairments. The system allows the user to interact with the device by making simple static and dynamic hand gestures. Each gesture triggers a different action, such as object recognition, scene description or image scaling (e.g., pointing a finger at an object produces a description of it). The system is based on a multi-head neural network architecture that first detects and classifies the gesture and then, depending on the gesture detected, runs a second stage that carries out the corresponding action. This multi-head architecture optimizes the resources required to perform different tasks simultaneously, because the information obtained from an initial backbone is reused by the different processes in the second stage. To train and evaluate the system, a dataset of about 40k images was manually compiled and labeled, covering different types of hand gestures, backgrounds (indoors and outdoors), lighting conditions, etc. The dataset contains synthetic gestures (used to pre-train the system in order to improve the results) and real images captured with different mobile phones. Compared with the state of the art, the system obtains competitive results for the different actions it performs, such as the accuracy of gesture classification and localization and the generation of descriptions for objects and scenes.
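The two-stage control flow described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the neural network components are replaced by plain Python stand-ins, and every name (`backbone`, `gesture_head`, `SECOND_STAGE`, `run`), the gesture labels "frame" and "pinch", and the threshold rule are hypothetical. Only the overall shape comes from the abstract: backbone features are computed once, a first head classifies the gesture, and the detected gesture selects which second-stage head runs on the same features. The "point" to object description mapping follows the example given in the abstract.

```python
from typing import Callable, Dict, List, Tuple

Features = List[float]


def backbone(image: List[float]) -> Features:
    """Hypothetical stand-in for the feature extractor (simple statistics)."""
    mean = sum(image) / len(image)
    return [mean, max(image), min(image)]


def gesture_head(features: Features) -> str:
    """Hypothetical stand-in for the first-stage gesture classifier."""
    mean = features[0]
    if mean < 0.33:
        return "point"
    if mean < 0.66:
        return "frame"
    return "pinch"


def describe_object(features: Features) -> str:
    return f"object description from {len(features)} features"


def describe_scene(features: Features) -> str:
    return f"scene description from {len(features)} features"


def zoom_image(features: Features) -> str:
    return f"zoomed image from {len(features)} features"


# Assumed gesture-to-action routing; only "point" -> object description
# is taken from the abstract's example, the rest is illustrative.
SECOND_STAGE: Dict[str, Callable[[Features], str]] = {
    "point": describe_object,
    "frame": describe_scene,
    "pinch": zoom_image,
}


def run(image: List[float]) -> Tuple[str, str]:
    features = backbone(image)        # computed once, reused by both stages
    gesture = gesture_head(features)  # stage 1: classify the gesture
    action = SECOND_STAGE[gesture]    # stage 2: pick the matching head
    return gesture, action(features)


if __name__ == "__main__":
    print(run([0.1, 0.2, 0.3]))
```

Because only the selected second-stage head runs and it consumes features that already exist, the cost of the backbone is paid once per image regardless of which action is triggered, which is the resource argument the abstract makes.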