Paper Title


Improving Continuous Sign Language Recognition with Consistency Constraints and Signer Removal

Paper Authors

Zuo, Ronglai, Mak, Brian

Paper Abstract


Most deep-learning-based continuous sign language recognition (CSLR) models share a similar backbone consisting of a visual module, a sequential module, and an alignment module. However, due to limited training samples, a connectionist temporal classification loss may not train such CSLR backbones sufficiently. In this work, we propose three auxiliary tasks to enhance the CSLR backbones. The first task enhances the visual module, which is sensitive to the insufficient training problem, from the perspective of consistency. Specifically, since the information of sign languages is mainly included in signers' facial expressions and hand movements, a keypoint-guided spatial attention module is developed to enforce the visual module to focus on informative regions, i.e., spatial attention consistency. Second, noticing that both the output features of the visual and sequential modules represent the same sentence, to better exploit the backbone's power, a sentence embedding consistency constraint is imposed between the visual and sequential modules to enhance the representation power of both features. We name the CSLR model trained with the above auxiliary tasks as consistency-enhanced CSLR, which performs well on signer-dependent datasets in which all signers appear during both training and testing. To make it more robust for the signer-independent setting, a signer removal module based on feature disentanglement is further proposed to remove signer information from the backbone. Extensive ablation studies are conducted to validate the effectiveness of these auxiliary tasks. More remarkably, with a transformer-based backbone, our model achieves state-of-the-art or competitive performance on five benchmarks: PHOENIX-2014, PHOENIX-2014-T, PHOENIX-2014-SI, CSL, and CSL-Daily. Code and models are available at https://github.com/2000ZRL/LCSA_C2SLR_SRM.
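The abstract's sentence embedding consistency constraint pulls the visual and sequential modules' features toward a shared sentence-level representation. The paper defines the exact formulation; as a rough illustration only, the idea can be sketched with mean pooling and cosine distance (the function names, pooling choice, and loss form here are assumptions, not the authors' implementation):

```python
import numpy as np

def sentence_embedding(features):
    # Mean-pool frame-level features of shape (T, d) into one (d,) sentence vector.
    # The paper may use a different pooling; mean pooling is an assumption here.
    return features.mean(axis=0)

def consistency_loss(visual_feats, sequential_feats, eps=1e-8):
    # Illustrative consistency term: cosine distance between the two
    # sentence embeddings, so both modules represent the same sentence.
    v = sentence_embedding(visual_feats)
    s = sentence_embedding(sequential_feats)
    cos = np.dot(v, s) / (np.linalg.norm(v) * np.linalg.norm(s) + eps)
    return 1.0 - cos

# Identical features from both modules give a loss near zero.
feats = np.random.RandomState(0).randn(10, 64)
print(consistency_loss(feats, feats))  # close to 0.0
```

In a training loop this term would be added, with some weight, to the main connectionist temporal classification loss, so gradients flow into both the visual and sequential modules.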
