TSPNet: Hierarchical Feature Learning via Temporal Semantic Pyramid for Sign Language Translation

Title

TSPNet: Hierarchical Feature Learning via Temporal Semantic Pyramid for Sign Language Translation

Authors

Li, Dongxu; Xu, Chenchen; Yu, Xin; Zhang, Kaihao; Swift, Ben; Suominen, Hanna; Li, Hongdong

Abstract

Sign language translation (SLT) aims to interpret sign video sequences into text-based natural language sentences. Sign videos consist of continuous sequences of sign gestures with no clear boundaries in between. Existing SLT models usually represent sign visual features in a frame-wise manner so as to avoid explicitly segmenting the videos into isolated signs. However, these methods neglect the temporal information of signs and lead to substantial ambiguity in translation. In this paper, we explore the temporal semantic structures of sign videos to learn more discriminative features. To this end, we first present a novel sign video segment representation that takes into account multiple temporal granularities, thus alleviating the need for accurate video segmentation. Taking advantage of the proposed segment representation, we develop a novel hierarchical sign video feature learning method via a temporal semantic pyramid network, called TSPNet. Specifically, TSPNet introduces an inter-scale attention to evaluate and enhance local semantic consistency of sign segments and an intra-scale attention to resolve semantic ambiguity by using non-local video context. Experiments show that our TSPNet outperforms the state-of-the-art with significant improvements on the BLEU score (from 9.58 to 13.41) and ROUGE score (from 31.80 to 34.96) on the largest commonly-used SLT dataset. Our implementation is available at https://github.com/verashira/TSPNet.
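
To make the idea of a segment representation with "multiple temporal granularities" more concrete, the sketch below enumerates sliding-window segments of a video at several window widths. It is only an illustration of the general idea, not the authors' implementation: the function name `multi_scale_segments`, the default widths (8, 12, 16), and the stride of 2 are assumptions chosen for the example and are not given in the abstract.

```python
from typing import Dict, List, Tuple


def multi_scale_segments(
    num_frames: int,
    widths: Tuple[int, ...] = (8, 12, 16),  # hypothetical window widths, in frames
    stride: int = 2,                        # hypothetical stride, in frames
) -> Dict[int, List[Tuple[int, int]]]:
    """Map each window width to its list of (start, end) frame ranges.

    `end` is exclusive. A video shorter than a given width yields a single
    segment covering the whole video at that scale.
    """
    segments: Dict[int, List[Tuple[int, int]]] = {}
    for width in widths:
        if width > num_frames:
            # Video too short for this scale: fall back to one full-length segment.
            segments[width] = [(0, num_frames)]
            continue
        # Slide a fixed-width window over the frame indices.
        segments[width] = [
            (start, start + width)
            for start in range(0, num_frames - width + 1, stride)
        ]
    return segments
```

For a 20-frame clip with the defaults above, this yields 7 segments of width 8, 5 of width 12, and 3 of width 16. Each scale covers the same video at a different granularity, which is the kind of input that attention across and within scales could then operate on.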
