Paper Title

Training Strategies to Handle Missing Modalities for Audio-Visual Expression Recognition

Authors

Srinivas Parthasarathy, Shiva Sundaram

Abstract

Automatic audio-visual expression recognition can play an important role in communication services such as tele-health, VoIP calls, and human-machine interaction. The accuracy of audio-visual expression recognition can benefit from the interplay between the two modalities. However, most audio-visual expression recognition systems, trained in ideal conditions, fail to generalize to real-world scenarios where either the audio or the visual modality may be missing for a number of reasons, such as limited bandwidth, the interactors' orientation, or caller-initiated muting. This paper studies the performance of a state-of-the-art transformer when one of the modalities is missing. We conduct ablation studies to evaluate the model in the absence of either modality. Further, we propose a strategy that randomly ablates visual inputs during training, at either the clip or the frame level, to mimic real-world scenarios. Results on in-the-wild data indicate significantly better generalization in the proposed models trained on missing cues, with gains of up to 17% for frame-level ablations, showing that these training strategies cope better with the loss of an input modality.
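The abstract's core idea, randomly ablating the visual stream during training at either the clip or the frame level, can be illustrated with a minimal sketch. The function name, feature layout (a list of per-frame feature vectors), the zero-filling convention for missing inputs, and the drop probability `p` are all illustrative assumptions, not details from the paper:

```python
import random

def ablate_visual(frames, p=0.5, level="frame", rng=None):
    """Randomly zero out visual features to simulate a missing modality.

    frames: list of per-frame feature vectors (list of lists of floats).
    p:      probability of dropping (the clip, or each frame independently).
    level:  "clip" drops all frames together; "frame" drops frames
            independently. Both the zero-fill convention and `p` are
            assumptions for illustration.
    """
    rng = rng or random.Random()
    if level == "clip":
        # Clip-level ablation: the whole visual stream is either kept or dropped.
        if rng.random() < p:
            return [[0.0] * len(f) for f in frames]
        return [list(f) for f in frames]
    # Frame-level ablation: each frame is dropped independently.
    return [[0.0] * len(f) if rng.random() < p else list(f) for f in frames]
```

During training, such a function would be applied to each batch before the visual features enter the transformer, so the model learns to rely on audio when visual cues vanish; at test time no ablation is applied.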
