Paper Title
High-Modality Multimodal Transformer: Quantifying Modality & Interaction Heterogeneity for High-Modality Representation Learning
Paper Authors
Paper Abstract
Many real-world problems are inherently multimodal, from spoken language, gestures, and paralinguistics humans use to communicate, to force, proprioception, and visual sensors on robots. While there has been an explosion of interest in multimodal learning, these methods are focused on a small set of modalities primarily in language, vision, and audio. In order to accelerate generalization towards diverse and understudied modalities, this paper studies efficient representation learning for high-modality scenarios involving a large set of diverse modalities. Since adding new models for every new modality becomes prohibitively expensive, a critical technical challenge is heterogeneity quantification: how can we measure which modalities encode similar information and interactions in order to permit parameter sharing with previous modalities? This paper proposes two new information theoretic metrics for heterogeneity quantification: (1) modality heterogeneity studies how similar 2 modalities {X1,X2} are by measuring how much information can be transferred from X1 to X2, while (2) interaction heterogeneity studies how similarly pairs of modalities {X1,X2}, {X3,X4} interact by measuring how much information can be transferred from fusing {X1,X2} to {X3,X4}. We show the importance of these 2 proposed metrics as a way to automatically prioritize the fusion of modalities that contain unique information or interactions. The result is a single model, HighMMT, that scales up to 10 modalities (text, image, audio, video, sensors, proprioception, speech, time-series, sets, and tables) and 15 tasks from 5 research areas. Not only does HighMMT outperform prior methods on the tradeoff between performance and efficiency, it also demonstrates a crucial scaling behavior: performance continues to improve with each modality added, and it transfers to entirely new modalities and tasks during fine-tuning.
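To make the idea of transfer-based heterogeneity quantification more concrete, below is a minimal, self-contained sketch. It is not the paper's implementation: the modality names (audio, video, table), the synthetic data generator, and the use of scikit-learn logistic regression as a stand-in for learned unimodal encoders are all illustrative assumptions. The sketch measures how much classification accuracy is lost when a classifier trained on one modality is applied to another; a small gap suggests the two modalities encode similar information and are better candidates for parameter sharing.

```python
# Hedged sketch: a toy version of transfer-based "modality heterogeneity".
# Modality names, dimensions, and the scikit-learn classifier are illustrative
# assumptions, not HighMMT's actual encoders or training procedure.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

def make_modality(y, proj, noise=0.5):
    """Simulate one modality: a fixed projection of the label plus Gaussian noise."""
    latent = np.eye(proj.shape[0])[y]  # one-hot latent per example
    return latent @ proj + noise * rng.standard_normal((len(y), proj.shape[1]))

n, n_classes, dim = 2000, 4, 16
y = rng.integers(0, n_classes, size=n)

# Two "similar" modalities share a projection; a third uses its own.
proj_shared = rng.standard_normal((n_classes, dim))
proj_other = rng.standard_normal((n_classes, dim))
modalities = {
    "audio": make_modality(y, proj_shared),
    "video": make_modality(y, proj_shared),
    "table": make_modality(y, proj_other),
}

def transfer_gap(src, tgt):
    """Accuracy lost when a classifier trained on `src` is applied to `tgt`.

    A small gap means the two modalities encode similar information
    (low heterogeneity); a large gap means they are heterogeneous.
    """
    Xs, Xt = modalities[src], modalities[tgt]
    Xt_tr, Xt_te, yt_tr, yt_te = train_test_split(Xt, y, random_state=0)
    direct = LogisticRegression(max_iter=1000).fit(Xt_tr, yt_tr).score(Xt_te, yt_te)
    transferred = LogisticRegression(max_iter=1000).fit(Xs, y).score(Xt_te, yt_te)
    return direct - transferred

for a in modalities:
    for b in modalities:
        if a != b:
            print(f"heterogeneity({a} -> {b}) = {transfer_gap(a, b):+.3f}")
```

In this toy setup, the audio-to-video gap stays small because both are generated from the same projection, while gaps involving the table modality are large. Interaction heterogeneity could be estimated analogously by training on a fused pair of modalities (e.g. concatenated features of {X1, X2}) and measuring transfer to another fused pair {X3, X4}.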