Paper Title
Boosting Multi-Modal E-commerce Attribute Value Extraction via Unified Learning Scheme and Dynamic Range Minimization
Paper Authors
Abstract
With the prosperity of the e-commerce industry, various modalities, e.g., vision and language, are utilized to describe product items. Understanding such diversified data is an enormous challenge, especially extracting attribute-value pairs from text sequences with the aid of helpful image regions. Although a series of previous works have been dedicated to this task, several seldom-investigated obstacles remain that hinder further improvement: 1) Parameters from upstream single-modal pretraining are inadequately utilized, without proper joint fine-tuning on the downstream multi-modal task. 2) To select descriptive parts of images, a simple late fusion is widely applied, regardless of the prior knowledge that language-related information should be encoded into a common linguistic embedding space by stronger encoders. 3) Owing to the diversity across products, their attribute sets tend to vary greatly, but current approaches predict over an unnecessarily maximal range, leading to more potential false positives. To address these issues, we propose in this paper a novel approach that boosts multi-modal e-commerce attribute value extraction via a unified learning scheme and dynamic range minimization: 1) First, a unified scheme is designed to jointly train the multi-modal task with pretrained single-modal parameters. 2) Second, a text-guided information range minimization method is proposed to adaptively encode the descriptive parts of each modality into an identical space with a powerful pretrained linguistic model. 3) Moreover, a prototype-guided attribute range minimization method is proposed to first determine the proper attribute set of the current product, and then select prototypes to guide the prediction of the chosen attributes. Experiments on popular multi-modal e-commerce benchmarks show that our approach achieves superior performance over other state-of-the-art techniques.