Paper Title

Learning Large-scale Universal User Representation with Sparse Mixture of Experts

Paper Authors

Caigao Jiang, Siqiao Xue, James Zhang, Lingyue Liu, Zhibo Zhu, Hongyan Hao

Paper Abstract

Learning user sequence behaviour embeddings is sophisticated and challenging due to the complicated feature interactions over time and the high dimensionality of user features. Recent emerging foundation models, e.g., BERT and its variants, have encouraged a large body of researchers to investigate this field. However, unlike natural language processing (NLP) tasks, the parameters of a user behaviour model come mostly from the user embedding layer, which makes most existing works fail in training a universal user embedding at large scale. Furthermore, user representations are learned from multiple downstream tasks, and past research works do not address the seesaw phenomenon. In this paper, we propose SUPERMOE, a generic framework to obtain high-quality user representations from multiple tasks. Specifically, the user behaviour sequences are encoded by an MoE transformer, and we can thus increase the model capacity to billions of parameters, or even to trillions of parameters. In order to deal with the seesaw phenomenon when learning across multiple tasks, we design a new loss function with task indicators. We perform extensive offline experiments on public datasets and online experiments on private real-world business scenarios. Our approach achieves the best performance over state-of-the-art models, and the results demonstrate the effectiveness of our framework.
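The abstract names two technical components: a sparse Mixture-of-Experts (MoE) transformer that encodes user behaviour sequences, and a loss function with task indicators that mitigates the seesaw effect in multi-task learning. The paper's own implementation is not shown here, so the following is only a minimal PyTorch sketch of what such components could look like: a top-k routed MoE feed-forward block and a per-task masked loss. All class and function names, dimensions, and the exact loss formulation are illustrative assumptions, not the authors' code.

```python
# Illustrative sketch only (assumed names and shapes, not the paper's code):
# a top-k sparse MoE feed-forward block and a task-indicator masked loss.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseMoEFeedForward(nn.Module):
    """Replaces a dense transformer FFN with top-k routed experts."""

    def __init__(self, d_model=256, d_hidden=1024, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(d_model, num_experts)  # router producing expert scores
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model)
            )
            for _ in range(num_experts)
        )

    def forward(self, x):  # x: (batch, seq_len, d_model)
        scores = self.gate(x)                             # (batch, seq_len, num_experts)
        topk_val, topk_idx = scores.topk(self.top_k, -1)  # route each token to k experts
        weights = F.softmax(topk_val, dim=-1)             # renormalise over chosen experts
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            # gate weight of expert e for each token (zero where not routed)
            sel = (topk_idx == e).float() * weights       # (batch, seq_len, top_k)
            w = sel.sum(dim=-1, keepdim=True)             # (batch, seq_len, 1)
            if w.any():
                out = out + w * expert(x)
        return out


def multitask_loss_with_indicators(logits, labels, task_ids, num_tasks):
    """Per-task masked BCE: each sample only contributes to the loss of its own
    task, then task losses are averaged. One simple way to use task indicators
    against the seesaw effect; the paper's exact formulation may differ."""
    per_sample = F.binary_cross_entropy_with_logits(logits, labels, reduction="none")
    task_losses = []
    for t in range(num_tasks):
        mask = (task_ids == t).float()
        if mask.sum() > 0:
            task_losses.append((per_sample * mask).sum() / mask.sum())
    return torch.stack(task_losses).mean()
```

In a full model, a block like SparseMoEFeedForward would sit inside each transformer layer after self-attention over the embedded user behaviour sequence; because only the routed experts are active per token, parameter count can grow far faster than per-token compute, which is what makes billion-parameter user models feasible.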
