Paper Title


Tensor Casting: Co-Designing Algorithm-Architecture for Personalized Recommendation Training

Authors

Youngeun Kwon, Yunjae Lee, Minsoo Rhu

Abstract


Personalized recommendation is one of the most widely deployed machine learning (ML) workloads serviced from cloud datacenters. As such, architectural solutions for high-performance recommendation inference have recently been the target of several prior works. Unfortunately, little has been explored or understood regarding the training side of this emerging ML workload. In this paper, we first perform a detailed workload characterization study on training recommendations, root-causing sparse embedding layer training as one of the most significant performance bottlenecks. We then propose our algorithm-architecture co-design called Tensor Casting, which enables the development of a generic accelerator architecture for tensor gather-scatter that encompasses all the key primitives of training embedding layers. When prototyped on a real CPU-GPU system, Tensor Casting provides 1.9-21x improvements in training throughput compared to state-of-the-art approaches.
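To make the bottleneck concrete, the "tensor gather-scatter" primitives the abstract refers to can be sketched in NumPy. This is a minimal illustration of how sparse embedding layers are trained in general, not the paper's implementation; all names and shapes here are hypothetical.

```python
import numpy as np

# Hypothetical tiny embedding table: num_rows sparse-feature IDs, dim-wide vectors.
num_rows, dim = 8, 4
table = np.zeros((num_rows, dim))

# A sparse input batch references a few rows (note the repeated index 2).
indices = np.array([2, 5, 2])

# Forward pass: GATHER the accessed rows of the embedding table.
gathered = table[indices]                    # shape (3, dim)

# Backward pass: SCATTER-ADD the output gradients back into the table.
grad_out = np.ones((3, dim))                 # stand-in upstream gradients
grad_table = np.zeros_like(table)
np.add.at(grad_table, indices, grad_out)     # duplicate indices accumulate

# Row 2 was gathered twice, so its gradient accumulates two contributions;
# these irregular, index-driven memory accesses are what make embedding
# training memory-bound rather than compute-bound.
```

The random-access gather in the forward pass and the scatter-add (with duplicate-index accumulation) in the backward pass are exactly the access patterns a gather-scatter accelerator would target.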
