Paper Title


Accelerating Recommender Systems via Hardware "scale-in"

Authors

Suresh Krishna, Ravi Krishna

Abstract

In today's era of "scale-out", this paper makes the case that a specialized hardware architecture based on "scale-in"--placing as many specialized processors as possible along with their memory systems and interconnect links within one or two boards in a rack--would offer the potential to boost large recommender system throughput by 12-62x for inference and 12-45x for training compared to the DGX-2 state-of-the-art AI platform, while minimizing the performance impact of distributing large models across multiple processors. By analyzing Facebook's representative model--Deep Learning Recommendation Model (DLRM)--from a hardware architecture perspective, we quantify the impact on throughput of hardware parameters such as memory system design, collective communications latency and bandwidth, and interconnect topology. By focusing on conditions that stress hardware, our analysis reveals limitations of existing AI accelerators and hardware platforms.
