MISO: Exploiting Multi-Instance GPU Capability on Multi-Tenant Systems for Machine Learning

Paper title

MISO: Exploiting Multi-Instance GPU Capability on Multi-Tenant Systems for Machine Learning

Authors

Li, Baolin; Patel, Tirthak; Samsi, Siddarth; Gadepally, Vijay; Tiwari, Devesh

Abstract

GPU technology has been improving at an expedited pace in terms of size and performance, empowering HPC and AI/ML researchers to advance the scientific discovery process. However, this also leads to inefficient resource usage, as most GPU workloads, including complicated AI/ML models, are not able to utilize the GPU resources to their fullest extent -- encouraging support for GPU multi-tenancy. We propose MISO, a technique to exploit the Multi-Instance GPU (MIG) capability on the latest NVIDIA datacenter GPUs (e.g., A100, H100) to dynamically partition GPU resources among co-located jobs. MISO's key insight is to use the lightweight, more flexible Multi-Process Service (MPS) capability to predict the best MIG partition allocation for different jobs, without incurring the overhead of implementing them during exploration. Due to its ability to utilize GPU resources more efficiently, MISO achieves 49% and 16% lower average job completion time than the unpartitioned and optimal static GPU partition schemes, respectively.
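The partition-selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a per-job table of predicted relative speed for each MIG slice size is already available (in MISO such predictions are derived from MPS-based profiling; the numbers below are invented for illustration), and it assumes the objective is the sum of predicted speeds. The names `SLICE_SIZES`, `TOTAL_SLICES`, and `best_partition` are hypothetical, and real MIG hardware imposes placement and memory constraints that this sketch ignores.

```python
from itertools import product

# Assumption: an A100-style GPU exposes 7 compute slices, and a job may
# receive an instance of 1, 2, 3, 4, or 7 slices. Real MIG placement
# rules are stricter than "sizes sum to at most 7".
SLICE_SIZES = (1, 2, 3, 4, 7)
TOTAL_SLICES = 7


def best_partition(predicted_speed):
    """Pick one slice size per job to maximize total predicted speed.

    predicted_speed maps job name -> {slice size: relative speed}.
    Returns (assignment dict, total score); assignment is None if no
    combination fits on the GPU.
    """
    jobs = sorted(predicted_speed)
    best_score, best_assignment = float("-inf"), None
    # Exhaustive search is affordable: at most 5**7 combinations.
    for sizes in product(SLICE_SIZES, repeat=len(jobs)):
        if sum(sizes) > TOTAL_SLICES:
            continue  # does not fit on a single GPU
        score = sum(predicted_speed[j][s] for j, s in zip(jobs, sizes))
        if score > best_score:
            best_score, best_assignment = score, dict(zip(jobs, sizes))
    return best_assignment, best_score


# Illustrative (made-up) predictions: job_b saturates on small slices,
# job_a keeps benefiting from more compute.
predicted = {
    "job_a": {1: 0.30, 2: 0.55, 3: 0.75, 4: 0.90, 7: 1.00},
    "job_b": {1: 0.60, 2: 0.85, 3: 0.95, 4: 0.98, 7: 1.00},
}
assignment, score = best_partition(predicted)
```

With these invented numbers the search gives the larger instance to the job that scales better. The point of predicting speeds up front is that candidate partitions can be compared without physically reconfiguring MIG for each one.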
