Multiverse: Dynamic VM Provisioning for Virtualized High Performance Computing Clusters

Paper Title

Multiverse: Dynamic VM Provisioning for Virtualized High Performance Computing Clusters

Authors

Gunasekaran, Jashwant Raj; Cui, Michael; Thinakaran, Prashanth; Simons, Josh; Kandemir, Mahmut Taylan; Das, Chita R.

Abstract

Traditionally, HPC workloads have been deployed on bare-metal clusters, but advances in virtualization have opened the way for these workloads to be deployed on virtualized clusters. However, HPC cluster administrators and providers still face challenges with resource elasticity and large-scale virtual machine (VM) provisioning, because a traditional HPC scheduler and the VM hypervisor (the resource management layer) do not coordinate with each other. This lack of interaction leads to low cluster utilization and low job completion throughput. Furthermore, VM provisioning delays directly affect the overall performance of jobs in the cluster. Hence, there is a need to provision virtualized HPC clusters effectively, so that the physical hardware is best utilized with minimal provisioning overhead. To this end, we propose Multiverse, a VM provisioning framework that dynamically spawns VMs for incoming jobs in a virtualized HPC cluster by integrating the HPC scheduler with the VM resource manager. We have implemented this framework on the Slurm scheduler along with the vSphere VM resource manager. To reduce VM provisioning overhead, we use instant cloning, which shares both disk and memory with the parent VM, in contrast to full VM cloning, which has to boot a new VM from scratch. Measurements with real-world HPC workloads demonstrate that instant cloning is 2.5x faster than full cloning in terms of VM provisioning time. Further, it improves resource utilization by up to 40% and cluster throughput by up to 1.5x compared to full cloning in bursty job arrival scenarios.
