Paper Title
8 Steps to 3.7 TFLOP/s on NVIDIA V100 GPU: Roofline Analysis and Other Tricks
Authors
Abstract
Performance optimization can be a daunting task, especially as hardware architectures become more and more complex. This paper takes a kernel from the materials science code BerkeleyGW and demonstrates several performance analysis and optimization techniques. Despite challenges such as high register usage, low occupancy, complex data access patterns, and the presence of several long-latency instructions, we have achieved 3.7 TFLOP/s of double-precision performance on an NVIDIA V100 GPU in 8 optimization steps. This is 55% of the theoretical peak of 6.7 TFLOP/s at the nominal frequency of 1312 MHz, and 70% of a more customized peak of 5.3 TFLOP/s, which is based on our 58% FMA ratio. An array of techniques used to analyze this OpenACC kernel and optimize its performance is shown, including the use of the hierarchical Roofline performance model and the performance tool Nsight Compute. This kernel exhibits computational characteristics that are commonly seen in many high-performance computing (HPC) applications, and the techniques shown are expected to be helpful to a general audience of HPC developers and computational scientists as they pursue more performance on NVIDIA GPUs.
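The peak figures quoted in the abstract can be reproduced with a short calculation. The sketch below is not taken from the paper; it assumes the publicly documented V100 configuration of 80 SMs with 32 FP64 units each, and it assumes that the "58% FMA ratio" means 58% of the floating-point instructions are FMAs (2 FLOPs each) while the rest count as 1 FLOP each. The names `peak_tflops`, `SM_COUNT`, and `FP64_UNITS_PER_SM` are illustrative only.

```python
# Sketch: reproduce the peak numbers quoted in the abstract.
# Assumptions (not stated in the abstract): V100 has 80 SMs with
# 32 FP64 units each, and "FMA ratio" is the fraction of FP
# instructions that are FMAs (2 FLOPs) versus non-FMA (1 FLOP).

SM_COUNT = 80             # assumed: streaming multiprocessors on V100
FP64_UNITS_PER_SM = 32    # assumed: double-precision units per SM
CLOCK_GHZ = 1.312         # nominal frequency from the abstract (1312 MHz)


def peak_tflops(fma_ratio: float) -> float:
    """Peak FP64 throughput in TFLOP/s for a given FMA instruction ratio."""
    # Average FLOPs retired per FP instruction under the assumed mix.
    flops_per_instruction = 2.0 * fma_ratio + 1.0 * (1.0 - fma_ratio)
    gflops = SM_COUNT * FP64_UNITS_PER_SM * CLOCK_GHZ * flops_per_instruction
    return gflops / 1000.0


theoretical_peak = peak_tflops(1.0)   # every instruction is an FMA
custom_peak = peak_tflops(0.58)       # 58% FMA ratio from the abstract
achieved = 3.7                        # TFLOP/s reported in the abstract

print(f"theoretical peak: {theoretical_peak:.1f} TFLOP/s")
print(f"customized peak:  {custom_peak:.1f} TFLOP/s")
print(f"achieved / theoretical: {100 * achieved / theoretical_peak:.0f}%")
print(f"achieved / customized:  {100 * achieved / custom_peak:.0f}%")
```

Under these assumptions the calculation yields roughly 6.7 and 5.3 TFLOP/s, and the achieved 3.7 TFLOP/s corresponds to about 55% and 70% of them, consistent with the abstract.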