Paper Title
Accelerating Binarized Neural Networks via Bit-Tensor-Cores in Turing GPUs
Paper Authors
Paper Abstract
Despite foreseeing tremendous speedups over conventional deep neural networks, the performance advantage of binarized neural networks (BNNs) has hardly been showcased on general-purpose processors such as CPUs and GPUs. In fact, because a word-based architecture cannot exploit bit-level parallelism, GPUs have been criticized for extremely low utilization (1%) when executing BNNs. Consequently, the latest tensor cores in NVIDIA Turing GPUs have begun to experimentally support bit computation. In this work, we look into this brand-new bit-computation capability and characterize its unique features. We show that the stride of memory access can significantly affect delivered performance, and that a data-format co-design is highly desired for the tensor cores to achieve superior performance over existing software solutions that do not use them. We realize the tensor-core-accelerated BNN design, particularly the major functions for fully-connected and convolution layers -- bit matrix multiplication and bit convolution. Evaluations on two NVIDIA Turing GPUs show that, with ResNet-18, our BTC-BNN design can process ImageNet at a rate of 5.6K images per second, 77% faster than the state of the art. Our BNN approach is released at https://github.com/pnnl/TCBNN.
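To make the core kernel concrete: a bit matrix multiplication between {+1, -1} matrices reduces to XNOR plus population count over bit-packed words, which is exactly the operation the Turing bit tensor cores accelerate in hardware. Below is a minimal NumPy sketch of that formulation; it is illustrative only (not the paper's CUDA code), and the names `pack_signs` and `bit_gemm` are invented for this example.

```python
import numpy as np

def pack_signs(x):
    """Pack a {+1,-1} matrix into uint64 words along rows (+1 -> bit 1)."""
    bits = (x > 0).astype(np.uint8)
    pad = (-bits.shape[1]) % 64          # pad columns to a multiple of 64 bits
    bits = np.pad(bits, ((0, 0), (0, pad)))
    packed = np.packbits(bits, axis=1)   # 8 bits per byte
    return packed.view(np.uint64)        # 8 bytes per word

def popcount64(words):
    """Total population count of an array of uint64 words."""
    return int(np.unpackbits(words.view(np.uint8), axis=None).sum())

def bit_gemm(Xp, Yp, n):
    """XNOR-popcount GEMM on packed operands.

    Xp: (M, W) packed rows of X; Yp: (N, W) packed rows of Y;
    n: true inner dimension before padding.
    Equivalent to X @ Y.T for {+1,-1}-valued X (M, n) and Y (N, n).
    """
    M, N, W = Xp.shape[0], Yp.shape[0], Xp.shape[1]
    pad = W * 64 - n
    out = np.empty((M, N), dtype=np.int64)
    for i in range(M):
        for j in range(N):
            same = ~(Xp[i] ^ Yp[j])             # bit = 1 where signs agree
            matches = popcount64(same) - pad     # padding bits always agree; drop them
            out[i, j] = 2 * matches - n          # agree -> +1, disagree -> -1
    return out
```

The dot product of two {+1, -1} vectors of length n equals (#agreements - #disagreements) = 2 * popcount(XNOR) - n, so an n-wide multiply-accumulate collapses to a handful of word-level bit operations; this is the bit-level parallelism a word-based GPU pipeline cannot reach without such packing.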