Paper Title

Revisiting Huffman Coding: Toward Extreme Performance on Modern GPU Architectures

Authors

Jiannan Tian, Cody Rivera, Sheng Di, Jieyang Chen, Xin Liang, Dingwen Tao, Franck Cappello

Abstract

Today's high-performance computing (HPC) applications produce vast volumes of data that are challenging to store and transfer efficiently during execution, so data compression is becoming a critical technique for mitigating the storage burden and data-movement cost. Huffman coding is arguably the most efficient entropy coding algorithm in information theory, and it appears as a fundamental step in many modern compression algorithms such as DEFLATE. At the same time, today's HPC applications rely more and more on accelerators such as GPUs on supercomputers, yet Huffman encoding suffers from low throughput on GPUs, creating a significant bottleneck in the overall data processing. In this paper, we propose and implement an efficient Huffman encoding approach for modern GPU architectures that addresses two key challenges: (1) how to parallelize the entire Huffman encoding algorithm, including codebook construction, and (2) how to fully utilize the high memory bandwidth of modern GPU architectures. The contributions are four-fold. (1) We develop an efficient parallel codebook construction on GPUs that scales effectively with the number of input symbols. (2) We propose a novel reduction-based encoding scheme that efficiently merges codewords on GPUs. (3) We optimize overall GPU performance by leveraging state-of-the-art CUDA APIs such as Cooperative Groups. (4) We evaluate our Huffman encoder thoroughly using six real-world application datasets on two advanced GPUs and compare it with our own multi-threaded Huffman encoder. Experiments show that our solution improves the encoding throughput by up to 5.0X and 6.8X on an NVIDIA RTX 5000 and a V100, respectively, over the state-of-the-art GPU Huffman encoder, and by up to 3.3X over the multi-threaded encoder on two 28-core Xeon Platinum 8280 CPUs.
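Contribution (2), the reduction-based encoding scheme, merges per-symbol codewords in parallel instead of emitting them serially. The CUDA sketch below is only an illustration of that general idea under simplifying assumptions, not the authors' implementation: `Codeword`, `merge`, and `encode_block` are hypothetical names, and each block's merged bit string is assumed to fit in a single 64-bit word, which the real encoder does not require.

```cuda
#include <cstdint>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical codeword record: right-aligned code bits plus their length.
struct Codeword {
    uint64_t bits;
    uint32_t len;
};

// Concatenate two bit strings: a's bits followed by b's bits.
__device__ Codeword merge(Codeword a, Codeword b)
{
    return Codeword{(a.bits << b.len) | b.bits, a.len + b.len};
}

// One thread block encodes BLOCK consecutive symbols into one packed word.
template <int BLOCK>
__global__ void encode_block(const uint8_t* in, const Codeword* codebook,
                             uint64_t* out_bits, uint32_t* out_len)
{
    __shared__ Codeword buf[BLOCK];
    const int t = threadIdx.x;
    buf[t] = codebook[in[blockIdx.x * BLOCK + t]];  // per-symbol table lookup
    __syncthreads();

    // Reduction-style merge: each step concatenates adjacent partial bit
    // strings, so after log2(BLOCK) steps buf[0] holds the whole block's code.
    for (int stride = 1; stride < BLOCK; stride <<= 1) {
        int idx = 2 * stride * t;
        if (idx + stride < BLOCK) buf[idx] = merge(buf[idx], buf[idx + stride]);
        __syncthreads();
    }
    if (t == 0) {
        out_bits[blockIdx.x] = buf[0].bits;
        out_len[blockIdx.x]  = buf[0].len;
    }
}

int main()
{
    constexpr int BLOCK = 16;
    // Toy prefix-free codebook for a 3-symbol alphabet: 0->"0", 1->"10", 2->"11".
    Codeword h_book[3] = {{0x0, 1}, {0x2, 2}, {0x3, 2}};
    uint8_t  h_in[BLOCK] = {0, 1, 2, 0, 0, 1, 0, 2, 1, 0, 0, 0, 2, 1, 0, 1};

    uint8_t*  d_in;   Codeword* d_book;
    uint64_t* d_bits; uint32_t* d_len;
    cudaMalloc(&d_in, sizeof(h_in));
    cudaMalloc(&d_book, sizeof(h_book));
    cudaMalloc(&d_bits, sizeof(uint64_t));
    cudaMalloc(&d_len, sizeof(uint32_t));
    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);
    cudaMemcpy(d_book, h_book, sizeof(h_book), cudaMemcpyHostToDevice);

    encode_block<BLOCK><<<1, BLOCK>>>(d_in, d_book, d_bits, d_len);

    uint64_t bits; uint32_t len;
    cudaMemcpy(&bits, d_bits, sizeof(bits), cudaMemcpyDeviceToHost);
    cudaMemcpy(&len, d_len, sizeof(len), cudaMemcpyDeviceToHost);
    printf("packed %u bits: 0x%llx\n", len, (unsigned long long)bits);

    cudaFree(d_in); cudaFree(d_book); cudaFree(d_bits); cudaFree(d_len);
    return 0;
}
```

The interleaved-addressing reduction merges only adjacent partial bit strings, so the codewords keep their original left-to-right order and the packed stream remains decodable.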

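Contribution (3) mentions Cooperative Groups, the CUDA API family that, among other things, provides a grid-wide barrier so that dependent phases can stay inside a single kernel launch. The toy program below demonstrates only that mechanism; the phase bodies are placeholders, not any part of the paper's encoder. Grid-wide synchronization requires a cooperative launch and relocatable device code, e.g. `nvcc -arch=sm_70 -rdc=true`.

```cuda
#include <cstdio>
#include <cooperative_groups.h>
#include <cuda_runtime.h>

namespace cg = cooperative_groups;

// Two dependent phases in a single kernel: phase 1 writes one partial result
// per block, phase 2 reads results produced by *other* blocks. The grid-wide
// barrier from Cooperative Groups replaces a second kernel launch.
__global__ void two_phase(int* partial, int* out)
{
    cg::grid_group grid = cg::this_grid();

    // Phase 1: one value per block (a stand-in for "encode this chunk").
    if (threadIdx.x == 0) partial[blockIdx.x] = blockIdx.x + 1;

    grid.sync();  // every block finishes phase 1 before any block starts phase 2

    // Phase 2: block 0 combines everything (a stand-in for "merge the chunks").
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        int sum = 0;
        for (unsigned b = 0; b < gridDim.x; ++b) sum += partial[b];
        *out = sum;
    }
}

int main()
{
    const int nblocks = 8, nthreads = 64;
    int *d_partial, *d_out;
    cudaMalloc(&d_partial, nblocks * sizeof(int));
    cudaMalloc(&d_out, sizeof(int));

    // Cooperative kernels must go through this launch API so the runtime can
    // guarantee all blocks are resident, which grid.sync() requires.
    void* args[] = {&d_partial, &d_out};
    cudaLaunchCooperativeKernel((void*)two_phase, dim3(nblocks), dim3(nthreads),
                                args, 0, nullptr);

    int result = 0;
    cudaMemcpy(&result, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    printf("combined result: %d (expected %d)\n", result, nblocks * (nblocks + 1) / 2);

    cudaFree(d_partial); cudaFree(d_out);
    return 0;
}
```

A single cooperative launch avoids intermediate kernel-launch overhead between dependent stages, which is one common reason to reach for this API on throughput-sensitive kernels.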