论文标题

工程(共享内存)排序算法

Engineering In-place (Shared-memory) Sorting Algorithms

论文作者

Axtmann, Michael, Witt, Sascha, Ferizovic, Daniel, Sanders, Peter

论文摘要

我们提出了代表广泛的输入大小,输入分布,数据类型和计算机的最快已知技术的排序算法。速度优势的一部分是由于该功能在就位工作。以前,现场功能通常暗示性能处罚。我们的主要算法贡献是对就地数据分布的块方法,可证明可以缓存有效。我们还将这种方法并行化,考虑了动态负载平衡和内存位置。我们的基于比较的算法,现场超级样本(IPS $^4 $ O),将此技术与无分支决策树相结合。通过考虑具有许多均等元素的案例并通过动态调整分布度,我们获得了一种高度鲁棒的算法,该算法的表现优于最佳的基于基于平行比较的竞争者几乎三倍。 IPS $^4 $ o还胜过原地或不可接地,平行或顺序设置的最佳基于比较的竞争对手。 IPS $^4 $ o甚至在多种情况下都胜过最佳整数排序算法。在许多剩余情况下(通常涉及接近均匀的输入分布,小键或顺序设置),我们的新的原位放射线分落器证明是最好的算法。从某种意义上说,声称具有“最佳”排序算法可以在许多不可能是真实的论文中找到。因此,我们的结论基于广泛的实验,涉及21种最先进的排序代码,6种数据类型,10个输入分布,4台机器,4个记忆分配策略和输入尺寸的大部分跨产品的很大一部分。这证实了我们的算法的出色表现,同时揭示了在相关出版物中报告的一组测量值之外的许多竞争对手中的重大表现问题。

We present sorting algorithms that represent the fastest known techniques for a wide range of input sizes, input distributions, data types, and machines. A part of the speed advantage is due to the feature to work in-place. Previously, the in-place feature often implied performance penalties. Our main algorithmic contribution is a blockwise approach to in-place data distribution that is provably cache-efficient. We also parallelize this approach taking dynamic load balancing and memory locality into account. Our comparison-based algorithm, In-place Superscalar Samplesort (IPS$^4$o), combines this technique with branchless decision trees. By taking cases with many equal elements into account and by adapting the distribution degree dynamically, we obtain a highly robust algorithm that outperforms the best in-place parallel comparison-based competitor by almost a factor of three. IPS$^4$o also outperforms the best comparison-based competitors in the in-place or not in-place, parallel or sequential settings. IPS$^4$o even outperforms the best integer sorting algorithms in a wide range of situations. In many of the remaining cases (often involving near-uniform input distributions, small keys, or a sequential setting), our new in-place radix sorter turns out to be the best algorithm. Claims to have the, in some sense, "best" sorting algorithm can be found in many papers which cannot all be true. Therefore, we base our conclusions on extensive experiments involving a large part of the cross product of 21 state-of-the-art sorting codes, 6 data types, 10 input distributions, 4 machines, 4 memory allocation strategies, and input sizes varying over 7 orders of magnitude. This confirms the robust performance of our algorithms while revealing major performance problems in many competitors outside the concrete set of measurements reported in the associated publications.

扫码加入交流群

加入微信交流群

微信交流群二维码

扫码加入学术交流群,获取更多资源