使用独家阅读和写入内存的本地平行分区算法

论文标题

使用独家阅读和写入内存的本地平行分区算法

In-Place Parallel-Partition Algorithms using Exclusive-Read-and-Write Memory

论文作者

Kuszmaul, William, Westover, Alek

论文摘要

我们为分区问题提出了一种现场算法，该算法具有线性工作和多毛体跨度。该算法仅使用独家读/写共享变量，并且可以使用并行循环实现，而无需任何其他并发考虑（即，算法是EREW）。该算法的一个关键特征是它表现出可证明的最佳缓存行为，最多可达小阶因子。我们还为具有线性工作和跨度$ o（\ log n \ cdot \ log \ log \ log n）$的分区问题提供了第二个原位EREW算法，该算法在$ O（\ log \ log \ log \ log n）$ fimers the Optimal跨度之内。 By using this low-span algorithm as a subroutine within the cache-friendly algorithm, we are able to obtain a single EREW algorithm that combines their theoretical guarantees: the algorithm achieves span $O(\log n \cdot \log \log n)$ and optimal cache behavior.直接的结果，我们还使用工作$ O（n \ log n）$，span $ o（\ log^2 n \ cdot \ cdot \ log \ log \ log n）$获得了一个就地的EREW QuickSort算法。尽管用于平行分区的标准EREW算法是在大量内核上绑定的内存式宽度，但我们的缓存友好算法能够通过避免记忆带宽瓶颈来实践实践中实现近乎理想的缩放。该算法的性能与Francis，Pannan，Frias和Petit的障碍算法的性能相当，这是以前的并行EREW分类算法的最新技术，但缺乏理论上的保证，对其跨性别和cache行为缺乏。

We present an in-place algorithm for the partition problem that has linear work and polylogarithmic span. The algorithm uses only exclusive read/write shared variables, and can be implemented using parallel-for-loops without any additional concurrency considerations (i.e., the algorithm is EREW). A key feature of the algorithm is that it exhibits provably optimal cache behavior, up to small-order factors. We also present a second in-place EREW algorithm for the partition problem that has linear work and span $O(\log n \cdot \log \log n)$, which is within an $O(\log\log n)$ factor of the optimal span. By using this low-span algorithm as a subroutine within the cache-friendly algorithm, we are able to obtain a single EREW algorithm that combines their theoretical guarantees: the algorithm achieves span $O(\log n \cdot \log \log n)$ and optimal cache behavior. As an immediate consequence, we also get an in-place EREW quicksort algorithm with work $O(n \log n)$, span $O(\log^2 n \cdot \log \log n)$. Whereas the standard EREW algorithm for parallel partitioning is memory-bandwidth bound on large numbers of cores, our cache-friendly algorithm is able to achieve near-ideal scaling in practice by avoiding the memory-bandwidth bottleneck. The algorithm's performance is comparable to that of the Blocked Strided Algorithm of Francis, Pannan, Frias, and Petit, which is the previous state-of-the art for parallel EREW sorting algorithms, but which lacks theoretical guarantees on its span and cache behavior.

下载PDF全文

下载文献需遵守相关版权规定

论文标题