Paper title
Benchmarking Apache Arrow Flight -- A wire-speed protocol for data transfer, querying and microservices
Paper authors
Paper abstract
Moving structured data between different big data frameworks and/or data warehouses/storage systems often causes significant overhead. In most cases, more than 80\% of the total time spent accessing data is consumed by the serialization/deserialization step. Columnar data formats are gaining popularity in both analytical and transactional databases. Apache Arrow, a unified columnar in-memory data format, promises to provide efficient data storage, access, manipulation and transport. In addition, with the introduction of the Arrow Flight communication capabilities, which are built on top of gRPC, Arrow enables high-performance data transfer over TCP networks. Arrow Flight allows parallel Arrow RecordBatch transfer over networks in a platform- and language-independent way, and offers high performance, parallelism and security based on open-source standards. In this paper, we bring together some recently implemented use cases of Arrow Flight with their benchmarking results. These use cases include bulk Arrow data transfer, querying subsystems and Flight as a microservice integration into different frameworks to show the throughput and scalability results of this protocol. We show that Flight is able to achieve up to 6000 MB/s and 4800 MB/s throughput for DoGet() and DoPut() operations respectively. On Mellanox ConnectX-3 or Connect-IB interconnect nodes, Flight can utilize up to 95\% of the total available bandwidth. Flight is scalable and can use up to half of the available system cores efficiently for bidirectional communication. For query systems like Dremio, Flight is an order of magnitude faster than the ODBC and turbodbc protocols: the Arrow Flight based implementation on Dremio performs 20x and 30x better than turbodbc and ODBC connections respectively.