Paper title
A milestone for FaaS pipelines; object storage vs VM-driven data exchange
Paper authors
Paper abstract
Serverless functions provide high levels of parallelism, short startup times, and "pay-as-you-go" billing. These attributes make them a natural substrate for data analytics workflows. However, the impossibility of direct communication between functions makes the execution of workflows challenging. The current practice to share intermediate data among functions is through remote object storage (e.g., IBM COS). Contrary to conventional wisdom, the performance of object storage is not well understood. For instance, object storage can even be superior to other simpler approaches like the execution of shuffle stages (e.g., GroupBy) inside powerful VMs to avoid all-to-all transfers between functions. Leveraging a genomics pipeline, we show that object storage is a reasonable choice for data passing when the appropriate number of functions is used in shuffling stages.