Towards Interactive, Adaptive and Result-aware Big Data Analytics

Title

Towards Interactive, Adaptive and Result-aware Big Data Analytics

Author

Kumar, Avinash

Abstract

As data volumes grow across applications, the analysis of large amounts of data is becoming increasingly important. Big data processing frameworks such as Apache Hadoop, Apache AsterixDB, and Apache Spark have been built to meet this demand. A common objective of these traditional cluster-based frameworks is high performance, which often means low end-to-end execution time or latency. The widespread adoption of data analytics has led to calls to improve the traditional ways of big data processing. There have been demands to make the analytics process more interactive and adaptive, especially for long-running jobs. The importance of initial results in the iterative process of data wrangling has motivated a result-aware approach to big data analytics. This dissertation is motivated by these calls for improvement in data processing and by experience gained over the past few years of working on the Texera project, a collaborative data analytics service being developed at UC Irvine. The dissertation consists mainly of three parts. The first part covers the design of the Amber engine, which serves as the backend data processing framework for the Texera service. The second part presents Reshape, an adaptive and result-aware skew-handling framework. Reshape uses fast control messages to implement iterative skew mitigation techniques for a wide variety of operators. The mitigation techniques in Reshape are also analyzed from the perspective of their effects on the results shown to the user. The last part presents Maestro, a result-aware workflow scheduling framework. It discusses how to schedule a workflow for execution on computing clusters and how to make result-aware decisions while doing so. This work improves the data analytics process by bringing interactivity, adaptivity, and result-awareness into it.
