Paper Title
A Big Data Lake for Multilevel Streaming Analytics
Authors
Abstract
Large organizations are seeking new architectures and scalable platforms to handle the data management challenges created by explosive data growth on a scale rarely seen in the past. These challenges stem largely from the availability of high-velocity streaming data from many sources in multiple formats. This shift in the data paradigm has led to the emergence of new data analytics and management architectures. This paper focuses on storing high-volume, high-velocity, and high-variety data in its raw format in a data storage architecture called a data lake. First, we present our study of the limitations of traditional data warehouses in handling recent changes in data paradigms. We then discuss and compare open source and commercial platforms that can be used to develop a data lake. Next, we describe our end-to-end data lake design and implementation approach using the Hadoop Distributed File System (HDFS) on the Hortonworks Data Platform (HDP). Finally, we present a real-world data lake development use case for data stream ingestion, staging, and multilevel streaming analytics that combines structured and unstructured data. This study can serve as a guide for individuals or organizations planning to implement a data lake solution for their use cases.