Title
Technical Report: Developing a Working Data Hub
Authors
Abstract
Data forms a key component of any enterprise. The need for high-quality, easily accessible data is further amplified in organizations that wish to leverage machine learning or artificial intelligence in their operations. To this end, many organizations are building resources that manage heterogeneous data, provide end users with an organization-wide view of available data, and act as a centralized repository for data owned or collected by the organization. Very broadly, we refer to this class of techniques as a "data hub." While there is no clear definition of what constitutes a data hub, some key characteristics include: a data catalog; links to data sets, to owners of data sets, or to a centralized data repository; a basic ability to serve/visualize data sets; access control policies that ensure secure data access and respect the policies of data owners; and computing capabilities tied to the data hub infrastructure. Of course, developing such a data hub entails numerous challenges. This document provides background on databases and data management, and outlines best practices and recommendations for developing and deploying a working data hub.