Paper Title

Revealing the Semantics of Data Wrangling Scripts With COMANTICS

Authors

Kai Xiong, Zhongsu Luo, Siwei Fu, Yongheng Wang, Mingliang Xu, Yingcai Wu

Abstract

Data workers often need to understand the semantics of data wrangling scripts in scenarios such as code debugging, reuse, and maintenance. However, understanding these scripts is challenging for novice data workers because of the variety of programming languages, functions, and parameters involved. Based on the observation that the differences between input and output tables are strongly related to the type of data transformation, we outline a design space of 103 characteristics that describe table differences. We then develop COMANTICS, a three-step pipeline that automatically detects the semantics of data transformation scripts. The first step detects the table differences produced by each line of wrangling code. The second step combines a characteristic-based component with a Siamese convolutional neural network-based component to detect transformation types. The third step derives the parameters of each data transformation using a "slot filling" strategy. We design experiments to evaluate the performance of COMANTICS and assess its flexibility through three example applications in different domains.
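As a rough illustration of the three-step idea (table differences, then transformation type, then parameters), the following toy sketch may help. It is not the COMANTICS implementation. The handful of difference features, the rule-based classifier, the transformation labels, and all function names are assumptions made for this example. The 103-characteristic design space and the Siamese CNN component are not reproduced here.

```python
from dataclasses import dataclass

# Hypothetical table representation: column name -> list of values,
# with every column having the same length.
Table = dict


@dataclass
class TableDiff:
    added_columns: list
    removed_columns: list
    row_delta: int         # output row count minus input row count
    rows_reordered: bool   # same rows on shared columns, different order


def num_rows(table: Table) -> int:
    return len(next(iter(table.values()), []))


def rows(table: Table, columns: list) -> list:
    # Row tuples restricted to the given columns.
    return list(zip(*(table[c] for c in columns)))


def diff_tables(before: Table, after: Table) -> TableDiff:
    """Step 1 (toy version): describe how the output table differs."""
    added = [c for c in after if c not in before]
    removed = [c for c in before if c not in after]
    shared = [c for c in before if c in after]
    b_rows, a_rows = rows(before, shared), rows(after, shared)
    reordered = (
        b_rows != a_rows
        and sorted(b_rows, key=repr) == sorted(a_rows, key=repr)
    )
    return TableDiff(added, removed, num_rows(after) - num_rows(before), reordered)


def classify(diff: TableDiff) -> str:
    """Step 2 (toy version): hand-written rules stand in for the classifier."""
    if diff.added_columns and not diff.removed_columns:
        return "create_columns"
    if diff.removed_columns and not diff.added_columns:
        return "delete_columns"
    if diff.row_delta < 0:
        return "delete_rows"
    if diff.rows_reordered:
        return "sort_rows"
    return "unknown"


def fill_slots(kind: str, diff: TableDiff) -> dict:
    """Step 3 (toy version): fill the parameter slots of the detected type."""
    if kind == "create_columns":
        return {"columns": diff.added_columns}
    if kind == "delete_columns":
        return {"columns": diff.removed_columns}
    if kind == "delete_rows":
        return {"rows_removed": -diff.row_delta}
    return {}


def describe(before: Table, after: Table):
    diff = diff_tables(before, after)
    kind = classify(diff)
    return kind, fill_slots(kind, diff)
```

For instance, calling `describe` on a table with columns `name` and `score` and an output table that keeps only `name` would report a column deletion with `score` in its parameter slot. A real system would need far richer difference features to separate transformations that these few rules cannot tell apart.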
