Paper Title
Semantic Annotation for Tabular Data
Paper Authors
Paper Abstract
Detecting the semantic concepts of columns in tabular data is of particular interest to many applications, ranging from data integration, cleaning, and search to feature engineering and model building in machine learning. Recently, several works have proposed supervised learning-based or heuristic pattern-based approaches to semantic type annotation. Both have shortcomings that prevent them from generalizing over a large number of concepts or examples. Many neural network-based methods also present scalability issues. Additionally, none of the known methods works well for numerical data. We propose $C^2$, a column-to-concept mapper based on a maximum likelihood estimation approach through ensembles. It effectively utilizes vast amounts of openly available, albeit somewhat noisy, table corpora, in addition to two popular knowledge graphs, to perform effective and efficient concept prediction for structured data. We demonstrate the effectiveness of $C^2$ over available techniques on 9 datasets, the most comprehensive comparison on this topic so far.