Paper Title

Efficient and Effective ER with Progressive Blocking

Authors

Sainyam Galhotra, Donatella Firmani, Barna Saha, Divesh Srivastava

Abstract

Blocking is a mechanism to improve the efficiency of Entity Resolution (ER), which aims to quickly prune out all non-matching record pairs. However, depending on the distribution of entity cluster sizes, existing techniques can be either (a) too aggressive, such that they help scale but can adversely affect the ER effectiveness, or (b) too permissive, potentially harming ER efficiency. In this paper, we propose a new methodology of progressive blocking (pBlocking) to enable both efficient and effective ER, which works seamlessly across different entity cluster size distributions. pBlocking is based on the insight that the effectiveness-efficiency trade-off is revealed only when the output of ER starts to become available. Hence, pBlocking leverages partial ER output in a feedback loop to refine the blocking result in a data-driven fashion. Specifically, we bootstrap pBlocking with traditional blocking methods and progressively improve the building and scoring of blocks until we get the desired trade-off, leveraging a limited amount of ER results as guidance at every round. We formally prove that pBlocking converges efficiently ($O(n \log^2 n)$ time complexity, where $n$ is the total number of records). Our experiments show that incorporating partial ER output in a feedback loop can improve the efficiency and effectiveness of blocking by 5x and 60% respectively, improving the overall F-score of the entire ER process by up to 60%.
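The feedback loop the abstract describes (bootstrap with traditional blocking, run ER on a limited sample of pairs, use the partial matches to re-score and refine blocks, repeat until convergence) can be illustrated with a toy Python sketch. This is not the paper's actual algorithm: the first-character blocking key, the match-rate scoring rule, and all function names here are invented for illustration only.

```python
# Toy sketch of the pBlocking feedback-loop idea, NOT the paper's actual
# algorithm: the blocking key, scoring rule, and names are hypothetical.
from collections import defaultdict
from itertools import combinations

def bootstrap_blocks(records):
    """Traditional key-based blocking: group records by their first character."""
    blocks = defaultdict(list)
    for r in records:
        blocks[r[0].lower()].append(r)
    return dict(blocks)

def p_blocking_sketch(records, match, budget=10, rounds=3):
    """Score each block by the match rate of a limited sample of ER
    comparisons, then drop blocks whose sampled pairs never match."""
    blocks = bootstrap_blocks(records)
    matches = set()
    for _ in range(rounds):
        scores = {}
        for key, recs in blocks.items():
            pairs = list(combinations(recs, 2))[:budget]  # limited ER budget
            found = [p for p in pairs if match(*p)]
            matches.update(found)
            scores[key] = len(found) / len(pairs) if pairs else 0.0
        # Keep blocks that yielded at least one match; singletons are kept
        # because they cost no comparisons. Any change triggers another round.
        refined = {k: v for k, v in blocks.items()
                   if scores[k] > 0 or len(v) < 2}
        if refined == blocks:  # no refinement happened: converged
            break
        blocks = refined
    return blocks, matches

records = ["alice smith", "alice  smith", "bob jones", "bobby jones", "carol x"]
same = lambda a, b: a.replace(" ", "") == b.replace(" ", "")
blocks, matches = p_blocking_sketch(records, same)
```

In this toy run, the "b" block is pruned after the sampled ER comparisons reveal no matches in it, while the "a" block survives; the real pBlocking replaces these placeholders with principled block building and scoring, but the data-driven loop structure is the same idea.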
