Paper title
A Prior for Record Linkage Based on Allelic Partitions
Authors
Abstract
In database management, record linkage aims to identify multiple records that correspond to the same individual. This task can be treated as a clustering problem, in which each latent entity is associated with one or more noisy database records. However, in contrast to traditional clustering applications, a large number of clusters, each containing only a few observations, is expected in this context. In this paper, we introduce a new class of prior distributions based on allelic partitions that is especially suited to the small-cluster setting of record linkage. Our approach makes it straightforward to introduce prior information about the cluster size distribution at different scales, and naturally enforces sublinear growth of the maximum cluster size, known as the microclustering property. We also introduce a set of novel microclustering conditions in order to impose further constraints on the cluster sizes a priori. We evaluate the performance of our proposed class of priors using simulated data and three official statistics data sets, and show that our models provide competitive results compared to state-of-the-art microclustering models in the record linkage literature. Moreover, we compare the performance of different loss functions for optimal point estimation of the partitions using decision-theoretic approaches recently proposed in the literature.
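As an illustration of the terminology in the abstract, the following is a minimal sketch, not the authors' implementation. It uses the standard definition of an allelic partition, namely the count of clusters of each size, so that the sizes weighted by their counts sum to the number of records. The function names `allelic_partition` and `max_cluster_fraction` and the toy labels are assumptions made for this example only. The second function reports the largest cluster size as a fraction of the number of records, the quantity that the microclustering property requires to shrink as the data set grows.

```python
from collections import Counter


def allelic_partition(labels):
    """Map each cluster size to the number of clusters of that size.

    `labels` holds one (hypothetical) latent-entity label per record.
    """
    cluster_sizes = Counter(labels)                 # entity label -> number of records
    size_counts = Counter(cluster_sizes.values())   # cluster size -> number of clusters
    return dict(sorted(size_counts.items()))


def max_cluster_fraction(labels):
    """Largest cluster size divided by the total number of records."""
    sizes = Counter(labels).values()
    return max(sizes) / len(labels)


# Toy linkage structure: ten records linked to six entities (illustrative data only).
labels = ["a", "b", "b", "c", "d", "d", "e", "f", "f", "f"]

partition = allelic_partition(labels)
print(partition)                      # three singletons, two pairs, one triple
print(max_cluster_fraction(labels))   # largest cluster relative to n
```

In a record linkage setting, most of the mass in such a summary is expected to sit on very small cluster sizes, which is the regime the abstract describes.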