Paper Title
Fast(er) Reconstruction of Shredded Text Documents via Self-Supervised Deep Asymmetric Metric Learning
Authors
Abstract
The reconstruction of shredded documents consists of arranging the pieces of paper (shreds) so as to restore the original appearance of the documents. This task is particularly relevant to forensic investigation, since documents may contain criminal evidence. As an alternative to the laborious and time-consuming manual process, several researchers have investigated ways to perform automatic digital reconstruction. A central problem in the automatic reconstruction of shredded documents is evaluating the pairwise compatibility of shreds, notably for binary text documents. In this context, deep learning has enabled great progress toward accurate reconstruction of mechanically shredded documents. A sensitive issue, however, is that current deep-model solutions require one inference for every pair of shreds to be evaluated. This work proposes a scalable deep learning approach for measuring pairwise compatibility in which the number of inferences scales linearly (rather than quadratically) with the number of shreds. Instead of predicting compatibility directly, deep models are used to asymmetrically project the raw shred content onto a common metric space in which distance is proportional to the compatibility. Experimental results show that our method achieves accuracy comparable to the state of the art with a speed-up of about 22 times on a test instance with 505 shreds (20 mixed shredded pages from different documents).
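The scaling argument in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes two separate embedding models, one for the right border of a shred and one for the left border, each run once per shred, so 2n inferences in total. The trained deep models are replaced by random linear projections, and every name here (`count_inferences`, `toy_embed`, `pairwise_distances`) is hypothetical. How a distance is converted into a compatibility score is left open.

```python
import numpy as np


def count_inferences(n_shreds):
    """Compare model calls for the two strategies (assumed counting scheme)."""
    # Pairwise strategy: one forward pass for every ordered pair of distinct shreds.
    pairwise = n_shreds * (n_shreds - 1)
    # Embedding strategy (assumption): two models, each applied once per shred.
    embedding = 2 * n_shreds
    return pairwise, embedding


def toy_embed(border_features, weights):
    """Stand-in for a deep model: a plain linear projection into the metric space."""
    return border_features @ weights


def pairwise_distances(right_emb, left_emb):
    """Euclidean distance between right border i and left border j, for all i, j."""
    diff = right_emb[:, None, :] - left_emb[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    # A shred cannot be placed next to itself.
    np.fill_diagonal(dist, np.inf)
    return dist


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, feat_dim, emb_dim = 505, 32, 8

    # Hypothetical raw border features for each shred.
    right_borders = rng.normal(size=(n, feat_dim))
    left_borders = rng.normal(size=(n, feat_dim))

    # Asymmetry: the two borders are projected by different weights.
    w_right = rng.normal(size=(feat_dim, emb_dim))
    w_left = rng.normal(size=(feat_dim, emb_dim))

    dist = pairwise_distances(
        toy_embed(right_borders, w_right),
        toy_embed(left_borders, w_left),
    )
    print(dist.shape, count_inferences(n))
```

Under this counting scheme, the 505-shred instance needs 1,010 model calls instead of 254,520. The remaining quadratic work is the distance matrix itself, which is cheap vectorized arithmetic rather than network inference. The roughly 22-fold speed-up reported in the abstract is a measured end-to-end figure and should not be read off this ratio.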