Identifying Documents In-Scope of a Collection from Web Archives

Paper title

Identifying Documents In-Scope of a Collection from Web Archives

Authors

Krutarth Patel, Cornelia Caragea, Mark Phillips, Nathaniel Fox

Abstract

Web archive data usually contains high-quality documents that are very useful for creating specialized collections of documents, e.g., scientific digital libraries and repositories of technical reports. Building such collections creates a substantial need for automatic approaches that can distinguish the documents of interest for a collection from the huge number of documents collected by web archiving institutions. In this paper, we explore different learning models and feature representations to determine the best-performing ones for identifying the documents of interest in web-archived data. Specifically, we study both machine learning and deep learning models, "bag of words" (BoW) features extracted from the entire document or from specific portions of the document, and structural features that capture the structure of documents. We focus our evaluation on three datasets that we created from three different web archives. Our experimental results show that the BoW classifiers that focus only on specific portions of the documents (rather than the full text) outperform all compared methods on all three datasets.
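
To make the distinction between full-text and portion-based BoW features concrete, the following is a minimal sketch in Python. It is not the authors' implementation: the abstract does not say which portions of a document are used, so the `head` and `tail` portions, the `n_tokens` cutoff, and the function names `tokenize` and `bow_features` are all illustrative assumptions.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def bow_features(text, portion="full", n_tokens=100):
    """Return bag-of-words counts for the full text or for one portion of it.

    portion: "full", "head" (first n_tokens tokens), or "tail" (last n_tokens).
    The portion names and the cutoff are assumptions made for illustration;
    the abstract does not specify which portions the paper uses.
    """
    tokens = tokenize(text)
    if portion == "head":
        tokens = tokens[:n_tokens]
    elif portion == "tail":
        # Guard n_tokens == 0: tokens[-0:] would return the whole list.
        tokens = tokens[-n_tokens:] if n_tokens > 0 else []
    elif portion != "full":
        raise ValueError(f"unknown portion: {portion!r}")
    return Counter(tokens)


# Hypothetical document: a short title, a long body, and a short ending.
doc = (
    "Technical Report on Web Archives. "
    + "filler body text " * 50
    + "References and appendix"
)

head = bow_features(doc, portion="head", n_tokens=5)
tail = bow_features(doc, portion="tail", n_tokens=3)
full = bow_features(doc, portion="full")
```

In this toy document the `head` features contain only the title words, while the `full` features are dominated by the repeated body tokens. That is one plausible intuition for why a portion-based representation might separate document types more cleanly, although the abstract itself reports only the empirical result. The resulting counts could then be passed to any standard classifier.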
