Machine Identification of High Impact Research through Text and Image Analysis

Paper title

Machine Identification of High Impact Research through Text and Image Analysis

Authors

Stamenovic, Marko; Luo, Jiebo

Abstract

The volume of academic paper submissions and publications is growing at an ever-increasing rate. While this flood of research promises progress in various fields, the sheer volume of output inherently increases the amount of noise. We present a system that automatically separates papers with a high likelihood of gaining citations from those with a low likelihood, as a means to quickly find high-impact, high-quality research. Our system uses both a visual classifier, useful for assessing a document's overall appearance, and a text classifier, for making content-informed decisions. Current work in the field focuses on small datasets composed of papers from individual conferences. Attempts to use similar techniques on larger datasets generally consider only excerpts of the documents, such as the abstract, potentially throwing away valuable data. We rectify these issues by providing a dataset composed of PDF documents and citation counts spanning a decade of output within two separate academic domains: computer science and medicine. This new dataset allows us to expand on current work in the field by generalizing across time and academic domain. Moreover, we explore inter-domain prediction models, evaluating a classifier's performance on a domain it was not trained on, to shed further light on this important problem.
