Paper Title

Method for Customizable Automated Tagging: Addressing the Problem of Over-tagging and Under-tagging Text Documents

Authors

Pandya, Maharshi R., Reyes, Jessica, Vanderheyden, Bob

Abstract

Using author-provided tags to predict tags for a new document often results in the overgeneration of tags. When the author provides no tags at all, documents suffer from severe under-tagging. In this paper, we present a method to generate a universal set of tags that can be applied widely to a large document corpus. First, using IBM Watson's NLU service, we collect keywords and phrases, which we call "complex document tags", from 8,854 popular reports in the corpus. We then apply an LDA model to these complex document tags to generate a set of 765 unique "simple tags". To apply the tags to a corpus of documents, we run each document through IBM Watson NLU and assign the appropriate simple tags. Using only 765 simple tags, our method tags 87,397 of the 88,583 documents in the corpus with at least one tag. About 92.1% of these 87,397 tagged documents are also determined to be sufficiently tagged. Finally, we discuss the performance of our method and its limitations.
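The tag-application step and the coverage figures quoted in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `COMPLEX_TO_SIMPLE` lookup is a hypothetical stand-in for the assignment that the LDA model would produce, the example tags are invented for illustration, and the complex tags are passed in directly rather than fetched from IBM Watson NLU. Only the document counts come from the abstract.

```python
# Hypothetical lookup from "complex document tags" to "simple tags".
# In the described method this assignment would come from the LDA model;
# the entries here are invented purely for illustration.
COMPLEX_TO_SIMPLE = {
    "machine learning models": "machine learning",
    "neural network training": "machine learning",
    "cloud storage pricing": "cloud",
    "hybrid cloud deployment": "cloud",
}


def simple_tags_for(complex_tags):
    """Return the sorted, de-duplicated simple tags matched by a document's complex tags."""
    return sorted({COMPLEX_TO_SIMPLE[t] for t in complex_tags if t in COMPLEX_TO_SIMPLE})


def coverage_percent(tagged, total):
    """Share of documents with at least one tag, as a percentage."""
    return 100.0 * tagged / total


# Two complex tags that collapse into one simple tag, plus one unknown tag.
doc_tags = simple_tags_for(
    ["machine learning models", "neural network training", "quarterly revenue"]
)

# Figures taken from the abstract.
tagged_docs, total_docs = 87_397, 88_583
coverage = round(coverage_percent(tagged_docs, total_docs), 2)
sufficiently_tagged = round(0.921 * tagged_docs)

print(doc_tags, coverage, sufficiently_tagged)
```

Under these assumptions, collapsing many complex tags into a small shared vocabulary is what limits over-tagging, while the coverage ratio measures how much under-tagging remains: 87,397 of 88,583 documents is roughly 98.7% of the corpus, and 92.1% of the tagged documents is roughly 80,500 documents.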
