Paper Title

A Comparative Study of Feature Types for Age-Based Text Classification

Authors

Anna Glazkova, Yury Egorov, Maksim Glazkov

Abstract

The ability to automatically determine the target age group of a work of fiction opens up many opportunities for developing information retrieval tools. First, developers of book recommendation systems and electronic libraries may want to filter texts by the age of their most likely readers. Second, parents may want to select literature for their children. Finally, writers and publishers would benefit from knowing which features influence whether a text is suitable for children. In this article, we compare the empirical effectiveness of various types of linguistic features for age-based classification of fiction texts. For this purpose, we collected a text corpus of book previews, each labeled with one of two categories: children's or adult. We evaluated the following feature types: readability indices, sentiment, lexical, grammatical, and general features, and publishing attributes. The results show that features describing the text at the document level can significantly improve the quality of machine learning models.
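To make the idea of document-level features concrete, the sketch below computes a few of them for a single text. It is illustrative only: the function name `document_features`, the choice of the Automated Readability Index (ARI), and the regex-based tokenization are assumptions made here, not the feature set or tooling described by the authors. ARI is also an English-oriented index, and this version counts letters only.

```python
import re


def document_features(text: str) -> dict:
    """Compute a few simple document-level features (hypothetical helper)."""
    # Naive sentence split on terminal punctuation; empty fragments are dropped.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Words are runs of Unicode letters (digits and underscores excluded).
    words = re.findall(r"[^\W\d_]+", text)

    # Guard against division by zero on empty input.
    n_sentences = max(len(sentences), 1)
    n_words = max(len(words), 1)
    n_chars = sum(len(w) for w in words)

    # Automated Readability Index:
    # 4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43
    ari = 4.71 * n_chars / n_words + 0.5 * n_words / n_sentences - 21.43

    return {
        "ari": ari,
        "avg_sentence_len": len(words) / n_sentences,
        "avg_word_len": n_chars / n_words,
        # Lexical diversity: distinct lowercased words over total words.
        "type_token_ratio": len({w.lower() for w in words}) / n_words,
    }


if __name__ == "__main__":
    sample = "The cat sat. The dog ran!"
    for name, value in document_features(sample).items():
        print(f"{name}: {value:.3f}")
```

A vector of such values per book preview could then be passed to any standard classifier alongside other feature types; which features actually help is the empirical question the abstract addresses.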
