Paper Title

Web Document Categorization Using Naive Bayes Classifier and Latent Semantic Analysis

Authors

Sedghpour, Alireza Saleh; Sedghpour, Mohammad Reza Saleh

Abstract

The rapid growth of web documents, driven by heavy use of the World Wide Web, calls for efficient techniques to classify documents on the web. The web produces large volumes of highly diverse data every second, and automatically classifying this growing body of web documents is one of the biggest challenges we face today. Probabilistic classification algorithms such as Naive Bayes are commonly used for web document classification, mainly because of their relatively high classification accuracy in many application areas. However, they lack support for the high-dimensional, sparse data that characterizes textual data representation. In addition, traditional feature selection methods commonly neglect the semantic relations between words when dealing with big data and large-scale web documents. To address these problems, we propose a method for web document classification that uses Latent Semantic Analysis (LSA) to increase the similarity of documents in the same class and improve classification precision. Using this approach, we designed a faster and more accurate classifier for web documents. Experimental results show that this preprocessing effectively improves both the accuracy and the speed of Naive Bayes, and the precision and recall metrics confirm the improvement.
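The general idea of placing LSA in front of a Naive Bayes classifier can be illustrated with a minimal sketch. The abstract does not specify the term weighting, the number of latent dimensions, or the Naive Bayes variant, so everything below is an assumption made for illustration: raw term counts, a truncated SVD with k = 2, a Gaussian Naive Bayes model (chosen because LSA features can be negative), and a hypothetical six-document toy corpus. All function names are illustrative and do not come from the paper.

```python
import numpy as np

# Hypothetical toy corpus (not from the paper): 0 = sports, 1 = finance.
train_docs = [
    "football match goal team league",
    "team wins league football cup",
    "goal scored in cup match",
    "stock market shares price bank",
    "bank raises interest price market",
    "shares fall as market trading slows",
]
train_labels = np.array([0, 0, 0, 1, 1, 1])


def build_vocab(docs):
    """Map every distinct word in the corpus to a column index."""
    words = sorted({w for d in docs for w in d.split()})
    return {w: i for i, w in enumerate(words)}


def term_counts(docs, vocab):
    """Document-term count matrix; out-of-vocabulary words are ignored."""
    X = np.zeros((len(docs), len(vocab)))
    for row, doc in enumerate(docs):
        for w in doc.split():
            if w in vocab:
                X[row, vocab[w]] += 1.0
    return X


def fit_lsa(X, k):
    """LSA via truncated SVD: keep the top-k right singular vectors."""
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return Vt[:k].T  # shape (n_terms, k)


def fit_gaussian_nb(Z, y, eps=1e-3):
    """Per-class mean, variance (smoothed by eps) and prior for each feature."""
    classes = np.unique(y)
    means = np.array([Z[y == c].mean(axis=0) for c in classes])
    variances = np.array([Z[y == c].var(axis=0) for c in classes]) + eps
    priors = np.array([(y == c).mean() for c in classes])
    return classes, means, variances, priors


def predict_gaussian_nb(model, Z):
    """Pick the class maximizing log prior + sum of per-feature Gaussian log-likelihoods."""
    classes, means, variances, priors = model
    diff = Z[:, None, :] - means[None, :, :]
    log_lik = -0.5 * (np.log(2 * np.pi * variances)[None] + diff ** 2 / variances[None])
    scores = log_lik.sum(axis=2) + np.log(priors)[None]
    return classes[np.argmax(scores, axis=1)]


vocab = build_vocab(train_docs)
X_train = term_counts(train_docs, vocab)
V_k = fit_lsa(X_train, k=2)                       # sparse term space -> 2 latent dimensions
model = fit_gaussian_nb(X_train @ V_k, train_labels)


def classify(docs):
    """Project new documents into the latent space, then apply Naive Bayes."""
    return predict_gaussian_nb(model, term_counts(docs, vocab) @ V_k)


print(classify(["league cup football goal", "bank shares price trading"]))
```

The classifier here is trained on k dense features rather than on one feature per vocabulary term, which is one plausible reading of how the preprocessing could help with both speed and high-dimensional sparse input. Whether the paper uses this exact pipeline is not stated in the abstract.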
