Classification of Spam Emails through Hierarchical Clustering and Supervised Learning

Paper title

Classification of Spam Emails through Hierarchical Clustering and Supervised Learning

Authors

Francisco Jáñez-Martino, Eduardo Fidalgo, Santiago González-Martínez, Javier Velasco-Mata

Abstract

Spammers take advantage of the popularity of email to send unsolicited messages indiscriminately. Although researchers and organizations continuously develop anti-spam filters based on binary classification, spammers bypass them with new strategies such as word obfuscation or image-based spam. For the first time in the literature, we propose classifying spam emails into categories, rather than relying only on a binary model, to improve the handling of spam that has already been detected. First, we applied a hierarchical clustering algorithm to create SPEMC-11K (SPam EMail Classification), the first multi-class dataset, which contains three types of spam email: Health and Technology, Personal Scams, and Sexual Content. Then, we used SPEMC-11K to evaluate combinations of TF-IDF and BOW (bag-of-words) encodings with Naïve Bayes (NB), Decision Tree, and SVM classifiers. Finally, for multi-class spam classification we recommend (i) TF-IDF combined with SVM for the best micro F1 score, 95.39%, and (ii) TF-IDF with NB for the fastest classification, analyzing an email in 2.13 ms.
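As an illustration of one of the combinations named in the abstract (TF-IDF encoding with a Naïve Bayes classifier, scored with micro F1), the following is a minimal sketch in plain Python. It is not the authors' implementation. The toy emails, the label strings, and all function and class names are hypothetical, and feeding TF-IDF weights into a multinomial Naïve Bayes model is an assumption made here for illustration.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    # Naive whitespace tokenizer; real pipelines would clean the text further.
    return text.lower().split()

def fit_idf(docs):
    """Smoothed inverse document frequency for each term in tokenized docs."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}

def tfidf(doc, idf):
    """Encode a tokenized doc as {term: tf * idf}, ignoring unseen terms."""
    tf = Counter(t for t in doc if t in idf)
    return {t: c * idf[t] for t, c in tf.items()}

class MultinomialNB:
    """Multinomial Naive Bayes over sparse weight vectors (Laplace smoothing)."""

    def fit(self, vectors, labels, vocab, alpha=1.0):
        self.classes = sorted(set(labels))
        self.log_prior, self.log_like = {}, {}
        for c in self.classes:
            rows = [v for v, y in zip(vectors, labels) if y == c]
            self.log_prior[c] = math.log(len(rows) / len(labels))
            totals = defaultdict(float)
            for v in rows:
                for t, w in v.items():
                    totals[t] += w
            denom = sum(totals.values()) + alpha * len(vocab)
            self.log_like[c] = {t: math.log((totals[t] + alpha) / denom)
                                for t in vocab}
        return self

    def predict(self, vector):
        def score(c):
            return self.log_prior[c] + sum(
                w * self.log_like[c][t] for t, w in vector.items())
        return max(self.classes, key=score)

def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP, FP and FN over all classes."""
    tp = sum(t == p for t, p in zip(y_true, y_pred))
    # In single-label classification each error is one FP and one FN.
    fp = fn = len(y_true) - tp
    return 2 * tp / (2 * tp + fp + fn) if y_true else 0.0

# Hypothetical toy data (NOT taken from SPEMC-11K).
train = [
    ("cheap pills pharmacy discount software", "health_tech"),
    ("weight loss pills cheap pharmacy", "health_tech"),
    ("lottery winner claim your prize money", "personal_scam"),
    ("bank transfer inheritance money urgent", "personal_scam"),
    ("adult dating singles chat tonight", "sexual_content"),
    ("hot singles adult webcam chat", "sexual_content"),
]
test = [
    ("discount pharmacy pills", "health_tech"),
    ("claim inheritance money", "personal_scam"),
    ("adult singles webcam", "sexual_content"),
]

train_docs = [tokenize(text) for text, _ in train]
idf = fit_idf(train_docs)
model = MultinomialNB().fit(
    [tfidf(d, idf) for d in train_docs], [y for _, y in train], vocab=idf)

expected = [y for _, y in test]
preds = [model.predict(tfidf(tokenize(text), idf)) for text, _ in test]
score = micro_f1(expected, preds)
print(preds, score)
```

For single-label multi-class problems like this one, micro F1 equals plain accuracy, because every misclassified email counts once as a false positive and once as a false negative.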
