Fraud Detection Using Optimized Machine Learning Tools Under Imbalance Classes

Paper Title

Fraud Detection Using Optimized Machine Learning Tools Under Imbalance Classes

Authors

Isangediok, Mary; Gajamannage, Kelum

Abstract

Fraud detection is a challenging task due to the changing nature of fraud patterns over time and the limited availability of fraud examples from which to learn such sophisticated patterns. Thus, fraud detection with the aid of smart versions of machine learning (ML) tools is essential to assure safety. Fraud detection is primarily an ML classification task; however, the optimum performance of the corresponding ML tool relies on the use of the best hyperparameter values. Moreover, classification under imbalanced classes is quite challenging because it causes poor performance on the minority classes, which most ML classification techniques ignore. Thus, we investigate four state-of-the-art ML techniques, namely, logistic regression, decision trees, random forest, and extreme gradient boosting, that are suitable for handling imbalanced classes to maximize precision and simultaneously reduce false positives. First, these classifiers are trained on two original benchmark unbalanced fraud detection datasets, namely, phishing website URLs and fraudulent credit card transactions. Then, three synthetically balanced datasets are produced for each original dataset by implementing the sampling frameworks RandomUnderSampler, SMOTE, and SMOTEENN. The optimum hyperparameters for all 16 experiments are found using RandomizedSearchCV. The validity of the 16 approaches in the context of fraud detection is compared using two benchmark performance metrics, namely, area under the receiver operating characteristic curve (AUC ROC) and area under the precision-recall curve (AUC PR). For both the phishing website URL and credit card fraud transaction datasets, the results indicate that extreme gradient boosting trained on the original data shows trustworthy performance on the imbalanced dataset and outperforms the other three methods in terms of both AUC ROC and AUC PR.
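
The workflow described in the abstract (train on the original imbalanced data and on a rebalanced copy, tune with RandomizedSearchCV, compare AUC ROC and AUC PR) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' code: it uses synthetic data from scikit-learn's make_classification rather than the phishing or credit card datasets, only logistic regression out of the four classifiers, a hand-written random undersampling helper standing in for imbalanced-learn's RandomUnderSampler, and a hypothetical search space for C. The names random_undersample and tune_and_score are made up for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, precision_recall_curve, roc_auc_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split


def random_undersample(X, y, seed=0):
    """Hypothetical helper: randomly drop majority-class (label 0) rows
    until both classes have the same number of samples."""
    rng = np.random.default_rng(seed)
    minority = np.flatnonzero(y == 1)
    majority = np.flatnonzero(y == 0)
    kept = rng.choice(majority, size=minority.size, replace=False)
    idx = np.concatenate([minority, kept])
    rng.shuffle(idx)
    return X[idx], y[idx]


def tune_and_score(X_train, y_train, X_test, y_test, seed=0):
    """Tune logistic regression with a randomized search, then report
    AUC ROC and AUC PR on the held-out (still imbalanced) test set."""
    search = RandomizedSearchCV(
        LogisticRegression(max_iter=1000),
        # Assumed search space, chosen only for illustration.
        param_distributions={"C": [0.01, 0.1, 1.0, 10.0, 100.0]},
        n_iter=4,
        scoring="average_precision",
        cv=3,
        random_state=seed,
    )
    search.fit(X_train, y_train)
    scores = search.predict_proba(X_test)[:, 1]
    precision, recall, _ = precision_recall_curve(y_test, scores)
    return {
        "best_params": search.best_params_,
        "auc_roc": roc_auc_score(y_test, scores),
        "auc_pr": auc(recall, precision),  # trapezoidal area under the PR curve
    }


# Synthetic stand-in for a fraud dataset: roughly 5% positives.
X, y = make_classification(
    n_samples=3000, n_features=10, weights=[0.95, 0.05], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Only the training split is rebalanced; the test split keeps its imbalance.
X_bal, y_bal = random_undersample(X_train, y_train)

results = {
    "original": tune_and_score(X_train, y_train, X_test, y_test),
    "undersampled": tune_and_score(X_bal, y_bal, X_test, y_test),
}
for name, metrics in results.items():
    print(name, metrics)
```

Resampling is applied to the training split only, so both variants are scored on the same imbalanced test data. Extending the sketch to the full 16-experiment grid would mean looping over the other classifiers and swapping in SMOTE and SMOTEENN from imbalanced-learn.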
