Paper Title

A Distributed Real-Time Recommender System for Big Data Streams

Authors

Heidy Hazem, Ahmed Awad, Ahmed Hassan

Abstract

In today's data-driven world, recommender systems (RS) play a crucial role in supporting decision-making. As users become continuously connected to the internet, they grow less patient and less tolerant of obsolete recommendations made by an RS, e.g., movie recommendations on Netflix or books to read on Amazon. This, in turn, requires continuous training of the RS to cope with both the online arrival of data and the changing nature of user tastes and interests, known as concept drift. A streaming (online) RS has to address three requirements: continuous training and recommendation, handling concept drift, and the ability to scale. Streaming recommender systems proposed in the literature mostly address the first two requirements and do not consider scalability, because they run the training process on a single machine. Such a machine, no matter how powerful, will eventually fail to cope with the volume of data, a lesson learned from big data processing. To tackle the third requirement, we propose a Splitting and Replication mechanism for building distributed streaming recommender systems. Our mechanism is inspired by the successful shared-nothing architecture that underpins contemporary big data processing systems. We have applied our mechanism to two well-known approaches for online recommender systems, namely matrix factorization and item-based collaborative filtering. We have implemented our mechanism on top of Apache Flink. We conducted experiments comparing the performance of the baseline (single-machine) approach with that of our distributed approach. Across different data sets, we observed improvements in processing latency, throughput, and accuracy. Our experiments show a 40% improvement in online recall with more than 50% less memory consumption.
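
The abstract reports accuracy as online recall for a matrix factorization recommender trained on a stream. As a rough illustration only, the Python sketch below shows one common way such a setup can be built on a single machine: a positive-only incremental matrix factorization updated by SGD on each event, scored with test-then-train (prequential) recall@N. It is not the authors' implementation and does not show the Splitting and Replication mechanism or Apache Flink. The names IncrementalMF and prequential_recall, the (user, item) event format, and all hyperparameters are hypothetical choices made for this example.

```python
import random
from collections import defaultdict


class IncrementalMF:
    """Positive-only matrix factorization, updated one event at a time (assumed design)."""

    def __init__(self, n_factors=8, lr=0.05, reg=0.01, seed=0):
        self.k, self.lr, self.reg = n_factors, lr, reg
        self.rng = random.Random(seed)   # seeded so runs are repeatable
        self.users = {}                  # user id -> latent vector
        self.items = {}                  # item id -> latent vector
        self.seen = defaultdict(set)     # user id -> items already consumed

    def _new_vector(self):
        return [self.rng.gauss(0.0, 0.1) for _ in range(self.k)]

    def score(self, user, item):
        return sum(p * q for p, q in zip(self.users[user], self.items[item]))

    def recommend(self, user, n):
        """Return the top-n unseen items for a known user, else an empty list."""
        if user not in self.users:
            return []
        candidates = [i for i in self.items if i not in self.seen[user]]
        candidates.sort(key=lambda i: self.score(user, i), reverse=True)
        return candidates[:n]

    def update(self, user, item):
        """One SGD step that pushes the (user, item) score towards 1."""
        if user not in self.users:
            self.users[user] = self._new_vector()
        if item not in self.items:
            self.items[item] = self._new_vector()
        p, q = self.users[user], self.items[item]
        err = 1.0 - self.score(user, item)
        for f in range(self.k):
            pf, qf = p[f], q[f]
            p[f] += self.lr * (err * qf - self.reg * pf)
            q[f] += self.lr * (err * pf - self.reg * qf)
        self.seen[user].add(item)


def prequential_recall(stream, model, n=10):
    """Test-then-train: recommend first, count a hit, then learn from the event."""
    hits = total = 0
    for user, item in stream:
        hits += item in model.recommend(user, n)
        total += 1
        model.update(user, item)
    return hits / total if total else 0.0


if __name__ == "__main__":
    rng = random.Random(42)
    # Synthetic stream: 20 users, 30 items, 500 interaction events.
    events = [(f"u{rng.randrange(20)}", f"i{rng.randrange(30)}") for _ in range(500)]
    print("online recall@10:", prequential_recall(events, IncrementalMF(), n=10))
```

Because every event is used for evaluation before it is used for training, the metric reflects how the model would have performed on data it had not yet seen, which is why this protocol suits streams with concept drift.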
