Paper Title
Toward Pareto Efficient Fairness-Utility Trade-off in Recommendation through Reinforcement Learning
Paper Authors
Paper Abstract
The issue of fairness in recommendation is becoming increasingly essential as recommender systems touch and influence more and more people in their daily lives. In fairness-aware recommendation, most existing algorithmic approaches aim at solving a constrained optimization problem, imposing a constraint on the level of fairness while optimizing the main recommendation objective, e.g., CTR. While this alleviates the impact of unfair recommendations, such a constrained approach may significantly compromise recommendation accuracy due to the inherent trade-off between fairness and utility. This motivates us to deal with these conflicting objectives and explore the optimal trade-off between them in recommendation. One promising approach is to seek a Pareto efficient solution that guarantees an optimal compromise between utility and fairness. Moreover, considering the needs of real-world e-commerce platforms, it would be even more desirable if we could generalize the whole Pareto Frontier, so that decision-makers can specify any preference of one objective over another based on their current business needs. Therefore, in this work, we propose a fairness-aware recommendation framework using multi-objective reinforcement learning, called MoFIR, which learns a single parametric representation for optimal recommendation policies over the space of all possible preferences. Specifically, we modify traditional DDPG by introducing a conditioned network, which conditions the networks directly on these preferences and outputs Q-value vectors. Experiments on several real-world recommendation datasets verify the superiority of our framework on both fairness metrics and recommendation measures when compared with all other baselines. We also extract the approximate Pareto Frontier generated by MoFIR on real-world datasets and compare it with state-of-the-art fairness methods.
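As a rough illustration of the preference-conditioned critic mentioned in the abstract, the sketch below shows a DDPG-style critic that takes a state, an action, and a preference vector over the objectives (e.g., recommendation utility and fairness) and outputs a Q-value vector with one entry per objective. This is a minimal sketch under our own assumptions, not the paper's implementation; the class name `ConditionedCritic`, the layer sizes, and the scalarization step are illustrative.

```python
import torch
import torch.nn as nn


class ConditionedCritic(nn.Module):
    """Critic conditioned on a preference vector over objectives.

    Maps (state, action, preference) to a Q-value vector with one entry per
    objective; the scalar value used for policy updates is the
    preference-weighted sum of that vector.  Illustrative sketch only.
    """

    def __init__(self, state_dim, action_dim, num_objectives, hidden_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim + num_objectives, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_objectives),  # one Q-value per objective
        )

    def forward(self, state, action, preference):
        x = torch.cat([state, action, preference], dim=-1)
        return self.net(x)  # shape: (batch, num_objectives)


if __name__ == "__main__":
    # Hypothetical dimensions for a user-state / recommendation-action encoding.
    critic = ConditionedCritic(state_dim=32, action_dim=16, num_objectives=2)
    s = torch.randn(4, 32)                      # batch of user states
    a = torch.randn(4, 16)                      # batch of recommendation actions
    w = torch.rand(4, 2)
    w = w / w.sum(dim=-1, keepdim=True)         # preference weights sum to 1
    q_vec = critic(s, a, w)                     # per-objective Q-values
    q_scalar = (w * q_vec).sum(dim=-1)          # preference-weighted value
```

Under this kind of setup, sweeping the preference vector at evaluation time and measuring the resulting utility and fairness of each induced policy is one way a single trained model can trace out an approximate Pareto frontier.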