Paper title
Detecting Hate Speech in Multi-modal Memes
Paper authors
Abstract
In the past few years, there has been a surge of interest in multi-modal problems, from image captioning to visual question answering and beyond. In this paper, we focus on hate speech detection in multi-modal memes, which pose an interesting multi-modal fusion problem. We address the Facebook Meme Challenge \cite{kiela2020hateful}, a binary classification problem of predicting whether a meme is hateful or not. A crucial characteristic of the challenge is that it includes "benign confounders" to counter the possibility of models exploiting unimodal priors. The challenge states that state-of-the-art models perform poorly compared to humans. During our analysis of the dataset, we realized that the majority of the data points that are originally hateful are turned benign just by describing the image of the meme. Also, the majority of the multi-modal baselines give more preference to the hate speech (language modality). To tackle these problems, we explore the visual modality using object detection and image captioning models to fetch the "actual caption" and then combine it with the multi-modal representation to perform binary classification. This approach tackles the benign text confounders present in the dataset to improve performance. Another approach we experiment with is to improve the prediction with sentiment analysis. Instead of using only multi-modal representations obtained from pre-trained neural networks, we also include the unimodal sentiment to enrich the features. We perform a detailed analysis of the above two approaches, providing compelling reasons in favor of the methodologies used.
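The feature enrichment described in the abstract can be illustrated with a minimal sketch. The abstract does not say how the representations are combined, so plain concatenation followed by a logistic classification head is an assumption made here for illustration. The function names `fuse_features` and `predict_hateful`, the vector sizes, and all numeric values are hypothetical placeholders, not the authors' implementation.

```python
import math
from typing import List, Sequence


def fuse_features(multimodal: Sequence[float],
                  caption: Sequence[float],
                  sentiment: Sequence[float]) -> List[float]:
    """Concatenate the three feature vectors (assumed fusion scheme)."""
    return list(multimodal) + list(caption) + list(sentiment)


def predict_hateful(features: Sequence[float],
                    weights: Sequence[float],
                    bias: float = 0.0) -> float:
    """Logistic head: returns P(hateful) for one fused feature vector."""
    if len(features) != len(weights):
        raise ValueError("features and weights must have the same length")
    logit = sum(f * w for f, w in zip(features, weights)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


# Toy placeholder values, not real embeddings or trained weights.
multimodal_repr = [0.2, -0.1, 0.4]  # stands in for a pre-trained multi-modal encoder output
caption_repr = [0.3, 0.0]           # stands in for an embedding of the generated "actual caption"
sentiment_feats = [-0.8]            # stands in for a unimodal sentiment score

fused = fuse_features(multimodal_repr, caption_repr, sentiment_feats)
weights = [0.0] * len(fused)        # untrained weights, so the logit is zero
probability = predict_hateful(fused, weights)
print(len(fused), probability)
```

In a real system the weights would be learned on the challenge's training split, and the classification head would likely be a neural layer rather than a hand-written logistic function.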