Paper title
The Importance of Suppressing Domain Style in Authorship Analysis
Paper authors
Abstract
Many approaches to authorship analysis presuppose a representation of writing style. Yet despite decades of research, it remains unclear to what extent commonly used and widely accepted representations, such as character trigram frequencies, actually capture an author's writing style rather than more domain-specific style components or even topic. We address this shortcoming for the first time in a novel experimental setup that keeps the authors fixed but swaps the domains between training and testing. With this setup, we reveal that approaches using character trigram features are highly prone to relying on domain information when applied without attention to domains, suffering drops of up to 55.4 percentage points in classification accuracy under domain swapping. We further propose a new remedy based on domain-adversarial learning and compare it to remedies from the literature that are based on heuristic rules. Both can work well, reducing accuracy losses under domain swapping to 3.6% and 3.9%, respectively.
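For readers unfamiliar with the representation discussed in the abstract, the following is a minimal sketch of how character trigram frequencies might be computed for a single text. The function name `char_trigram_frequencies` and the choice of relative (normalized) frequencies over overlapping trigrams are assumptions made for illustration; the abstract does not specify the paper's exact preprocessing or feature pipeline.

```python
from collections import Counter


def char_trigram_frequencies(text: str) -> dict[str, float]:
    """Return the relative frequency of each overlapping character trigram.

    Illustrative only: no lowercasing, whitespace normalization, or
    vocabulary truncation is applied, although a real pipeline might do so.
    """
    # Slide a window of width 3 over the text, one character at a time.
    trigrams = [text[i:i + 3] for i in range(len(text) - 2)]
    total = len(trigrams)
    if total == 0:
        # Texts shorter than three characters contain no trigrams.
        return {}
    counts = Counter(trigrams)
    # Normalize counts so that the frequencies sum to 1.
    return {gram: n / total for gram, n in counts.items()}


freqs = char_trigram_frequencies("the theme")
```

Because such features are computed directly from surface characters, they can encode topic words and domain conventions as readily as author-specific habits, which is the susceptibility the abstract describes.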