Paper Title

TF-CR: Weighting Embeddings for Text Classification

Paper Authors

Zubiaga, Arkaitz

Paper Abstract

Text classification, the task of assigning categories to textual instances, is a very common task in information science. Methods that learn distributed representations of words, such as word embeddings, have become popular in recent years as features for text classification tasks. Despite the increasing use of word embeddings for text classification, they are generally used in an unsupervised manner, i.e. information derived from class labels in the training data is not exploited. While word embeddings inherently capture the distributional characteristics of words, and the contexts observed around them in a large dataset, they are not optimised to consider the distributions of words across categories in the classification dataset at hand. To optimise text representations based on word embeddings by incorporating class distributions in the training data, we propose the use of weighting schemes that assign a weight to the embedding of each word based on its saliency in each class. To achieve this, we introduce a novel weighting scheme, Term Frequency-Category Ratio (TF-CR), which weights high-frequency, category-exclusive words higher when computing word embeddings. Our experiments on 16 classification datasets show the effectiveness of TF-CR, leading to improved performance scores over existing weighting schemes, with a performance gap that increases as the size of the training data grows.
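The idea described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's reference implementation: it assumes TF-CR is the product of a term-frequency component (how often a word occurs within a class) and a category-ratio component (what fraction of the word's total occurrences fall in that class), which matches the abstract's description of favoring high-frequency, category-exclusive words; the exact formulation is given in the paper, and the toy corpus below is invented.

```python
from collections import Counter, defaultdict

def tf_cr_weights(docs, labels):
    """Compute an assumed TF-CR weight for each (word, class) pair.

    TF = count of word in class / total word count of the class
    CR = count of word in class / count of word across all classes
    weight = TF * CR  (high for frequent, category-exclusive words)
    """
    class_counts = defaultdict(Counter)  # per-class word counts
    total_counts = Counter()             # corpus-wide word counts
    for doc, label in zip(docs, labels):
        tokens = doc.lower().split()
        class_counts[label].update(tokens)
        total_counts.update(tokens)

    weights = {}
    for label, counts in class_counts.items():
        n_class = sum(counts.values())
        for word, f_wc in counts.items():
            tf = f_wc / n_class            # frequency within the class
            cr = f_wc / total_counts[word]  # exclusivity to the class
            weights[(word, label)] = tf * cr
    return weights

# Toy example: "great" is frequent in and exclusive to the "pos" class.
docs = ["great movie great fun", "boring movie", "great plot"]
labels = ["pos", "neg", "pos"]
w = tf_cr_weights(docs, labels)
```

In the full pipeline these weights would then scale each word's embedding before averaging them into a document representation, so that class-salient words dominate the resulting vector.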
