Paper title
Theedhum Nandrum@Dravidian-CodeMix-FIRE2020: A Sentiment Polarity Classifier for YouTube Comments with Code-switching between Tamil, Malayalam and English
Paper authors
Paper abstract
Theedhum Nandrum is a sentiment polarity detection system that uses two approaches: a Stochastic Gradient Descent (SGD) based classifier and a Long Short-Term Memory (LSTM) based classifier. Our approach uses language features such as the use of emoji, choice of script and code mixing, all of which appeared quite marked in the datasets specified for the Dravidian-CodeMix-FIRE2020 task. The hyperparameters for the SGD classifier were tuned using GridSearchCV. Our system was ranked 4th in Tamil-English with a weighted average F1 score of 0.62 and 9th in Malayalam-English with a score of 0.65. After the task deadline, we achieved a weighted average F1 score of 0.77 for Tamil-English using a Logistic Regression based model. This performance betters the top-ranked classifier on this dataset by a wide margin. Our use of language-specific Soundex to harmonise the spelling variants in code-mixed data appears to be a novel application of Soundex. Our complete code is published on GitHub at https://github.com/oligoglot/theedhum-nandrum.
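The features named in the abstract (script choice, emoji use, and Soundex keys that merge spelling variants) can be illustrated with a minimal sketch. This is not the authors' implementation: the paper uses language-specific Soundex for Tamil and Malayalam, whereas the code below substitutes the classic American Soundex on Latin-script tokens, and the function names `soundex`, `script_of` and `comment_features` are hypothetical. Counting characters of Unicode category "So" is only a crude proxy for emoji.

```python
import unicodedata

# Classic American Soundex digit groups (a stand-in for the
# language-specific Soundex described in the abstract).
_CODES = {}
for _digit, _letters in {"1": "bfpv", "2": "cgjkqsxz", "3": "dt",
                         "4": "l", "5": "mn", "6": "r"}.items():
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(word):
    """Return a 4-character Soundex key, or "" if the word has no a-z letters."""
    letters = [c for c in word.lower() if "a" <= c <= "z"]
    if not letters:
        return ""
    key = letters[0].upper()
    prev = _CODES.get(letters[0], "")
    for c in letters[1:]:
        if c in "hw":
            continue  # h and w do not separate letters with the same code
        code = _CODES.get(c, "")
        if code and code != prev:
            key += code
        prev = code  # vowels reset prev, so a repeated code after a vowel counts
    return (key + "000")[:4]


def script_of(ch):
    """Classify a character as tamil, malayalam or latin by its Unicode name."""
    name = unicodedata.name(ch, "")
    for script in ("TAMIL", "MALAYALAM", "LATIN"):
        if name.startswith(script):
            return script.lower()
    return None


def comment_features(comment):
    """Count characters per script, approximate emoji, and collect Soundex keys."""
    feats = {"tamil": 0, "malayalam": 0, "latin": 0, "emoji": 0}
    for ch in comment:
        script = script_of(ch)
        if script:
            feats[script] += 1
        elif unicodedata.category(ch) == "So":
            feats["emoji"] += 1  # assumption: "So" symbols approximate emoji
    feats["soundex_keys"] = [k for k in map(soundex, comment.split()) if k]
    return feats


if __name__ == "__main__":
    # Romanised spelling variants of the same word collapse to one key.
    print(soundex("semma"), soundex("sema"))
    print(comment_features("semma padam \U0001F60D"))
```

In a setup like the one the abstract describes, such counts and keys would be passed on as features to a classifier; the sketch stops short of that step.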