Experiments with Different Indexing Techniques for Text Retrieval Tasks on Gujarati Language Using Bag of Words Approach

Paper Title

Experiments with Different Indexing Techniques for Text Retrieval tasks on Gujarati Language using Bag of Words Approach

Authors

Pareek, Jyoti; Joshi, Hardik; Chauhan, Krunal; Patel, Rushikesh

Abstract

This paper presents the results of experiments carried out to improve text retrieval for Gujarati text documents. Text retrieval involves searching and ranking text documents for a given set of query terms. We tested several retrieval models that use the bag-of-words approach, a traditional approach still in use today in which a text document is represented as a collection of words. Measures such as frequency count and inverse document frequency are used to identify and rank relevant documents for user queries. The performance of the different ranking models is quantified using mean average precision (MAP). Because Gujarati is a morphologically rich language, we compared stop word removal, stemming, and frequent case generation against a baseline to measure the improvement in information retrieval tasks. Most of these techniques are language dependent and require the development of language-specific tools. Using a plain unprocessed word index as the baseline, we observed significant improvements in MAP values after applying the different indexing techniques.
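To make the terms in the abstract concrete, the following is a minimal sketch of bag-of-words ranking and MAP evaluation. It is not the paper's implementation: the whitespace tokenizer, the raw tf × log(N/df) weighting, and all function names are assumptions chosen for illustration, and the toy documents are in English. Language-specific steps such as stop word removal or stemming would plug into `tokenize`.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: plain whitespace tokenization with lowercasing.
    # Stop word removal or stemming would be applied here.
    return text.lower().split()


def build_index(docs):
    """Return per-document term counts and per-term document frequencies."""
    term_counts = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())  # each document counts once per term
    return term_counts, doc_freq


def rank(query, term_counts, doc_freq):
    """Rank documents by the sum of tf * idf over the query terms."""
    n_docs = len(term_counts)
    scores = {}
    for doc_id, counts in term_counts.items():
        score = 0.0
        for term in tokenize(query):
            if counts[term] > 0:
                score += counts[term] * math.log(n_docs / doc_freq[term])
        if score > 0:
            scores[doc_id] = score
    # Highest score first; ties broken by document id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


def average_precision(ranked, relevant):
    """Average of precision values at the rank of each relevant document."""
    hits = 0
    total = 0.0
    for position, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / position
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """runs: list of (ranked_list, relevant_set) pairs, one per query."""
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Toy usage with hypothetical documents.
docs = {"d1": "red apple red", "d2": "green apple", "d3": "blue sky"}
term_counts, doc_freq = build_index(docs)
ranking = rank("red apple", term_counts, doc_freq)
```

Comparing indexing techniques in this setting amounts to changing `tokenize`, rebuilding the index, and comparing the resulting MAP against the value obtained with the unprocessed word index.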
