Comparative Study of Indexing and Search Strategies for the Hindi, Marathi, and Bengali Languages

Author(s)
Dolamic, Ljiljana
Savoy, Jacques  
Institut d'informatique  
Date issued
2010
In
ACM Transactions on Asian Language Information Processing (T.A.L.I.P.)
Vol
9
No
3
From page
art. 1
Subjects
Algorithms Measurement Performance
Abstract
The main goal of this article is to describe and evaluate various indexing and search strategies for the Hindi, Bengali, and Marathi languages. These three languages rank among the world’s 20 most spoken languages, and they share similar syntax, morphology, and writing systems. In this article we examine these languages from an Information Retrieval (IR) perspective by describing the key elements of their inflectional and derivational morphologies, and we propose a light and a more aggressive stemming approach based on them.
In our evaluation of these stemming strategies we use the FIRE 2008 test collections. To broaden our comparisons, we implement and evaluate two language-independent indexing methods: n-gram and trunc-n (truncation to the first n letters). We evaluate these solutions with various IR models, including the Okapi, Divergence from Randomness (DFR), and statistical language models (LM), together with two classical vector-space approaches: tf idf and Lnu-ltc.
Experiments performed with all three languages demonstrate that the I(n_e)C2 model derived from the Divergence from Randomness paradigm tends to provide the best mean average precision (MAP). Our own tests suggest that more aggressive stemmers, especially those accounting for certain derivational suffixes, yield better retrieval effectiveness than a light stemmer or no word normalization at all. Comparisons between no-stemming and stemming indexing schemes show that performance differences are almost always statistically significant. When, for example, an aggressive stemmer is applied, the relative improvements obtained are ≈28% for the Hindi language, ≈42% for Marathi, and ≈18% for Bengali, as compared to a no-stemming approach.
Based on a comparison of word-based and language-independent approaches we find that the trunc-4 indexing scheme tends to result in performance levels statistically similar to those of an aggressive stemmer, yet better than the 4-gram indexing scheme. A query-by-query analysis reveals the reasons for this, and also demonstrates the advantage of applying a stemming or a trunc-4 indexing scheme.
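To make the two language-independent schemes named in the abstract concrete, here is a minimal Python sketch of trunc-n and n-gram term generation. It is an illustration under stated assumptions, not the authors' implementation: the function names (trunc_n, char_ngrams, index_terms) are hypothetical, words are split on whitespace only, n-grams are taken within a single word, words no longer than n are kept whole, and a "letter" is treated as one Unicode code point. That last assumption matters for Devanagari and Bengali script, where a vowel sign is a separate code point from its consonant, so the article's notion of a letter may differ.

```python
def trunc_n(word: str, n: int = 4) -> str:
    """Keep only the first n characters of a word (trunc-n)."""
    return word[:n]


def char_ngrams(word: str, n: int = 4) -> list[str]:
    """Return the overlapping character n-grams of a single word.

    Assumption: a word with n or fewer characters is indexed as is.
    """
    if len(word) <= n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def index_terms(text: str, scheme: str = "trunc", n: int = 4) -> list[str]:
    """Turn whitespace-separated text into indexing terms.

    scheme is "trunc" (one term per word) or "ngram" (several per word).
    """
    terms: list[str] = []
    for word in text.split():
        if scheme == "trunc":
            terms.append(trunc_n(word, n))
        elif scheme == "ngram":
            terms.extend(char_ngrams(word, n))
        else:
            raise ValueError(f"unknown scheme: {scheme!r}")
    return terms


if __name__ == "__main__":
    print(index_terms("indexing strategies", scheme="trunc"))
    print(char_ngrams("computer", 4))
```

With n = 4, trunc-n produces exactly one term per word, whereas the 4-gram scheme produces several overlapping terms for any word longer than four characters, so the two schemes build indexes of quite different sizes from the same text.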
Publication type
journal article
Identifiers
https://libra.unine.ch/handle/20.500.14713/65208
DOI
10.1145/1838745.1838748
File(s)
Name
Dolamic_Ljiljana-Comparative_study_of_indexing-20130107.pdf
Type
Main Article
Size
1.97 MB
Format
Adobe PDF
