Indexing and stemming approaches for the Czech language
Information Processing and Management, 2009/45/6/714-720
This paper describes and evaluates various stemming and indexing strategies for the Czech language. Based on a Czech test collection, we designed and evaluated two stemming approaches, a light one and a more aggressive one. We compared them with a no-stemming scheme as well as a language-independent approach (<i>n</i>-gram). To evaluate the suggested solutions we used various IR models, including Okapi, <i>Divergence from Randomness</i> (DFR), a statistical language model (LM), and the classical <i>tf idf</i> vector-space approach. We found that the <i>Divergence from Randomness</i> paradigm tends to yield better retrieval effectiveness than the Okapi, LM or <i>tf idf</i> models; however, the performance differences were statistically significant only with the last two IR approaches. Omitting stemming generally reduces the mean average precision (MAP) by more than 40%, and these differences are always significant. Finally, although our more aggressive stemmer tends to show the best performance, the differences in performance with the light stemmer are not statistically significant.
Publication type
Resource Types::text::journal::journal article