Voici les éléments 1 - 2 sur 2
  • Publication
    Accès libre
    Recherche d’information dans un corpus bruité (OCR)
    ; ;
    Dolamic, Ljiljana
    Cet article désire mesurer la perte de performance lors de la recherche d'information dans une collection de documents scannés. Disposant d'un corpus sans erreur et de deux versions renfermant 5 % et 20 % d'erreurs en reconnaissance, nous avons évalué six modèles de recherche d'information basés sur trois représentations des documents (sac de mots, n-grammes, ou trunc-n) et trois enracineurs. Basé sur l'inverse du rang du premier document pertinent dépisté, nous démontrons que la perte de performance se situe aux environs de - 17 % avec un taux d'erreur en reconnaissance de 5 % et s'élève à – 46 % si ce taux grimpe à 20 %. La représentation par 4-grammes semble apporter une meilleure qualité de réponse avec un corpus bruité. Concernant l'emploi ou non d'un enracineur léger ou la pseudo-rétroaction positive, aucune conclusion définitive ne peut être tirée., This paper evaluates the retrieval effectiveness degradation when facing with noisy text corpus. With the use of a test-collection having the clean text, another version with around 5% error rate in recognition and a third with 20% error rate, we have evaluated six IR models based on three text representations (bag-of-words, n-grams, trunc-n) as well as three stemmers. Using the mean reciprocal rank as performance measure, we show that the average retrieval effectiveness degradation is around -17% when dealing with an error rate of 5%. This average decrease is around -46% when facing with an error rate of 20%. The representation by 4-grams tends to offer the best retrieval when searching with noisy text. Finally, we are not able to obtain clear conclusion about the impact of different stemming strategies or the use of blind-query expansion.
  • Publication
    Accès libre
    Indexing and stemming approaches for the Czech language
    Dolamic, Ljiljana
    ;
    This paper describes and evaluates various stemming and indexing strategies for the Czech language. Based on Czech test-collection, we have designed and evaluated two stemming approaches, a light and a more aggressive one. We have compared them with a no stemming scheme as well as a language-independent approach (n-gram). To evaluate the suggested solutions we used various IR models, including Okapi, Divergence from Randomness (DFR), a statistical language model (LM) as well as the classical tf idf vector-space approach. We found that the Divergence from Randomness paradigm tend to propose better retrieval effectiveness than the Okapi, LM or tf idf models, the performance differences were however statistically significant only with the last two IR approaches. Ignoring the stemming reduces generally the MAP by more than 40%, and these differences are always significant. Finally, if our more aggressive stemmer tends to show the best performance, the differences in performance with a light stemmer are not statistically significant.