Publication
Open access

Information retrieval of digitized medieval manuscripts

2013, Naji, Nada, Savoy, Jacques

This dissertation investigates the retrieval of noisy texts in general and of digitized historical manuscripts in particular. The noise originates from several sources, including imperfect text recognition (6% word error rate), spelling variation, and non-standardized grammar, as well as user-side confusion arising from limited knowledge of the underlying language or of the searched text. Manual correction and normalization are very time-consuming and resource-demanding, and are thus out of the question. Furthermore, external resources such as thesauri are not available for older, lesser-known languages. In this dissertation, we present our contributions to overcoming, or at least coping with, these issues. We developed several methods that provide a low-cost yet highly effective text representation to limit the negative impact of recognition errors and of variable orthography and morphology. Finally, to account for the user-confusion problem, we developed a low-cost query enrichment function, which we deem indispensable for the challenging task of one-word queries.

Publication
Open access

Recherche d’information dans un corpus bruité (OCR)

Naji, Nada, Savoy, Jacques, Dolamic, Ljiljana

This paper evaluates the degradation in retrieval effectiveness when searching a noisy text corpus of scanned documents. Using a test collection with a clean version of the text, a second version with a recognition error rate of around 5%, and a third with an error rate of 20%, we evaluated six IR models based on three text representations (bag-of-words, n-grams, trunc-n) as well as three stemmers. Using the mean reciprocal rank as the performance measure, we show that the average degradation in retrieval effectiveness is around -17% with an error rate of 5% and around -46% with an error rate of 20%. The 4-gram representation tends to offer the best retrieval quality when searching noisy text. Finally, we cannot draw clear conclusions about the impact of the different stemming strategies or of blind query expansion.
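To make the terms in this abstract concrete, here is a minimal sketch of the n-gram and trunc-n word representations and of the mean reciprocal rank. The function names, the handling of words shorter than n, and the sample numbers in the usage note are illustrative assumptions, not details taken from the paper.

```python
from typing import List, Optional


def char_ngrams(word: str, n: int = 4) -> List[str]:
    """Overlapping character n-grams of a word.

    Assumption: words of length n or less are kept whole.
    """
    if len(word) <= n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def trunc_n(word: str, n: int = 4) -> str:
    """trunc-n: keep only the first n characters of the word."""
    return word[:n]


def mean_reciprocal_rank(first_relevant_ranks: List[Optional[int]]) -> float:
    """Average of 1/rank of the first relevant document per query.

    Ranks are 1-based; None means no relevant document was retrieved,
    which contributes 0 to the sum (an assumption of this sketch).
    """
    if not first_relevant_ranks:
        return 0.0
    total = sum(1.0 / r for r in first_relevant_ranks if r is not None)
    return total / len(first_relevant_ranks)


def relative_change(baseline: float, observed: float) -> float:
    """Relative change in percent with respect to the clean-text baseline."""
    return (observed - baseline) / baseline * 100.0
```

For example, `char_ngrams("retrieval")` yields the six 4-grams of the word, while `trunc_n("retrieval")` yields only `"retr"`. With hypothetical scores, an MRR of 0.5 on clean text that drops to 0.415 on noisy text corresponds to a relative change of about -17%.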

Publication
Open access

Etude comparative de l’efficacité du dépistage de l’information dans des manuscrits médiévaux

Naji, Nada, Savoy, Jacques

This paper presents, evaluates and compares the effectiveness of information retrieval (IR) for medieval manuscripts in the presence of noisy text. The corpus used in our experiments is based on a well-known medieval epic poem written in Middle High German and dating to the thirteenth century (Parzival). An error-free transcription of the poem was created manually and made available by experts. This error-free transcription serves as the baseline against which we assess performance levels. In practice, document noise can come from different sources (e.g., spelling variations due to non-normalized medieval text, recognition mistakes). To overcome these difficulties, we suggest several query expansion strategies that allow some spelling variation between the requests and the searchable items. To analyze performance under several conditions, we evaluated five IR models, three forms of stemming and three text representations. We show that incorporating as many spelling variants as possible in the query expansion process does not produce the best results, while a more conservative selection of expansion terms yields better performance levels.
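The contrast between maximal and conservative query expansion can be illustrated with a small sketch. The abstract does not describe the expansion strategies themselves, so everything here is an assumption: the similarity measure (the ratio used by Python's standard `difflib`), the cutoff, the cap on added terms, and the toy vocabulary.

```python
from difflib import get_close_matches
from typing import List


def expand_query(term: str, vocabulary: List[str],
                 max_terms: int = 2, cutoff: float = 0.8) -> List[str]:
    """Return the query term followed by its closest spelling variants.

    A high cutoff and a small max_terms give a conservative expansion;
    lowering the cutoff and raising max_terms admits more variants.
    """
    candidates = [w for w in vocabulary if w != term]
    variants = get_close_matches(term, candidates, n=max_terms, cutoff=cutoff)
    return [term] + variants


# Toy vocabulary, for illustration only.
vocabulary = ["parzival", "parzifal", "parcifal", "gral", "artus"]

conservative = expand_query("parzival", vocabulary, max_terms=2, cutoff=0.8)
permissive = expand_query("parzival", vocabulary, max_terms=5, cutoff=0.7)
```

With these settings the conservative expansion adds only the closest variant, while the permissive one also admits a more distant spelling. Which setting retrieves better results has to be measured on the collection at hand.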

Publication
Open access

Information Retrieval Strategies for Digitized Handwritten Medieval Documents

Naji, Nada, Savoy, Jacques

This paper describes and evaluates different IR models and search strategies for digitized manuscripts. Written during the thirteenth century, these manuscripts were digitized using an imperfect recognition system with a word error rate of around 6%. Having access to the internal representation during the recognition stage, we were able to produce four automatic transcriptions, each introducing some form of spelling correction in an attempt to improve retrieval effectiveness. We evaluated the retrieval effectiveness of each of these versions using three text representations combined with five IR models, three stemming strategies and two query formulations. We used a manually transcribed, error-free version to define the ground truth. Based on our experiments, we conclude that neither keeping only the single best recognized word nor keeping all top-k recognition alternatives provides the best performance. Selecting all candidate words whose log-likelihood is close to that of the best alternative yields the best text surrogate. Within this representation, different retrieval strategies tend to produce similar performance levels.
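The selection rule in the last sentences can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the abstract does not say how close to the best log-likelihood a word must be, so the `margin` parameter, the function name and the sample scores are hypothetical.

```python
from typing import List, Tuple


def select_alternatives(candidates: List[Tuple[str, float]],
                        margin: float) -> List[str]:
    """Keep every recognized word whose log-likelihood lies within
    `margin` of the best candidate's log-likelihood.

    candidates: (word, log_likelihood) pairs for one word position.
    margin:     hypothetical threshold, not a value from the paper.
    """
    if not candidates:
        return []
    best = max(ll for _, ll in candidates)
    # Order the candidates from most to least likely before filtering.
    ranked = sorted(candidates, key=lambda pair: pair[1], reverse=True)
    return [word for word, ll in ranked if best - ll <= margin]


# Made-up scores for a single word position.
position = [("gral", -1.0), ("gras", -1.5), ("grab", -4.0)]
kept = select_alternatives(position, margin=1.0)
```

A margin of 0 reduces to keeping only the single best word, and a very large margin keeps every alternative, so the rule sits between the two strategies that the abstract reports as less effective.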