Savoy, Jacques

Knowledge-intensive business processes, one of the essential drivers of today's economy, often rely on multimodal information retrieval systems that must handle increasingly complex document collections and queries. The complexity stems mainly from the large and diverse range of textual and non-textual modalities used in the collections, such as geographical coordinates, ratings and timestamps. This leads to an explosion of combinations of modalities, which makes it infeasible to find new approaches for each individual modality and to obtain suitable training data. Therefore, one of the major goals of this dissertation is to develop unified models to treat modalities for document retrieval. Further, we aim to develop methods to merge the modalities with little or no training, which is essential for the methods to be applicable in a wide range of applications and application domains.
We base our approach on our experience with several multimodal information retrieval applications and thus with many different modalities. As a first step, we suggest a coarse categorization of modalities into two types, which we further subdivide by their distribution. This categorization is a first attempt to reduce the number of different models. It helps to generalize methods to entire categories of modalities instead of being specific to a single modality.
Since the most popular weighting schemes for textual retrieval have generalized well to many retrieval tasks in the past, we propose to use them as a basis of the unified models for the categories of modalities. We therefore demonstrate as a second step how the three main components of the so-called BM25 weighting scheme (term frequency, document frequency and document length normalization) have to be redefined to be used with several non-textual modalities.
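For orientation, the three BM25 components named above can be seen in a minimal sketch of the standard textual scheme. This is not the dissertation's redefinition for non-textual modalities; the function name `bm25_scores`, the parameter defaults `k1=1.2` and `b=0.75`, and the non-negative `log(1 + ...)` form of the document-frequency component are assumptions chosen for illustration.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document in `docs` against `query_terms`.

    Illustrative textual BM25 only; parameter values are assumed defaults.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)  # term frequency within this document
        # Document length normalization relative to the average length.
        norm = 1 - b + b * len(d) / avg_len
        score = 0.0
        for t in query_terms:
            if tf[t] == 0:
                continue
            idf = math.log(1 + (n_docs - df[t] + 0.5) / (df[t] + 0.5))
            score += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * norm)
        scores.append(score)
    return scores
```

Redefining the scheme for a non-textual modality would mean replacing `tf`, `df` and the length normalization with modality-specific counterparts, which the abstract describes but does not spell out.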
As a third step towards establishing clear guidance for the integration of many modalities into an information retrieval system, we demonstrate that BM25 is a suitable weighting scheme to merge modalities under the so-called raw-score merging hypothesis. We achieve this with the help of a sampling-based approach, which we use as a basis to prove that BM25 satisfies the assumptions of the raw-score merging hypothesis with respect to the average document length and the variance of document lengths.
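A minimal sketch of what merging on raw scores could look like, assuming that merging means adding each document's unnormalized per-modality scores and ranking by the sum. The function name `raw_score_merge` and the input layout (a dictionary mapping a modality name to per-document scores) are hypothetical and not taken from the dissertation.

```python
from collections import defaultdict


def raw_score_merge(modality_scores):
    """Sum unnormalized per-modality scores per document and rank them.

    `modality_scores` maps a modality name to {doc_id: score}.
    Documents missing from a modality contribute 0 for that modality.
    """
    merged = defaultdict(float)
    for scores in modality_scores.values():
        for doc_id, score in scores.items():
            merged[doc_id] += score
    # Highest merged score first; ties broken by document id.
    return sorted(merged.items(), key=lambda kv: (-kv[1], kv[0]))
```

No weights or score normalization appear here, which is the point of the hypothesis: the per-modality scores are taken to be directly comparable, so no training data is needed to combine them.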
Using our redefinition of BM25 for several non-textual modalities together with textual modalities, we finally build multimodal baselines and test them in evaluation campaigns as well as in operational information retrieval systems. We show that our untrained multimodal baselines reach significantly better retrieval effectiveness than the textual baseline and, in some cases, even achieve performance similar to that of a trained linear combination of the modality scores.

For our third participation in the CLEF evaluation campaign, our objective for both multilingual tracks is to propose a new merging strategy that does not require a training sample to access the multilingual collection. As a second objective, we want to verify whether our combined query translation approach would work well with new requests.

For our third participation in the CLEF evaluation campaign, our first objective was to propose more effective and general stop-word lists for the Swedish, Finnish and Russian languages, along with an improved, more efficient and simpler stemming procedure for these three languages. Our second goal was to suggest a combined search approach based on a data fusion strategy that would work with various European languages. Included in this combined approach is a decompounding strategy for the German, Dutch, Swedish and Finnish languages.

In our second participation in the CLEF retrieval tasks, our first objective was to propose better and more general stopword lists for various European languages (namely, French, Italian, German, Spanish and Finnish) along with improved, simpler and more efficient stemming procedures. Our second goal was to propose a combined query-translation approach that could cross language barriers, together with an effective merging strategy based on logistic regression for accessing the multilingual collection. Finally, within the Amaryllis experiment, we wanted to analyze how a specialized thesaurus might improve retrieval effectiveness.
