Multilingual and Domain-Specific Information Retrieval
Project leader Jacques Savoy
Collaborator Mitra Akasereh
Abstract In information retrieval (IR), the English language has been studied for many years, and various linguistic tools have been suggested and evaluated for it. This research proposal targets three main objectives. Our first objective is to design, implement and evaluate IR systems that work with various non-English languages (monolingual IR). More specifically, we want to begin with less frequently used languages that are also new from an IR perspective, such as Persian, Hindi, Marathi and other Indian languages. This set of languages covers various branches of the Indo-European family, but we also tackle a Turkic language (Turkish) as well as Dravidian languages (Tamil, Telugu), in order to provide a basis of comparison for our tests. Translating the expression of a user's information need is clearly less expensive than translating the entire corpus into a common language. Thus, as a second objective, we want to design, evaluate and improve the translation procedures used in query formulation (with one language of each pair being English). As a third and most important objective, we want to continue investigating domain-specific IR systems used to retrieve information in a given field of knowledge (e.g., intellectual property (IP) or patents), instead of evaluating IR systems on newspaper test collections. We thus want to investigate how retrieval effectiveness can be improved when considering only a specific domain. In this case, we may make use of a specialized thesaurus (e.g., as in the GIRT track at CLEF). We also want to improve search quality by analyzing general document structure (e.g., a patent is usually divided into an abstract, a disclosure section, claims, drawings and references, each section having, from an IR point of view, varying importance).
We also want to investigate and evaluate the impact of orthographic and vocabulary variations, both within a given language (e.g., Telugu) and across languages (e.g., proper name variations). Finally, in our efforts to further enhance retrieval effectiveness, extra-document information (e.g., document contexts, tables and references inside a patent, links between documents) may also be analyzed. Moreover, we would like to propose an IR system capable of carrying out its computations automatically using publicly available resources. In doing so, we could exclude any retrieval system requiring extensive manual work.
Keywords Information retrieval (IR), multilingual IR (MLIR), cross-lingual IR (CLIR), domain-specific IR, contextual retrieval, digital library
Project type Basic research
Research field Computer science
Funding source SNSF - Project funding (Div. I-III)
Status Completed
Project start 1 March 2011
Project end 28 February 2014
Allocated budget 171'510.00
Further information http://p3.snf.ch/projects-129535#
Contact Jacques Savoy