Multilingual and Domain-Specific Information Retrieval
Project title
Multilingual and Domain-Specific Information Retrieval
Description
In information retrieval (IR), the English language has been studied for many years, and various linguistic tools have been proposed and evaluated for it. This research proposal targets three main objectives. Our first objective is to design, implement and evaluate IR systems that work with various non-English languages (monolingual IR). More specifically, we want to begin with less frequently used languages (which are also new from an IR perspective), such as Persian, Hindi, Marathi and other Indian languages. These languages cover various branches of the Indo-European family, but we also tackle a Turkic language (Turkish) and Dravidian languages (Tamil, Telugu) in order to provide a basis of comparison for our tests. Translating the expression of a user need (the query) is clearly less expensive than translating the entire corpus into a common language. Thus, as a second objective, we want to design, evaluate and improve translation procedures used in query formulation (with English as one language of each pair). Our third and most important objective is to continue investigating domain-specific IR systems used to retrieve information in a given field of knowledge (e.g., intellectual property (IP) or patents), instead of evaluating IR systems on newspaper test collections. We thus want to investigate how retrieval effectiveness can be improved when only a specific domain is considered. In this case, we may make use of a specialized thesaurus (e.g., as in the GIRT track at CLEF). We also want to improve search quality by analyzing general document structure (e.g., a patent is usually divided into an abstract, a disclosure section, claims, drawings and references, and each section has a different importance from an IR point of view).
We also want to investigate and evaluate the impact of orthographic and vocabulary variations, both within a given language (e.g., Telugu) and, for proper names, between different languages. Finally, in our efforts to further enhance retrieval effectiveness, extra-document information (e.g., document contexts, tables and references inside a patent, links between documents) may also be analyzed. Moreover, we would like to propose an IR system capable of carrying out its computations automatically using publicly available resources. In doing so, we would exclude any retrieval system requiring extensive manual work.
Principal investigator
Akasereh, Mitra
Status
Completed
Start date
1 March 2011
End date
28 February 2014
Organisations
Internal identifier
15034
Identifier
Keywords