Multilingual and Domain-Specific Information Retrieval

Project title

Description

In information retrieval (IR), English has been studied for many years, and various linguistic tools have been proposed and evaluated for it. This research proposal has three main objectives. The first is to design, implement and evaluate IR systems that work with various non-English languages (monolingual IR). More specifically, we want to begin with languages that are less frequently used and new from an IR perspective, such as Persian, Hindi, Marathi and other Indian languages. This set of languages covers various branches of the Indo-European family, but we also tackle Turkic languages (Turkish) and Dravidian languages (Tamil, Telugu) in order to provide a basis of comparison for our tests. Translating the expression of a user's information need is clearly less expensive than translating the entire corpus into a common language. Thus, as a second objective, we want to design, evaluate and improve the translation procedures used in query formulation (with English being one of the languages in each pair).
As a third and most important objective, we want to continue investigating domain-specific IR systems used to retrieve information in a given field of knowledge (e.g., intellectual property (IP) or patents), instead of evaluating IR systems on newspaper test collections. We thus want to investigate how retrieval effectiveness can be improved when only a specific domain is considered. In this case, we may make use of a specialized thesaurus (e.g., as in the GIRT track at CLEF). We also want to improve search quality by analyzing the general document structure (e.g., a patent is usually divided into an abstract, a disclosure section, claims, drawings and references, and each section has a different importance from an IR point of view).
We also want to investigate and evaluate the impact of orthographic and vocabulary variations, both within a given language (e.g., Telugu) and across languages (e.g., variations in proper names). Finally, in our efforts to further enhance retrieval effectiveness, extra-document information (e.g., document contexts, tables and references inside a patent, links between documents) may also be analyzed. Moreover, we would like to propose an IR system that can carry out its computations automatically using publicly available resources, which would exclude retrieval systems that require extensive manual work.

Principal investigator

Savoy, Jacques

Akasereh, Mitra

Status

Completed

Start date

1 March 2011

End date

28 February 2014

Organizations

Institut d'informatique

Internal identifier

15034

Identifier

https://libra.unine.ch/handle/123456789/1447

Keywords
