Multilingual and Contextual Information Retrieval
Project title
Multilingual and Contextual Information Retrieval
Description
This research proposal has three main objectives. First, we want to design, implement and evaluate information retrieval (IR) systems for various East European languages (non-English monolingual IR). More specifically, in this part we design and evaluate linguistic tools for new and less frequently spoken languages, such as Hungarian, Polish, Czech and Turkish. In this part we also translate short queries from one language into another (most likely English, the lingua franca) before accessing information written in the various other languages.
Second, we undertake a more elaborate investigation of contextual IR systems used to retrieve information in a specific domain (e.g., biomedicine, law, enterprise, weblogs), rather than evaluating IR systems on newspaper test collections. In this part of our project we investigate the most appropriate response to user information needs. These range from “classical” document searches to newer requests such as known-item searches (“where is the last e-mail sent to Paul?”), the pros and cons of a given argument, or searches for an expert in a given domain based on e-mails or other enterprise intranet document repositories. Specific user requirements could also be taken into account by identifying document length (from a short bibliographic notice to a long novel), the level of information needed (whole document, paragraph, single sentence or short summary), and the degree of editorial control (from newspaper articles to e-mails or weblogs). In this second part we also investigate and evaluate the impact of orthographic and vocabulary variations, as well as the influence of extra-document information (e.g., document contexts, temporal information, links between documents within web or legal corpora).
Third, we integrate the above two research objectives into a common task in order to perform searches in a multilingual collection, starting with relatively well-edited web pages (e.g., information published by European governments, as collected in the EuroGOV corpus) and extending to less structured and less “polished” sources, such as weblogs written in at least three different languages or enterprise e-mails.
Principal investigator
Status
Completed
Start date
1 January 2007
End date
31 March 2010
Organizations
Internal identifier
32695
Identifier