Publication
Metadata only

When stopword lists make the difference

2010, Dolamic, Ljiljana, Savoy, Jacques

Publication
Open access

Influence of language morphological complexity on information retrieval

2010, Dolamic, Ljiljana, Savoy, Jacques, Kropf, Peter

This dissertation elaborates on two aspects of information retrieval. The first involves the creation and evaluation of various linguistic tools for languages less studied than English; in our case we have chosen to work with the two Slavic languages Czech and Russian, and with three languages widely spoken on the Indian subcontinent: Hindi, Marathi and Bengali. To do so we compare various indexing strategies and IR models most likely to obtain the best possible performance. The second part involves an evaluation of the effectiveness of queries written in different languages when searching collections written in either English or French. To cross the language barriers we apply publicly available machine translation services, analyze the results and then explain the poor performances obtained by the translated queries.

Publication
Open access

Indexing and searching strategies for the Russian language

2009, Dolamic, Ljiljana, Savoy, Jacques

This paper describes and evaluates various stemming and indexing strategies for the Russian language. We design and evaluate two stemming approaches, a light and a more aggressive one, and compare these stemmers to the Snowball stemmer, to no stemming, and also to a language-independent approach (n-gram). To evaluate the suggested stemming strategies we apply various probabilistic information retrieval (IR) models, including the Okapi, the Divergence from Randomness (DFR), a statistical language model (LM), as well as two vector-space approaches, namely, the classical tf-idf scheme and the dtu-dtn model. We find that the vector-space dtu-dtn and the DFR models tend to result in better retrieval effectiveness than the Okapi, LM, or tf-idf models, while only the latter two IR approaches result in statistically significant performance differences. Ignoring stemming generally reduces the MAP by more than 50%, and these differences are always significant. When applying an n-gram approach, performance is usually lower than with an approach involving stemming. Finally, our light stemmer tends to perform best, although performance differences between the light, aggressive, and Snowball stemmers are not statistically significant.
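As a rough illustration of two notions used in this abstract, the sketch below shows one common way to split a word into overlapping character n-grams (a language-independent indexing unit) and the standard non-interpolated definition of mean average precision (MAP). The function names, the n-gram length of 4, and the sample document identifiers are assumptions made for this example; they are not taken from the paper.

```python
def char_ngrams(token, n=4):
    """Split a token into overlapping character n-grams.

    Tokens no longer than n are kept whole. The default n=4 is an
    assumption for illustration, not the setting used in the paper.
    """
    if len(token) <= n:
        return [token]
    return [token[i:i + n] for i in range(len(token) - n + 1)]


def average_precision(ranked_ids, relevant_ids):
    """Non-interpolated average precision for one ranked result list."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            # Precision at the rank of each relevant document retrieved.
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(runs):
    """Mean of average precision over (ranked_ids, relevant_ids) pairs."""
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


if __name__ == "__main__":
    # Hypothetical ranked lists and relevance judgments.
    runs = [
        (["d1", "d2", "d3", "d4"], {"d1", "d3"}),
        (["d5", "d6"], {"d6"}),
    ]
    print(char_ngrams("поиск"))
    print(mean_average_precision(runs))
```

Comparing the MAP of two indexing strategies over the same set of queries is how statements such as "reduces the MAP by more than 50%" are usually quantified; significance testing is a separate step not shown here.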

Publication
Metadata only

UniNE at CLEF 2008: TEL, Persian and Robust IR

2008, Dolamic, Ljiljana, Fautsch, Claire, Savoy, Jacques

Publication
Metadata only

Comparative study of indexing and search strategies for the Hindi, Marathi, and Bengali languages

2010, Dolamic, Ljiljana, Savoy, Jacques

Publication
Metadata only

Indexing and stemming approaches for the Czech language

2009, Dolamic, Ljiljana, Savoy, Jacques

Publication
Open access

How effective is Google's translation service in search?

2009, Savoy, Jacques, Dolamic, Ljiljana

In multilingual countries (Canada, Hong Kong, India, among others), in large international organizations or companies (such as the WTO or the European Parliament), and among Web users in general, accessing information written in other languages has become a real need (news, hotel or airline reservations, government information, statistics). While some users are bilingual, others can read documents written in another language but cannot formulate a query to search it, or at least cannot provide reliable search terms in a form comparable to those found in the documents being searched. There are also many monolingual users who may want to retrieve documents in another language and then have them translated into their own language, either manually or automatically.
Translation services may, however, be too expensive, not readily accessible or not available within a short timeframe. On the other hand, many documents contain non-textual information such as images, videos and statistics that do not need translation and can be understood regardless of the language involved. In response to these needs and in order to make the Web universally available regardless of any language barriers, in May 2007 Google launched a translation service that now provides two-way online translation, mainly between English and 41 other languages, for example Arabic, simplified and traditional Chinese, French, German, Italian, Japanese, Korean, Portuguese, Russian, and Spanish (http://translate.google.com/). Over the last few years other free Internet translation services have been made available, for example BabelFish (http://babel.altavista.com/) and Yahoo! (http://babelfish.yahoo.com/). These two systems are similar to that used by Google, given they are based on technology developed by Systran, one of the earliest companies to develop machine translation. Also worth mentioning here is the Promt system (also known as Reverso, http://translation2.paralink.com/), which was developed in Russia mainly to provide translation between Russian and other languages.
The question we would like to address here is to what extent a translation service such as Google's can produce adequate results when the query is written in a language other than that of the documents being searched. Although we will not evaluate translations per se, we will test and analyze various systems in terms of their ability to retrieve items automatically based on a translated query. To be adequate, these tests must be done on a collection of documents written in one given language, plus a series of topics (expressing user information needs) written in other languages, plus a series of relevance assessments (relevant documents for each topic).

Publication
Metadata only

Retrieval effectiveness of machine translated queries

2010, Dolamic, Ljiljana, Savoy, Jacques

Publication
Open access

UniNE at CLEF 2008: TEL, and Persian IR

2009, Dolamic, Ljiljana, Abdou, Samir, Savoy, Jacques

In our participation in this evaluation campaign, our first objective was to analyze retrieval effectiveness when using The European Library (TEL) corpora, composed of very short descriptions (library catalog records), and also to evaluate the retrieval effectiveness of several IR models. As a second objective we wanted to design and evaluate a stopword list and a light stemming strategy for the Persian (Farsi) language, a member of the Indo-European family of languages whose morphology is more complex than that of the English language.

Publication
Open access

When stopword lists make the difference

2009, Dolamic, Ljiljana, Savoy, Jacques

In this brief communication, we evaluate the use of two stopword lists for the English language (one comprising 571 words and another with 9) and compare them with a search approach accounting for all word forms. We show that through implementing the original Okapi form or certain ones derived from the Divergence from Randomness (DFR) paradigm, significantly lower performance levels may result when using short or no stopword lists. For other DFR models and a revised Okapi implementation, performance differences between approaches using short or long stopword lists or no list at all are usually not statistically significant. Similar conclusions can be drawn when using other natural languages such as French, Hindi, or Persian.
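To make the three indexing variants compared in this abstract concrete (long stopword list, short stopword list, no list), here is a minimal sketch of stopword filtering at indexing time. The tokenizer, the function names, and the contents of `ILLUSTRATIVE_STOPLIST` are assumptions for this example; the list shown is hypothetical and is not the 9-word or 571-word list evaluated by the authors.

```python
import re

# Hypothetical short stopword list, for illustration only.
ILLUSTRATIVE_STOPLIST = frozenset({"a", "an", "and", "in", "of", "the", "to"})


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def index_terms(text, stopwords=frozenset()):
    """Return the indexing terms of a text, dropping any stopwords.

    With the default empty set, every word form is kept, which
    corresponds to the "no stopword list" setting.
    """
    return [t for t in tokenize(text) if t not in stopwords]


if __name__ == "__main__":
    title = "The influence of stemming in retrieval"
    print(index_terms(title))                         # all word forms kept
    print(index_terms(title, ILLUSTRATIVE_STOPLIST))  # stopwords removed
```

Swapping the `stopwords` argument is enough to index the same collection under each setting, so that a retrieval model can then be evaluated on all of them.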