Voici les éléments 1 - 10 sur 20
  • Publication
    Métadonnées seulement
    When stopword lists make the difference
    (2010)
    Dolamic, Ljiljana
    ;
  • Publication
    Accès libre
    Influence of language morphological complexity on information retrieval
    (2010)
    Dolamic, Ljiljana
    ;
    ;
    In this dissertation two aspects of information retrieval are elaborated. The frst involves the creation and evaluation of various linguistic tools for languages less studied than English, and in our case we have chosen to work with the two Slavic languages Czech and Russian, and three languages widely spoken on the Indian subcontinent, Hindi, Marathi and Bengali. To do so we compare various indexing strategies and IR models most likely to obtain the best possible performance. The second part involves an evaluation of the effectiveness of queries written in different languages when searching collections written in either English or French. To cross the language barriers we apply publicly available machine translation services, analyze the results and then explain the poor performances obtained by the translated queries.
  • Publication
    Métadonnées seulement
  • Publication
    Accès libre
    UniNE at CLEF 2008: TEL, and Persian IR
    (2009)
    Dolamic, Ljiljana
    ;
    Abdou, Samir
    ;
    In our participation in this evaluation campaign, our first objective was to analyze retrieval effectiveness when using The European Library (TEL) corpora composed of very short descriptions (library catalog records) and also to evaluate the retrieval effectiveness of several IR models. As a second objective we wanted to design and evaluate a stopword list and a light stemming strategy for the Persian (Farsi), a member of the Indo-European family of languages and whose morphology is more complex than of the English language.
  • Publication
    Accès libre
    When stopword lists make the difference
    (2009)
    Dolamic, Ljiljana
    ;
    In this brief communication, we evaluate the use of two stopword lists for the English language (one comprising 571 words and another with 9) and compare them with a search approach accounting for all word forms. We show that through implementing the original Okapi form or certain ones derived from the Divergence from Randomness (DFR) paradigm, significantly lower performance levels may result when using short or no stopword lists. For other DFR models and a revised Okapi implementation, performance differences between approaches using short or long stopword lists or no list at all are usually not statistically significant. Similar conclusions can be drawn when using other natural languages such as French, Hindi, or Persian.
  • Publication
    Accès libre
    How effective is Google's translation service in search?
    (2009) ;
    Dolamic, Ljiljana
    In multilingual countries (Canada, Hong Kong, India, among others) and large international organizations or companies (such as, WTO, European Parliament), and among Web users in general, accessing information written in other languages has become a real need (news, hotel or airline reservations, or government information, statistics). While some users are bilingual, others can read documents written in another language but cannot formulate a query to search it, or at least cannot provide reliable search terms in a form comparable to those found in the documents being searched. There are also many monolingual users who may want to retrieve documents in another language and then have them translated into their own language, either manually or automatically.
    Translation services may however be too expensive, not readily accessible or not available within a short timeframe. On the other hand, many documents contain non-textual information such as images, videos and statistics that do not need translation and can be understood regardless of the language involved. In response to these needs and in order to make the Web universally available regardless of any language barriers, in May 2007 Google launched a translation service that now provides two-way online translation services mainly between English and 41 other languages, for example, Arabic, simplified and traditional Chinese, French, German, Italian, Japanese, Korean, Portuguese, Russian, and Spanish (http://translate.google.com/). Over the last few years other free Internet translation services have been made available as for example by BabelFish (http://babel.altavista.com/) or Yahoo! (http://babelfish.yahoo.com/). These two systems are similar to that used by Google, given they are based on technology developed by Systran, one of the earliest companies to develop machine translation. Also worth mentioning here is the Promt system (also known as Reverso, http://translation2.paralink.com/), which was developed in Russia to provide mainly translation between Russian and other languages.
    The question we would like to address here is to what extent a translation service such as Google can produce adequate results in the language other than that being used to write the query. Although we will not evaluate translations per se we will test and analyze various systems in terms of their ability to retrieve items automatically based on a translated query. To be adequate, these tests must be done on a collection of documents written in one given language plus a series of topics (expressing user information needs) written in other languages, plus a series of relevance assessments (relevant documents for each topic).
  • Publication
    Métadonnées seulement
  • Publication
    Accès libre
    Indexing and searching strategies for the Russian language
    (2009)
    Dolamic, Ljiljana
    ;
    This paper describes and evaluates various stemming and indexing strategies for the Russian language. We design and evaluate two stemming approaches, a light and a more aggressive one, and compare these stemmers to the Snowball stemmer, to no stemming, and also to a language-independent approach (n-gram). To evaluate the suggested stemming strategies we apply various probabilistic information retrieval (IR) models, including the Okapi, the Divergence from Randomness (DFR), a statistical language model (LM), as well as two vector-space approaches, namely, the classical tf idf scheme and the dtu-dtn model. We find that the vector-space dtu-dtn and the DFR models tend to result in better retrieval effectiveness than the Okapi, LM, or tf idf models, while only the latter two IR approaches result in statistically significant performance differences. Ignoring stemming generally reduces the MAP by more than 50%, and these differences are always significant. When applying an n-gram approach, performance differences are usually lower than an approach involving stemming. Finally, our light stemmer tends to perform best, although performance differences between the light, aggressive, and Snowball stemmers are not statistically significant.
  • Publication
    Accès libre
    Variations autour de tf idf et du moteur Lucene
    (2008) ;
    Dolamic, Ljiljana
    A l'aide d’un corpus écrit en langue française et composé de 299 requêtes, cet article analyse et compare l’efficacité du dépistage de diverses stratégies d’indexation et de recherche basées sur le modèle classique « tf idf ». Cette dernière formulation demeure ambiguë et cache diverses variantes possédant des performances différentes, performance mesurée soit par la précision moyenne (MAP) soit par le rang moyen de la première bonne réponse (MRR). Notre analyse confirme que la meilleure efficacité s’obtient par le modèle Okapi. Mais lorsque nous sommes dans des contextes particuliers (e.g., systèmes distribués) dans lesquels la valeur de l’idf n’est pas connue lors de l’indexation des documents, nous démontrons que des stratégies simples, basées uniquement sur la fréquence d’occurrence (ou tf) permettent d’obtenir une performance significativement meilleure que le modèle classique « tf idf ». En utilisant le moteur Lucene (logiciel libre), nous avons également évalué deux de ses facettes, à savoir l’accroissement d’importance attachée aux mots des titres et la prise en compte du nombre de termes en commun entre le document dépisté et la requête., This paper evaluates and compares the retrieval effectiveness resulting from various models derived from the classical tf idf paradigm when searching into a test-collection written in the French language (CLEF, 299 queries). We show that the simple paradigm “tf idf” may hide various formulations providing different retrieval effectiveness measured either by the mean average precision (MAP) or the mean reciprocal rank (MRR). Our analysis demonstrates that the best retrieval performance can be obtained from applying the Okapi probabilistic model. However, when faced with particular contexts (e.g. distributed IR) where the idf value cannot be obtained during the indexing process, we demonstrated that a simple indexing scheme (based only the frequency of occurrence or tf) may produce a significantly better performance than the classical « tf idf » model. Using the Lucene search engine, we also analyze and evaluate two particular features of this open-source system (namely the boost and coordinate level match).