Showing items 1 - 2 of 2
  • Publication
    Metadata only
    An advanced stemming algorithm for creating concept signatures of medical terms
    (Godalming: Springer-Verlag London Ltd, 2002)
    Kurz, Thorsten; Bramer, Max; Coenen, Frans; Preece, Alun
    We present a stemming algorithm that not only removes word endings but also separates prefixes and suffixes from the remaining stem. The output of this algorithm yields more precise concept signatures for indexing and classifying documents. The algorithm has been successfully tested with prefix and suffix lists for medical terms.
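The prefix/suffix separation described in the abstract can be sketched roughly as follows; note that the affix lists and function name here are illustrative stand-ins, not the paper's actual medical-term lists or implementation:

```python
# Minimal sketch of affix stripping to build a concept signature.
# The PREFIXES/SUFFIXES lists below are hypothetical examples,
# not the medical-term lists used in the paper.
PREFIXES = ["hyper", "hypo", "cardio"]
SUFFIXES = ["itis", "ology", "emia"]

def concept_signature(word):
    """Split a word into (prefix, stem, suffix), any part possibly
    empty, using longest-match affix stripping."""
    word = word.lower()
    prefix = ""
    for p in sorted(PREFIXES, key=len, reverse=True):
        # Only strip when a non-empty stem remains.
        if word.startswith(p) and len(word) > len(p):
            prefix, word = p, word[len(p):]
            break
    suffix = ""
    for s in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(s) and len(word) > len(s):
            suffix, word = s, word[:-len(s)]
            break
    return (prefix, word, suffix)

print(concept_signature("gastritis"))  # ('', 'gastr', 'itis')
```

Indexing on the resulting (prefix, stem, suffix) triple rather than on the raw word is what makes two terms sharing a stem fall into the same concept signature.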
  • Publication
    Metadata only
    Selecting optimal split-functions for large datasets
    (Godalming: Springer-Verlag London Ltd, 2001)
    Raileanu, Laura Elena; Bramer, Max; Preece, Alun; Coenen, Frans
    Decision tree induction has become one of the most popular methods for classification and prediction. The key step in inferring decision trees is finding the right criteria for splitting the training set into smaller and smaller subsets so that, ideally, all elements of a subset finally belong to one class. These split criteria can be defined in different ways (e.g. minimizing the impurity of a subset, or minimizing the entropy in a subset), and therefore they emphasize different properties of the inferred tree, such as size or classification accuracy. In this paper we analyze whether the split functions introduced in a statistical and machine-learning context are also well suited for a KDD context. We selected two well-known split functions, namely the Gini Index (CART) and Information Gain (C4.5), introduced our own family of split functions, and tested them on 9,000 data sets of different sizes (from 200 to 20,000 tuples). The tests have shown that the two popular functions are very sensitive to variation in training-set size, and therefore the quality of the inferred trees is highly dependent on the training-set size. At the same time, however, we were able to show that the simplest members of the introduced family of split functions behave in a very predictable way and, furthermore, that by our evaluation criteria the created trees were superior to the trees inferred using the Gini Index or Information Gain.
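The two standard split functions the abstract names can be sketched as follows; this is the generic textbook formulation of Gini impurity and entropy-based information gain, not the paper's own family of split functions:

```python
import math
from collections import Counter

def gini(labels):
    """Gini impurity of a label multiset: 1 - sum_c p_c^2."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Shannon entropy in bits: -sum_c p_c * log2(p_c)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def split_gain(parent, subsets, impurity):
    """Impurity decrease from splitting `parent` into `subsets`.
    With impurity=entropy this is Information Gain (C4.5);
    with impurity=gini it is the Gini-based criterion (CART)."""
    n = len(parent)
    return impurity(parent) - sum(len(s) / n * impurity(s) for s in subsets)

# A perfectly class-separating split of a balanced two-class set
# achieves the maximum gain for either criterion.
parent = ["a", "a", "b", "b"]
split = [["a", "a"], ["b", "b"]]
print(split_gain(parent, split, entropy))  # 1.0
print(split_gain(parent, split, gini))     # 0.5
```

Tree induction evaluates such a gain for every candidate split and greedily picks the maximum; the abstract's point is that which split wins can depend strongly on training-set size when these two criteria are used.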