Selecting optimal split-functions for large datasets
Author(s)
Raileanu, Laura Elena
Editor(s)
Bramer, Max
Preece, Alun
Coenen, Frans
Publisher
Godalming: Springer-Verlag London Ltd
Publication date
2001
In
Research and Development in Intelligent Systems XVII
From page
62
To page
72
Series
Research and Development in Intelligent Systems XVII
Abstract
Decision tree induction has become one of the most popular methods for classification and prediction. The key step in inferring decision trees is finding the right criterion for splitting the training set into smaller and smaller subsets so that, ideally, all elements of a subset finally belong to one class. These split criteria can be defined in different ways (e.g. minimizing the impurity of a subset, or minimizing the entropy of a subset), and they therefore emphasize different properties of the inferred tree, such as size or classification accuracy. In this paper we analyze whether the split functions introduced in a statistical and machine-learning context are also well suited for a KDD context. We selected two well-known split functions, namely the Gini Index (CART) and Information Gain (C4.5), introduced our own family of split functions, and tested them on 9,000 data sets of different sizes (from 200 to 20,000 tuples). The tests have shown that the two popular functions are very sensitive to variation in the training set size, and therefore the quality of the inferred trees depends strongly on the training set size. At the same time, however, we were able to show that the simplest members of the introduced family of split functions behave in a very predictable way and, furthermore, that the trees they create are superior, according to our evaluation criteria, to the trees inferred using the Gini Index or the Information Gain.
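The abstract names the two standard split criteria without defining them. As general background (these are textbook definitions, not taken from the record itself, and the notation t, k, p_i, s, N, N_j is introduced here purely for illustration), for a node t whose k classes occur in proportions p_1, ..., p_k they are conventionally written as

\mathrm{Gini}(t) = 1 - \sum_{i=1}^{k} p_i^2
H(t) = -\sum_{i=1}^{k} p_i \log_2 p_i
\mathrm{Gain}(t, s) = H(t) - \sum_{j} \frac{N_j}{N}\, H(t_j)

where a candidate split s partitions the N tuples at node t into subsets of sizes N_j. Both criteria favor splits whose resulting subsets are purer, matching the abstract's framing of minimizing impurity (Gini, as in CART) or entropy (Information Gain, as in C4.5); the paper's own family of split functions is not specified in the abstract.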
Identifiers
Publication type
book part