Selecting optimal split-functions for large datasets
Author(s)
Raileanu, Laura Elena
Editor(s)
Bramer, Max
Preece, Alun
Coenen, Frans
Publisher
Godalming: Springer-Verlag London Ltd
Publication date
2001
In
Research and Development in Intelligent Systems XVII
From page
62
To page
72
Series
Research and Development in Intelligent Systems XVII
Abstract
Decision tree induction has become one of the most popular methods for classification and prediction. The key step in inferring decision trees is finding the right criterion for splitting the training set into smaller and smaller subsets so that, ideally, all elements of a subset finally belong to one class. These split criteria can be defined in different ways (e.g. minimizing the impurity of a subset, or minimizing the entropy of a subset), and they therefore emphasize different properties of the inferred tree, such as size or classification accuracy. In this paper we analyze whether the split functions introduced in a statistical and machine-learning context are also well suited for a KDD context. We selected two well-known split functions, namely the Gini Index (CART) and Information Gain (C4.5), introduced our own family of split functions, and tested them on 9,000 data sets of different sizes (from 200 to 20,000 tuples). The tests have shown that the two popular functions are very sensitive to variation in the training set size, and therefore the quality of the inferred trees depends strongly on the training set size. At the same time, however, we were able to show that the simplest members of the introduced family of split functions behave in a very predictable way and, furthermore, that the trees they create are superior, according to our evaluation criteria, to the trees inferred using the Gini Index or the Information Gain.
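The abstract names the two standard split criteria without defining them. As general background (these are textbook definitions, not taken from the record itself, and the notation t, k, p_i, s, N, N_j is introduced here purely for illustration), for a node t whose k classes occur in proportions p_1, ..., p_k they are conventionally written as

\mathrm{Gini}(t) = 1 - \sum_{i=1}^{k} p_i^2
H(t) = -\sum_{i=1}^{k} p_i \log_2 p_i
\mathrm{Gain}(t, s) = H(t) - \sum_{j} \frac{N_j}{N}\, H(t_j)

where a candidate split s partitions the N tuples at node t into subsets of sizes N_j. Both criteria favor splits whose resulting subsets are purer, matching the abstract's framing of minimizing impurity (Gini, as in CART) or entropy (Information Gain, as in C4.5); the paper's own family of split functions is not specified in the abstract.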
Identifiers
Publication type
book part