Formalization and Comparison of Split Criteria for Decision Trees
Responsable du projet Kilian Stoffel
Collaborateur Laura Raileanu
Résumé This project is about decision trees in the context of large data sets. To achieve more fundamental insights on how to select split criteria, we propose a formal methodology that permits to compare multiple split criteria. We introduce a new family of split functions, which have a more stable behavior than the classical split criteria when larger and larger data sets are used to construct decision trees. Our new technique of sampling applied to large data sets guarantees that the decision trees inferred from the whole database as well as from the sampled database are almost always identical.
Our research is concentrated on decision trees in the context of large data sets. In general, three main questions about the induction of decision trees are investigated:

1. Will the Gini Index and Information Gain split criteria select the same attribute to split on in any situation? If not, what are the conditions satisfied by the database's parameters?
2. What will be the behavior of the Gini Index and Information Gain split functions when the size of the database from which the decision trees are inferred is increased? Will exist other split functions with a more stable behavior in the case of increasing size of databases?
3. Will the trees built from large sets of examples, using the Gini Index and information Gain criteria, be equivalently to those constructed from the sampled sets of examples? What is the suitable procedure of sampling to be applied in order to keep the same structure and accuracy of the decision trees inferred from the un-sampled data sets?

The given answers are the following:

I. A formal methodology is introduced us to analytically compare multiple split criteria.
II. The efficiency of existing decision tree algorithms has been well established for relatively small data sets. Efficiency and scalability become issues of concern when these algorithms are applied to the mining of very large real-world databases. A new family of split functions is introduced and we are able to show, that some split functions of our family behave in a very stable way and, furthermore, the created trees were superior to the trees inferred using the classical split criteria.
III. The enormity of some data sets used in recent practical applications prompts an investigation of whether such training sets are necessary and how they might be reduced without sacrificing accuracy. Basing the construction of decision trees using our new technique of sampling we assure the creation of trees equivalent in structure and quality with those we would have obtained if the whole database would had been used. This technique of compaction of examples of the data set used before the application of decision tree algorithm is clearly a useful weapon to have when attacking huge data sets.
Mots-clés decision trees, Gini index, split criteria
Page internet http://www2.unine.ch/imi/page-17355.html
Type de projet Recherche fondamentale
Domaine de recherche Data Mining
Source de financement FNS
Etat Terminé
Début de projet 1-10-1999
Fin du projet 30-9-2001
Budget alloué 74 258
Contact Paul Cotofrei