Project Title
Formalization and Comparison of Split Criteria for Decision Trees
Internal ID
17340
Principal Investigator
Stoffel, Kilian  
Status
Completed
Start Date
October 1, 1999
End Date
September 30, 2001
Investigators
Raileanu, Laura
Organizations
Institut du management de l'information  
Project Web Site
http://www2.unine.ch/imi/page-17355.html
Identifiers
https://libra.unine.ch/handle/20.500.14713/2811
https://libra.unine.ch/handle/123456789/1577
Keywords
decision trees, Gini index, split criteria
Description
This project addresses decision trees in the context of large data sets. To gain more fundamental insight into how split criteria should be selected, we propose a formal methodology that allows multiple split criteria to be compared. We introduce a new family of split functions that behave more stably than the classical split criteria as larger and larger data sets are used to construct decision trees. Our new sampling technique for large data sets guarantees that the decision trees inferred from the whole database and from the sampled database are almost always identical.
Our research concentrates on decision trees in the context of large data sets. In general, three main questions about the induction of decision trees are investigated:

1. Do the Gini Index and Information Gain split criteria select the same attribute to split on in every situation? If not, under what conditions on the database's parameters do they disagree?
2. How do the Gini Index and Information Gain split functions behave when the size of the database from which the decision trees are inferred is increased? Do other split functions exist with a more stable behavior as the database grows?
3. Are the trees built from large sets of examples, using the Gini Index and Information Gain criteria, equivalent to those constructed from sampled sets of examples? What sampling procedure should be applied in order to preserve the structure and accuracy of the decision trees inferred from the un-sampled data sets?
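Question 1 turns on how the two criteria score the same candidate split. As a minimal illustrative sketch (a toy example, not the project's formal methodology), both impurity measures and their corresponding gains for one binary split can be computed as:

```python
from collections import Counter
from math import log2

def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum_k p_k^2."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Shannon entropy of a list of class labels: -sum_k p_k log2 p_k."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def weighted_impurity(partitions, impurity):
    """Impurity after a split: size-weighted average over the child nodes."""
    n = sum(len(p) for p in partitions)
    return sum(len(p) / n * impurity(p) for p in partitions)

# A candidate binary split of ten examples into two child nodes.
left = ["a", "a", "a", "b"]
right = ["b", "b", "b", "b", "a", "a"]
parent = left + right

gini_gain = gini(parent) - weighted_impurity([left, right], gini)
info_gain = entropy(parent) - weighted_impurity([left, right], entropy)
print(round(gini_gain, 4), round(info_gain, 4))  # gains ≈ 0.0833 and 0.1245
```

Comparing such gains across all candidate attributes is what makes the two criteria agree on some databases and disagree on others, which is exactly what the formal methodology analyzes.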

The answers obtained are the following:

I. A formal methodology is introduced that allows us to compare multiple split criteria analytically.
II. The efficiency of existing decision tree algorithms is well established for relatively small data sets; efficiency and scalability become concerns when these algorithms are applied to very large real-world databases. A new family of split functions is introduced, and we show that some split functions of this family behave very stably and, furthermore, that the resulting trees are superior to the trees inferred using the classical split criteria.
III. The enormity of some data sets used in recent practical applications prompts the question of whether such large training sets are necessary and how they might be reduced without sacrificing accuracy. By basing the construction of decision trees on our new sampling technique, we ensure that the resulting trees are equivalent in structure and quality to those that would have been obtained had the whole database been used. This compaction of the data set's examples before the decision tree algorithm is applied is clearly a useful weapon when attacking huge data sets.
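The premise behind answer III can be illustrated with plain random sampling (the project's own sampling technique is not detailed here, and the synthetic label distribution below is an assumption chosen for illustration): the impurity estimated from a sample converges to the full-database value as the sample grows, so the attribute rankings that drive tree construction tend to be preserved:

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum_k p_k^2."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

random.seed(0)  # make the sketch reproducible

# A synthetic "large" label column: 70% class "a", 30% class "b".
full = ["a"] * 7000 + ["b"] * 3000
full_gini = gini(full)  # 1 - (0.7^2 + 0.3^2) = 0.42

# Error of the sampled Gini estimate shrinks as the sample grows.
for size in (100, 1000, 5000):
    sample = random.sample(full, size)
    print(size, round(abs(gini(sample) - full_gini), 4))
```

Because the split criterion values computed on samples approach those computed on the whole database, a well-chosen sample supports the same split decisions, which is what makes the compaction technique safe in practice.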