Decision Tree: ID3 Algorithm

ID3 stands for Iterative Dichotomiser 3. It is an algorithm used to build a decision tree from a given dataset. It was developed by Ross Quinlan and is the precursor of the C4.5 algorithm. ID3 builds the tree with a top-down, greedy search through the given sets, testing each attribute at every tree node.

As the name suggests, the greedy algorithm makes the choice that looks best at the moment. ID3 uses Entropy and Information Gain to construct the decision tree: at each node it selects the attribute with the maximum Information Gain (IG), which is equivalently the minimum Entropy (H), and splits on it. It is widely used in the machine learning and natural language processing domains.

It follows the principle of Occam's razor and attempts to create the smallest possible decision tree.

The ID3 algorithm works as follows:

  • Step-1: Create a root node for the tree that holds the complete dataset, say S.
  • Step-2: Find the best attribute to split on using a selection measure known as an Attribute Selection Measure (ASM).
  • Step-3: Divide S into subsets, one for each value of the selected attribute.
  • Step-4: Recursively repeat Steps 2 and 3 on each subset created in Step 3. The recursion stops when a subset can no longer be classified any further; that terminal node is called a leaf node. (A minimal Python sketch of this recursion follows the list.)
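The recursion in Steps 1–4 can be summarized in a few lines of Python. This is an illustrative sketch only: the dataset is assumed to be a list of dicts, and the attribute-selection measure (Step 2) is passed in as a function. A trivial placeholder selector is used in the demonstration at the bottom; the information-gain version is developed in the sections below. The names `build_tree` and `choose_attribute` are our own, not part of any library.

```python
from collections import Counter

def build_tree(rows, attributes, target, choose_attribute):
    """Recursive skeleton of ID3's Steps 1-4 (illustrative sketch).

    rows             : the dataset S, a list of dicts
    attributes       : attribute names still available for splitting
    target           : name of the class column
    choose_attribute : the attribute-selection measure (Step 2)
    """
    labels = [row[target] for row in rows]

    # Stop: every example has the same class -> pure leaf node (Step 4).
    if len(set(labels)) == 1:
        return labels[0]

    # Stop: no attributes left -> leaf labelled with the majority class.
    if not attributes:
        return Counter(labels).most_common(1)[0][0]

    # Step 2: select the best attribute with the chosen selection measure.
    best = choose_attribute(rows, attributes, target)

    # Step 3: divide S into subsets, one per value of the selected attribute,
    # then recurse on each subset (Step 4).
    tree = {best: {}}
    remaining = [a for a in attributes if a != best]
    for value in set(row[best] for row in rows):
        subset = [row for row in rows if row[best] == value]
        tree[best][value] = build_tree(subset, remaining, target, choose_attribute)
    return tree

# Tiny demonstration with a placeholder selector (real ID3 uses information gain):
toy = [{"outlook": "sunny", "play": "no"},
       {"outlook": "rainy", "play": "yes"}]
print(build_tree(toy, ["outlook"], "play", lambda rows, attrs, target: attrs[0]))
# -> {'outlook': {'sunny': 'no', 'rainy': 'yes'}}
```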

Attribute Selection Measures

When building a decision tree, the main issue is how to select the best attribute for the root node and for the sub-nodes at each level, which is a complicated step. Simply picking an attribute at random does not solve this problem: a random choice can easily produce a poor tree with low accuracy. To address this, we use a procedure called an Attribute Selection Measure (ASM), which lets us select the best attribute for each node of the tree.

ASM has two popular techniques:

  • Entropy
  • Information gain

Entropy

It is a measure of the randomness in the information being processed. The greater the entropy at a node, the less information is known about the classification of data at this stage of the tree. 

ID3 follows a simple rule: a branch with an entropy of 0 is a pure leaf node, while a branch with entropy greater than zero needs further splitting.

To build a decision tree we need to calculate entropy in two forms:

Entropy for one attribute:

Entropy(S) = -\sum_{i=1}^{n} p_{i} \log_{2}(p_{i})
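As a quick illustration, the formula above can be computed directly in Python. This is a minimal sketch; the helper name `entropy` is ours.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy(S) = -sum_i p_i * log2(p_i), where p_i is the proportion of class i in S."""
    total = len(labels)
    return sum(-(count / total) * log2(count / total)
               for count in Counter(labels).values())

print(entropy(["yes", "yes", "no", "no"]))    # 1.0 -> maximally mixed, needs further splitting
print(entropy(["yes", "yes", "yes", "yes"]))  # 0.0 -> pure leaf node
```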

Entropy for multiple attributes:

Entropy(T,X) = \sum_{n \in X} P(n)\, E(n)
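A sketch of this weighted form, reusing the `entropy` helper from the previous snippet; the list-of-dicts dataset layout and the function name are assumptions made for illustration.

```python
def split_entropy(rows, attribute, target):
    """Entropy(T, X): weighted average entropy of the target column T
    over the subsets produced by splitting the rows on attribute X."""
    total = len(rows)
    result = 0.0
    for value in set(row[attribute] for row in rows):
        subset = [row[target] for row in rows if row[attribute] == value]
        result += (len(subset) / total) * entropy(subset)  # P(n) * E(n)
    return result
```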

Information gain

It is a statistical property that measures how well a given attribute separates the training examples according to the target classification. It is the difference between the entropy before the split and the weighted average entropy after the split on the given attribute's values; in other words, information gain is the decrease in entropy after the dataset is split on an attribute.

Constructing a decision tree is all about finding the attribute that returns the highest information gain and the smallest entropy, i.e., the split that yields the most homogeneous branches.

Information Gain(T, X) = Entropy(T) - Entropy(T, X)
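Putting the two entropy helpers above together, information gain is a single subtraction; plugging it into the `build_tree` skeleton from earlier gives a working ID3 sketch. The toy rows below are made up purely for illustration.

```python
def information_gain(rows, attribute, target):
    """Gain(T, X) = Entropy(T) - Entropy(T, X)."""
    return entropy([row[target] for row in rows]) - split_entropy(rows, attribute, target)

data = [
    {"outlook": "sunny",    "windy": "true",  "play": "no"},
    {"outlook": "sunny",    "windy": "false", "play": "no"},
    {"outlook": "overcast", "windy": "false", "play": "yes"},
    {"outlook": "rainy",    "windy": "true",  "play": "yes"},
]
print(information_gain(data, "outlook", "play"))  # 1.0 -> outlook splits the data perfectly
print(information_gain(data, "windy", "play"))    # 0.0 -> windy tells us nothing here

# ID3 picks the attribute with the highest gain at every node, e.g.:
# build_tree(data, ["outlook", "windy"], "play",
#            lambda r, a, t: max(a, key=lambda attr: information_gain(r, attr, t)))
```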