Decision tree algorithm is one of the most popular machine learning algorithm. It is a supervised machine learning algorithm, used for both classification and regression task. It is a model that uses set of rules to classify something.
This is the PART I of Decision Tree Tutorial.
Link For PART II DECISION TREE TUTORIAL
Lets see decision tree with this simple example, It is normal “AND’ operation problem, where ‘A’, ‘B’ are features and “A and B” are corresponding labels.
If A=F then result=F
If A=T and B=T, then result=T
If A=T and B=F, then result = F
This is a an example of binary classifier. It classify “And” operation is ‘False’ or ‘True’.
This story is about understanding the full mathematical concept of Decision tree algorithm. So,lets break it!!!
Before diving deeper into the concept lets understand about impurity, types of measures of impurity. it will help us to understand the algorithm better.
Impurity
Let’s understand Impurity with the following toy example,
From above image, a ball is randomly drawn from each bowl. So how much information you needed to accurately tells the color of ball. So, left bowl needed less information as all of the ball is red colored, central bowl needed more information than left bowl to tell it accurately, and right bowl needed maximum information as both number of both color ball are same.
As information is measure of purity, so we can say that left bowl is pure node, middle is less impure and right is more impure.
So, how can we measure impurity in sample??
There are couple of impurity measures are there, but in this story we will talk about only two such measure,
- Entropy
- Gini index/ Gini impurity
Entropy
Entropy is amount of information is needed to accurately describe the some sample. So if sample is homogeneous, means all the element are similar than Entropy is 0, else if sample is equally divided than entropy is maximum 1.
So, left bowl has lowest entropy, middle bowl has more entropy and right bowl has highest entropy.
Mathematically it is written as ,
Gini index / Gini impurity
Gini index is measure of inequality in sample. It has value between 0 and 1. Gini index of value 0 means sample are perfectly homogeneous and all element are similar, whereas, Gini index of value 1 means maximal inequality among elements. It is sum of the square of the probabilities of each class. It is illustrated as,
So what is the importance of impurity measure in decision tree??
Impurity measures the homogeneity in the data sample. If the sample is homogeneous then sample are from same class.
Decision Tree Algorithm
Decision tree algorithm is a tree where nodes represents features(attributes), branch represents decision(rule) and leaf nodes represents outcomes(discrete and continuous).
So how decision tree algorithm are built actually??
There are Various algorithm that are used to generate decision tree from data, some are as following,
- Classification and regression tree CART
- ID 3
- CHAID
- ID 4.5
In this tutorial we will only talk about CART and next tutorial will explain concept of ID3 algorithm. These are mostly used in the industry.
So lets start with a generating a classification tree with the help of CART algorithm,
CART
- It is used for generating both classification tree and regression tree.
- It uses Gini index as metric/cost function to evaluate split in feature selection in case of classification tree.
- It is used for binary classification.
- It use least square as a metric to select features in case of Regression tree.
Lets start with generating classification tree.
Lets start with weather data set, which is quite famous in explaining decision tree algorithm,where target is to predict play or not( Yes or No) based on weather condition.
From data, outlook, temperature, humidity, wind are the features of data.
So, lets start building tree,
Outlook
Outlook is a nominal feature. it can take three value, sunny, overcast and rain. Lets summarize the final decision for outlook features,
Gini index (outlook=sunny)= 1-(2/5)²-(3/5)² = 1- 0.16–0.36 = 0.48
Gini index(outlook=overcast)= 1- (4/4)²-(0/4)² = 1- 1- 0 = 0
Gini index(outlook=rainfall)= 1- (3/5)² -(2/5)² = 1- 0.36- 0.16 = 0.48
Now , we will calculate the weighted sum of Gini index for outlook features,
Gini(outlook) = (5/14)*0.48 + (4/14) *0 + (5/14)*0.48 = 0.342
Temperature
Similarly, temperature is also a nominal feature, it can take three values, hot,cold and mild. lets summarize the final decision of temperature feature,
Gini(temperature=hot) = 1-(2/4)²-(2/4)² = 0.5
Gini(temperature=cool) = 1-(3/4)²-(1/4)² = 0.375
Gini(temperature=mild) = 1-(4/6)²-(2/6)² = 0.445
Now, the weighted sum of Gini index for temperature features can be calculated as,
Gini(temperature)= (4/14) *0.5 + (4/14) *0.375 + (6/14) *0.445 =0.439
Humidity
Humidity is a binary class feature , it can take two value high and normal.
Gini(humidity=high) = 1-(3/7)²-(4/7)² = 0.489
Gini(humidity=normal) = 1-(6/7)²-(1/7)² = 0.244
Now, the weighted sum of Gini index for humidity features can be calculated as,
Gini(humidity) = (7/14) *0.489 + (7/14) *0.244=0.367
Wind
wind is a binary class feature , it can take two value weak and strong.
Gini(wind=weak)= 1-(6/8)²-(2/8)² = 0.375
Gini(wind=strong)= 1-(3/6)²-(3/6)²= 0.5
Now, the weighted sum of Gini index for wind features can be calculated as,
Gini(wind) = (8/14) *0.375 + (6/14) *0.5=0.428
Decision for root node
So,the final decision of all the features,
From table, you can seen that Gini index for outlook feature is lowest. So we get our root node.
Lets calculate the Gini index on sub data set for outlook feature, as you can seen we have three sub data section, sunny, overcast, and rainfall of outlook feature. we will use same method as above to find next split.
So , lets focus on sub data on sunny outlook feature. we need to find the Gini index for temperature, humidity and wind feature respectively.
Gini index for temperature on sunny outlook
Gini(outlook=sunny & temperature=hot) = 1-(0/2)²-(2/2)² = 0
Gini(outlook=sunny & temperature=cool) = 1-(1/1)²-(0/1)² = 0
Gini(outlook=sunny & temperature=mild) = 1-(1/2)²-(1/2)² = 0.5
Now, the weighted sum of Gini index for temperature on sunny outlook features can be calculated as,
Gini(outlook=sunny & temperature)= (2/5) *0 + (1/5) *0+ (2/5) *0.5 =0.2
Gini Index for humidity on sunny outlook
Gini(outlook=sunny & humidity=high) = 1-(0/3)²-(3/3)² = 0
Gini(outlook=sunny & humidity=normal) = 1-(2/2)²-(0/2)² = 0
Now, the weighted sum of Gini index for humidity on sunny outlook features can be calculated as,
Gini(outlook = sunny & humidity) = (3/5) *0 + (2/5) *0=0
Gini Index for wind on sunny outlook
Gini(outlook=sunny & wind=weak) = 1-(1/3)²-(2/3)² = 0.44
Gini(outlook=sunny & wind=strong) = 1-(1/2)²-(1/2)² = 0.5
Now, the weighted sum of Gini index for wind on sunny outlook features can be calculated as,
Gini(outlook = sunny and wind) = (3/5) *0.44 + (2/5) *0.5=0.266+0.2= 0.466
Decision on sunny outlook factor
we have calculated the Gini index of all the features when the outlook is sunny. You can infer that humidity has lowest value. so next node will be humidity.
Now,Lets focus on sub data for overcast outlook feature.
As, you can see from the above table all the decision for overcast outlook feature is always ‘Yes’. Then Gini index for each feature is 0, means it is a leaf nodes.
Now,Lets focus on sub data for high and normal humidity feature.
From the given two table, the decision is always ‘No’ when humidity is ‘high’ and decision is always ‘Yes’ when humidity is ‘normal’. So we got leaf node. now decision tree can be viewed as,
Now,Lets focus on sub data for rainfall outlook feature. we need to find the Gini index for temperature,humidity and wind feature respectively.
Gini index for temperature for rainfall outlook
Gini(outlook=rainfall and temp.=Cool) = 1 — (1/2)2 — (1/2)2 = 0.5
Gini(outlook=rainfall and temp.=Mild) = 1 — (2/3)2 — (1/3)2 = 0.444
Gini(outlook=rainfall and temp.) = (2/5)*0.5 + (3/5)*0.444 = 0.466
Gini index for humidity for rainfall outlook
Gini(outlook=rainfall and humidity=high) = 1 — (1/2)2 — (1/2)2 = 0.5
Gini(outlook=rainfall and humidity=normal) = 1 — (2/3)2 — (1/3)2 = 0.444
Gini(Outlook=rainfall and humidity) = (2/5)*(0.5 + (3/5)*0.444 = 0.466
Gini index for wind for rainfall outlook feature
Gini(outlook=rainfall and wind=weak) = 1 — (3/3)2 — (0/3)2 = 0
Gini(outlook=rainfall and wind=strong) = 1 — (0/2)2 — (2/2)2 = 0
Gini(outlook=rainfall and wind) = (3/5)*0 + (2/5)*0 = 0
Decision on rainfall outlook factor
we have calculated the Gini index of all the features when the outlook is rainfall. You can infer that wind has lowest value. so next node will be wind.
Now,Lets focus on sub data strong and weak for wind rainfall feature.
From the above two table, the decision is always ‘No’ when wind is ‘strong’ and decision is always ‘Yes’ when wind is ‘weak’. So we got leaf node.
So, we have explained the generation of decision tree in step by step manner.
In the next tutorial we will generate tree with ID3 algorithm
That’s it. Thank you,