Math behind Decision Tree Algorithm

Decision tree algorithm is one of the most popular machine learning algorithm. It is a supervised machine learning algorithm, used for both classification and regression task. It is a model that uses set of rules to classify something.

This is the PART I of Decision Tree Tutorial.

Link For PART II DECISION TREE TUTORIAL

Lets see decision tree with this simple example, It is normal “AND’ operation problem, where ‘A’, ‘B’ are features and “A and B” are corresponding labels.

Source Hackerearth

If A=F then result=F

If A=T and B=T, then result=T

If A=T and B=F, then result = F

This is a an example of binary classifier. It classify “And” operation is ‘False’ or ‘True’.

This story is about understanding the full mathematical concept of Decision tree algorithm. So,lets break it!!!

Before diving deeper into the concept lets understand about impurity, types of measures of impurity. it will help us to understand the algorithm better.

Let’s understand Impurity with the following toy example,

source Medium

From above image, a ball is randomly drawn from each bowl. So how much information you needed to accurately tells the color of ball. So, left bowl needed less information as all of the ball is red colored, central bowl needed more information than left bowl to tell it accurately, and right bowl needed maximum information as both number of both color ball are same.

As information is measure of purity, so we can say that left bowl is pure node, middle is less impure and right is more impure.

So, how can we measure impurity in sample??

There are couple of impurity measures are there, but in this story we will talk about only two such measure,

  1. Entropy
  2. Gini index/ Gini impurity

Entropy is amount of information is needed to accurately describe the some sample. So if sample is homogeneous, means all the element are similar than Entropy is 0, else if sample is equally divided than entropy is maximum 1.

So, left bowl has lowest entropy, middle bowl has more entropy and right bowl has highest entropy.

Mathematically it is written as ,

Gini index is measure of inequality in sample. It has value between 0 and 1. Gini index of value 0 means sample are perfectly homogeneous and all element are similar, whereas, Gini index of value 1 means maximal inequality among elements. It is sum of the square of the probabilities of each class. It is illustrated as,

i is number of classes

So what is the importance of impurity measure in decision tree??

Impurity measures the homogeneity in the data sample. If the sample is homogeneous then sample are from same class.

Decision tree algorithm is a tree where nodes represents features(attributes), branch represents decision(rule) and leaf nodes represents outcomes(discrete and continuous).

So how decision tree algorithm are built actually??

There are Various algorithm that are used to generate decision tree from data, some are as following,

  1. Classification and regression tree CART
  2. ID 3
  3. CHAID
  4. ID 4.5

In this tutorial we will only talk about CART and next tutorial will explain concept of ID3 algorithm. These are mostly used in the industry.

So lets start with a generating a classification tree with the help of CART algorithm,

  1. It is used for generating both classification tree and regression tree.
  2. It uses Gini index as metric/cost function to evaluate split in feature selection in case of classification tree.
  3. It is used for binary classification.
  4. It use least square as a metric to select features in case of Regression tree.

Lets start with generating classification tree.

Lets start with weather data set, which is quite famous in explaining decision tree algorithm,where target is to predict play or not( Yes or No) based on weather condition.

From data, outlook, temperature, humidity, wind are the features of data.

So, lets start building tree,

Outlook is a nominal feature. it can take three value, sunny, overcast and rain. Lets summarize the final decision for outlook features,

Gini index (outlook=sunny)= 1-(2/5)²-(3/5)² = 1- 0.16–0.36 = 0.48

Gini index(outlook=overcast)= 1- (4/4)²-(0/4)² = 1- 1- 0 = 0

Gini index(outlook=rainfall)= 1- (3/5)² -(2/5)² = 1- 0.36- 0.16 = 0.48

Now , we will calculate the weighted sum of Gini index for outlook features,

Gini(outlook) = (5/14)*0.48 + (4/14) *0 + (5/14)*0.48 = 0.342

Similarly, temperature is also a nominal feature, it can take three values, hot,cold and mild. lets summarize the final decision of temperature feature,

Gini(temperature=hot) = 1-(2/4)²-(2/4)² = 0.5

Gini(temperature=cool) = 1-(3/4)²-(1/4)² = 0.375

Gini(temperature=mild) = 1-(4/6)²-(2/6)² = 0.445

Now, the weighted sum of Gini index for temperature features can be calculated as,

Gini(temperature)= (4/14) *0.5 + (4/14) *0.375 + (6/14) *0.445 =0.439

Humidity is a binary class feature , it can take two value high and normal.

Gini(humidity=high) = 1-(3/7)²-(4/7)² = 0.489

Gini(humidity=normal) = 1-(6/7)²-(1/7)² = 0.244

Now, the weighted sum of Gini index for humidity features can be calculated as,

Gini(humidity) = (7/14) *0.489 + (7/14) *0.244=0.367

wind is a binary class feature , it can take two value weak and strong.

Gini(wind=weak)= 1-(6/8)²-(2/8)² = 0.375

Gini(wind=strong)= 1-(3/6)²-(3/6)²= 0.5

Now, the weighted sum of Gini index for wind features can be calculated as,

Gini(wind) = (8/14) *0.375 + (6/14) *0.5=0.428

So,the final decision of all the features,

From table, you can seen that Gini index for outlook feature is lowest. So we get our root node.

Lets calculate the Gini index on sub data set for outlook feature, as you can seen we have three sub data section, sunny, overcast, and rainfall of outlook feature. we will use same method as above to find next split.

So , lets focus on sub data on sunny outlook feature. we need to find the Gini index for temperature, humidity and wind feature respectively.

Gini(outlook=sunny & temperature=hot) = 1-(0/2)²-(2/2)² = 0

Gini(outlook=sunny & temperature=cool) = 1-(1/1)²-(0/1)² = 0

Gini(outlook=sunny & temperature=mild) = 1-(1/2)²-(1/2)² = 0.5

Now, the weighted sum of Gini index for temperature on sunny outlook features can be calculated as,

Gini(outlook=sunny & temperature)= (2/5) *0 + (1/5) *0+ (2/5) *0.5 =0.2

Gini(outlook=sunny & humidity=high) = 1-(0/3)²-(3/3)² = 0

Gini(outlook=sunny & humidity=normal) = 1-(2/2)²-(0/2)² = 0

Now, the weighted sum of Gini index for humidity on sunny outlook features can be calculated as,

Gini(outlook = sunny & humidity) = (3/5) *0 + (2/5) *0=0

Gini(outlook=sunny & wind=weak) = 1-(1/3)²-(2/3)² = 0.44

Gini(outlook=sunny & wind=strong) = 1-(1/2)²-(1/2)² = 0.5

Now, the weighted sum of Gini index for wind on sunny outlook features can be calculated as,

Gini(outlook = sunny and wind) = (3/5) *0.44 + (2/5) *0.5=0.266+0.2= 0.466

we have calculated the Gini index of all the features when the outlook is sunny. You can infer that humidity has lowest value. so next node will be humidity.

Now,Lets focus on sub data for overcast outlook feature.

As, you can see from the above table all the decision for overcast outlook feature is always ‘Yes’. Then Gini index for each feature is 0, means it is a leaf nodes.

Now,Lets focus on sub data for high and normal humidity feature.

From the given two table, the decision is always ‘No’ when humidity is ‘high’ and decision is always ‘Yes’ when humidity is ‘normal’. So we got leaf node. now decision tree can be viewed as,

Now,Lets focus on sub data for rainfall outlook feature. we need to find the Gini index for temperature,humidity and wind feature respectively.

Gini(outlook=rainfall and temp.=Cool) = 1 — (1/2)2 — (1/2)2 = 0.5

Gini(outlook=rainfall and temp.=Mild) = 1 — (2/3)2 — (1/3)2 = 0.444

Gini(outlook=rainfall and temp.) = (2/5)*0.5 + (3/5)*0.444 = 0.466

Gini(outlook=rainfall and humidity=high) = 1 — (1/2)2 — (1/2)2 = 0.5

Gini(outlook=rainfall and humidity=normal) = 1 — (2/3)2 — (1/3)2 = 0.444

Gini(Outlook=rainfall and humidity) = (2/5)*(0.5 + (3/5)*0.444 = 0.466

Gini(outlook=rainfall and wind=weak) = 1 — (3/3)2 — (0/3)2 = 0

Gini(outlook=rainfall and wind=strong) = 1 — (0/2)2 — (2/2)2 = 0

Gini(outlook=rainfall and wind) = (3/5)*0 + (2/5)*0 = 0

we have calculated the Gini index of all the features when the outlook is rainfall. You can infer that wind has lowest value. so next node will be wind.

Now,Lets focus on sub data strong and weak for wind rainfall feature.

From the above two table, the decision is always ‘No’ when wind is ‘strong’ and decision is always ‘Yes’ when wind is ‘weak’. So we got leaf node.

So, we have explained the generation of decision tree in step by step manner.

In the next tutorial we will generate tree with ID3 algorithm

That’s it. Thank you,

Machine learning | Deep Learning | Reinforcement Learning | Probability