Gradient Boosting Algorithm
Although the Gradient Boosting algorithm has proved itself worthy in various classification and regression tasks, it is still a black box to many people (myself included), so let's break it open.
In this blog post, I assume you have knowledge of ensemble learning, boosting, and gradient descent.
Gradient boosting is an ensemble learning algorithm. It is based on the strong theoretical concept of sequentially combining weak predictors to build a strong predictor. Here, a predictor can be any machine learning algorithm, such as SVM, logistic regression, KNN, or a decision tree. But the decision tree version of gradient boosting is far more popular than any of the other algorithms mentioned, so in this blog post we will also use a decision tree as the predictor. This version is commonly known as Gradient Boosted Decision Trees (GBDT).
Gradient Boosting
It is a supervised learning algorithm in which a strong predictor is built in an additive or sequential manner from weak predictors, typically decision trees. It is used for both classification and regression tasks.
So how is Gradient Boosting able to do this?
Gradient boosting is based on the theory of additive modelling, and this is what helps the algorithm generalize well on input data.
Before diving deeper into the world of gradient boosting, it is important to understand the concept of additive modelling and how it actually helps a model generalize well.
Additive Modelling!
Gradient boosting is an additive modelling algorithm. Mathematically, the idea of additive modelling is to add up a bunch of simple functions to approximate a complex function. Here, the approximated function is nothing but a model, which maps a set of input features to the target variable.
Let's understand the concept of additive modelling with an example. Consider the following curve, and suppose you are asked to approximate it analytically.
So what would your first few steps in guessing the equation of the curve be? If you are smart, these will be your guesses:
- The intercept is 30.
- The slope between the two gray dots is 1.
- The periodic signal may be sin(x).
So your guess will be f(x) = x + sin(x) + 30, built up term by term:
Fig 1. — f(x) = 30
Fig 2. — f(x) = x + 30
Fig 3. — f(x) = x + sin(x) + 30
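As a quick sanity check of this guess, here is a minimal sketch that recreates the three guesses from Figs 1, 2 and 3 and plots them against the target curve. The x-range is my assumption, since the original figure is not reproduced here:

```python
# Minimal sketch: plot the three successive guesses against the target curve.
# Assumption: the curve is shown over x in [0, 100]; adjust to match the figure.
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 100, 500)
target = x + np.sin(x) + 30                 # the curve we are trying to guess

guesses = [
    ("Fig 1: f(x) = 30", np.full_like(x, 30.0)),
    ("Fig 2: f(x) = x + 30", x + 30),
    ("Fig 3: f(x) = x + sin(x) + 30", x + np.sin(x) + 30),
]
for label, guess in guesses:
    plt.plot(x, guess, label=label)
plt.plot(x, target, "k--", label="target curve")
plt.legend()
plt.show()
```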
But the above example was easy to guess. What if we can't guess the equation from the curve analytically? How do we do it then?
Let's work through the same example step by step (a small numeric check in code follows this list):
- Let's take a random first guess f⁰(x) = 10. After plotting, we will see that this guess is not accurate. How far off we are is measured by the loss, i.e. the difference between the actual curve and our current guess.
- So we calculate the loss as L = f(x) − f⁰(x) = 20 + x + sin(x).
- Next, we define another function f¹ that approximates the curve of the loss L. That function can be f¹(x) = 20.
- Combining f⁰ and f¹ we get f^c = 10 + 20 = 30. After plotting, we will see that our guess is still not accurate.
- Repeating the loss step, we calculate L = f(x) − f^c(x) = x + sin(x).
- Repeating the approximation step, we define another function f² for the loss L: f²(x) = x.
- Combining f⁰, f¹ and f² we get f^c = 10 + 20 + x = x + 30. After plotting, we will see that our guess is still not accurate.
- Repeating the loss step again, L = f(x) − f^c(x) = sin(x).
- Repeating the approximation step again, we define f³(x) = sin(x).
- Combining f⁰, f¹, f² and f³ we get f^c = 10 + 20 + x + sin(x) = x + sin(x) + 30. After plotting, we will see that our guess is now correct.
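Here is the promised numeric check of the walkthrough above, under the same assumption about the x-range: each new sub-function soaks up what is left of the loss, and the final sum recovers the target curve exactly.

```python
# Numeric check of the step-by-step decomposition above.
import numpy as np

x = np.linspace(0, 100, 500)
target = x + np.sin(x) + 30            # the unknown curve f(x)

f0 = np.full_like(x, 10.0)             # random first guess f0(x) = 10
loss1 = target - f0                    # = 20 + x + sin(x)
f1 = np.full_like(x, 20.0)             # constant part of the loss
loss2 = target - (f0 + f1)             # = x + sin(x)
f2 = x                                 # linear part of the loss
loss3 = target - (f0 + f1 + f2)        # = sin(x)
f3 = np.sin(x)                         # periodic part of the loss

composite = f0 + f1 + f2 + f3
print(np.allclose(composite, target))  # True: the sum recovers the curve
```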
Decomposing a complex function into simple sub-functions is similar to a divide-and-conquer strategy. So, in the above example, the function f is divided into the sub-functions f⁰, f¹, f² and f³.
Mathematically, in additive modelling a function f can be represented as the sum of its sub-functions,
f = f⁰ + f¹ + f² + … = Σ fᵐ, summing over all sub-functions m.
In the machine learning world we are given data points instead of a continuous function, and the goal is to draw a nice curve through the data points so that the model generalizes well. Adding a bunch of sub-functions to create a complex function that models the input data points well is what is known as additive modelling. So gradient boosting uses additive learning to gradually nudge an approximate model towards a really good model, by adding simple models to a composite model.
Algorithm for Gradient boosting:
Step 1: Find the constant value at which your target variable has minimum loss. For squared-error loss, that value is nothing but the average of your target variable. It can be written as Fm_predicted:
when m=0
F0_predicted(x)=mean(y)
Step 2: Calculate the cost function and its corresponding gradient g.
L=(y-Fm_predicted(x))²
g=(y-Fm_predicted(x))
when m=0
L=(y-F0_predicted(x))²
g=(y-F0_predicted(x))
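As a sketch, Steps 1 and 2 look like this on some made-up target values (the numbers are purely illustrative):

```python
# Steps 1 and 2 on toy data: with squared loss, the best constant prediction
# is the mean of y, and the gradient g is just the residual.
import numpy as np

y = np.array([12.0, 15.0, 19.0, 25.0, 31.0])  # illustrative target values

F0 = np.full_like(y, y.mean())                # Step 1: F0_predicted(x) = mean(y)
L = (y - F0) ** 2                             # Step 2: squared loss per sample
g = y - F0                                    # residuals (negative gradient, up to a factor)
print(F0, g)
```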
Step 3: Take a decision tree model H and fit it on the input features x with the gradient g as the target variable, so that H learns to predict g.
H=Decision_tree_model(x, g)
Step 4: Add the model H to the current model, weighted by the learning rate α.
F(m+1)_predicted=Fm_predicted + α*H(x)
when m=0
F1_predicted=F0_predicted + α*H(x)
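Steps 3 and 4 might look like the following sketch, using scikit-learn's DecisionTreeRegressor as the weak predictor H. The toy data and the value α = 0.1 are my assumptions:

```python
# Steps 3 and 4: fit a shallow tree to the residuals g (on the features x),
# then add its prediction, scaled by the learning rate alpha, to the model.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

x = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])   # illustrative features
y = np.array([12.0, 15.0, 19.0, 25.0, 31.0])

F0 = np.full_like(y, y.mean())                      # Step 1
g = y - F0                                          # Step 2: residuals

H = DecisionTreeRegressor(max_depth=1).fit(x, g)    # Step 3: weak predictor on g
alpha = 0.1                                         # learning rate (assumed value)
F1 = F0 + alpha * H.predict(x)                      # Step 4: additive update
print(F1)
```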
Step 5: m=m+1
Step 6: After repeating steps 2, 3 and 4, we will get F2_predicted.
Step 7: Repeating steps 2, 3 and 4 again, we will get F3_predicted, and so on.
Here M, the total number of weak predictors, is very important in controlling over-fitting; it is chosen by cross-validation.
So, F(x) = F0_predicted(x) + Σ α*Hm(x), summing from m = 1 to M, where Hm is the weak predictor fitted at iteration m.
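Putting all of the steps together, here is a minimal from-scratch sketch of a gradient boosting regressor under squared loss. The function names gbdt_fit and gbdt_predict and the default hyper-parameters are my own choices for illustration; in practice M and α would be tuned by cross-validation.

```python
# A minimal from-scratch gradient boosting regressor (squared loss),
# following Steps 1-7 above with shallow sklearn trees as weak predictors.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbdt_fit(x, y, M=100, alpha=0.1, max_depth=2):
    F = np.full(len(y), y.mean())                 # Step 1: F0 = mean(y)
    trees = []
    for _ in range(M):                            # Steps 2-7, repeated M times
        g = y - F                                 # Step 2: residuals
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(x, g)  # Step 3
        F = F + alpha * tree.predict(x)           # Step 4: additive update
        trees.append(tree)
    return y.mean(), trees, alpha

def gbdt_predict(x, base, trees, alpha):
    F = np.full(len(x), base)                     # start from F0
    for tree in trees:
        F += alpha * tree.predict(x)              # add each weak predictor
    return F

# Usage on the toy curve from the additive modelling example:
x = np.linspace(0, 10, 200).reshape(-1, 1)
y = x.ravel() + np.sin(x.ravel()) + 30
base, trees, alpha = gbdt_fit(x, y)
print(np.mean((y - gbdt_predict(x, base, trees, alpha)) ** 2))  # training MSE
```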
I hope this blog post was able to explain the length and breadth of the gradient boosting algorithm. There is still a lot to explain about boosting and efficient boosting techniques; those will be covered in my future articles. STAY TUNED!!!
Thank You!!