Ensemble learning

Rahul Kumar
5 min read · Jan 6, 2019

Ensemble learning is a machine learning paradigm where multiple learners are trained to solve the same problem. In contrast to ordinary machine learning approaches, which try to learn one hypothesis from the training data, ensemble methods construct a set of hypotheses and combine them for use.

Ensemble learning comprises two phases:

1) Multiple Learners

2) Combine the Decisions

***Multiple Learners: This is simply the generation of a group of base learners. These base learners should differ from each other in one of the following ways:

a) Different Algorithms: Feed the training examples to learners that use different algorithms, for example decision trees, SVMs, neural networks, etc.

b) Different Hyper-Parameters: Here we use learners based on the same algorithm but make them different through the hyper-parameters supplied to them. Example: for a neural network we can choose the topology by setting the number of hidden layers and the number of nodes per hidden layer; for decision trees we can pick different strategies for splitting the attributes at the nodes, such as information gain or the Gini index.

c) Different Training Sets: A learning algorithm may have high variance, so when fed different subsets of the data it generates different models. That's why, by feeding different subsets of data to a high-variance algorithm, we can generate distinct models (a short sketch covering all three sources of diversity follows this list).
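To make this concrete, here is a minimal Python sketch of the three sources of diversity: different algorithms, different hyper-parameters, and different bootstrapped training sets. It uses scikit-learn, which the post does not otherwise assume; the dataset and settings are purely illustrative.

    # Minimal sketch of building diverse base learners (illustrative only).
    from sklearn.datasets import make_classification
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.svm import SVC
    from sklearn.neural_network import MLPClassifier
    from sklearn.utils import resample

    X, y = make_classification(n_samples=500, random_state=0)

    # a) Different algorithms trained on the same data.
    learners_by_algorithm = [
        DecisionTreeClassifier(random_state=0),
        SVC(probability=True, random_state=0),
        MLPClassifier(max_iter=1000, random_state=0),
    ]

    # b) Same algorithm, different hyper-parameters (splitting criterion, depth).
    learners_by_hyperparams = [
        DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0),
        DecisionTreeClassifier(criterion="entropy", max_depth=None, random_state=0),
    ]

    # c) Same algorithm, different bootstrap samples of the training set.
    learners_by_data = []
    for seed in range(3):
        X_bs, y_bs = resample(X, y, replace=True, random_state=seed)
        learners_by_data.append(DecisionTreeClassifier(random_state=0).fit(X_bs, y_bs))

    for clf in learners_by_algorithm + learners_by_hyperparams:
        clf.fit(X, y)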

***Combine the Decisions:

This involves the mechanism by which the learners are combined; below are some common ways:

a) Unweighted Voting: All learners are weighted equally, i.e., the ensemble takes a simple majority vote.

b) Weighted Voting: Learners are weighted according to some metric, such as accuracy or variance (both schemes are sketched below).
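Here is a minimal sketch of the two combination schemes. It assumes "learners" is a list of already-fitted classifiers with integer class labels 0..K-1, and that the weights come from, say, each learner's validation accuracy (an assumption; any quality metric would do).

    # Minimal sketch of unweighted vs. weighted voting (assumes integer labels 0..K-1).
    import numpy as np

    def unweighted_vote(learners, X):
        # Every learner gets one equal vote; the majority class wins.
        votes = np.stack([clf.predict(X) for clf in learners])  # shape: (n_learners, n_samples)
        return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)

    def weighted_vote(learners, weights, X):
        # Each learner's vote counts proportionally to its weight (e.g. its accuracy).
        votes = np.stack([clf.predict(X) for clf in learners])
        n_classes = votes.max() + 1
        scores = np.zeros((X.shape[0], n_classes))
        for w, row in zip(weights, votes):
            scores[np.arange(X.shape[0]), row] += w
        return scores.argmax(axis=1)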

Why Ensemble Performs Well:

We are aware of the bias-variance trade-off, which states that the lower the bias, the higher the variance, and vice versa. But our objective is always a model that has low bias as well as low variance. Ensembling is one of the techniques through which we can meet this objective.

To make an ensemble effective, we need to focus on making the errors (outputs) of the learners independent of each other.

The bias-variance decomposition is often used in studying the performance of ensemble methods. It is known that bagging can significantly reduce the variance, and therefore it is best applied to learners that suffer from large variance, e.g., unstable learners such as decision trees or neural networks. Boosting can significantly reduce the bias in addition to reducing the variance, and therefore, on weak learners such as decision stumps, boosting is usually more effective.

For example, take a two-class problem with n independent learners, each with accuracy p > 0.5. If n1 learners say class1 and n2 say class2, the majority vote picks class1 whenever n1 > n2, i.e., whenever more than n/2 learners are correct. So Probability(majority correct) = 1 − BinomialCDF(⌊n/2⌋; n, p), which grows towards 1 as n increases.
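A quick sketch of this calculation with scipy, assuming the learners' errors really are independent (the idealized case):

    # Probability that a majority vote of n independent learners (accuracy p) is correct.
    from scipy.stats import binom

    def majority_vote_accuracy(n_learners, p):
        # Correct when more than n/2 learners are correct: 1 - CDF at floor(n/2).
        return 1.0 - binom.cdf(n_learners // 2, n_learners, p)

    for n in (1, 5, 11, 25):
        print(n, round(majority_vote_accuracy(n, 0.6), 3))
    # Accuracy climbs from 0.6 for a single learner towards 1.0 as n grows,
    # provided the learners' errors are independent.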

Architecture Design:

Methods For Ensemble:

1) Bagging:

We have already seen that in order to use an ensemble of learners, and to be able to combine them for better accuracy, we need a set of independent learners.

So, to make the learners independent, in bagging we normally bootstrap the data, i.e., we create different random sample subsets of the original data-set by sampling with replacement. We do not use disjoint (partitioned) subsets of the data-set, because that reduces the training sample size for each learner, which then will not generalize well.

To achieve independence of the outputs we need unstable learners, i.e., learners with high variance, e.g., decision trees or neural networks.

From a sample of size n, the probability that a given instance is never selected when drawing a bootstrap sample of size n (with replacement) is (1 − 1/n)^n ≈ e^(−1) ≈ 0.368, so the probability that it appears at least once is about 1 − 0.368 = 0.632. Each bootstrap sample therefore covers roughly 63% of the distinct original instances.
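A quick numeric check of these probabilities (illustrative only):

    # Bootstrap selection probabilities for a sample of size n.
    n = 1000
    p_not_selected = (1 - 1 / n) ** n   # ~ e^-1 ~ 0.368
    p_selected = 1 - p_not_selected     # ~ 0.632
    print(round(p_not_selected, 3), round(p_selected, 3))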

These bootstrap data-sets are then fed to the learner to create different distinct models, which are then combined using a voting mechanism to yield the final result.

Example: Random Forests
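Here is a minimal hand-rolled bagging sketch along those lines; scikit-learn's BaggingClassifier and RandomForestClassifier implement the same idea, and the dataset, number of trees, and labels (0/1) below are assumptions made only for illustration.

    # Minimal bagging sketch: bootstrap the data, fit unstable learners, majority-vote.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.utils import resample

    X, y = make_classification(n_samples=1000, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    models = []
    for seed in range(25):
        X_bs, y_bs = resample(X_train, y_train, replace=True, random_state=seed)
        models.append(DecisionTreeClassifier(random_state=seed).fit(X_bs, y_bs))

    # Unweighted majority vote over the 25 bootstrapped trees (labels are 0/1 here).
    votes = np.stack([m.predict(X_test) for m in models])
    ensemble_pred = (votes.mean(axis=0) >= 0.5).astype(int)
    print("single tree :", models[0].score(X_test, y_test))
    print("bagged trees:", (ensemble_pred == y_test).mean())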

2) Boosting:

  • It's an iterative procedure that adaptively changes the distribution of the training data.

— Initially, all N records are assigned equal weights; the weights change at the end of each boosting round.

  • On Each Iteration t:

— Re-weight each training example by how incorrectly it was classified

— Learn a hypothesis h(t)

— The strength of this hypothesis: alpha(t)

  • Final Classifier:

— A linear combination of the votes of the different classifiers, weighted by their strengths (written out below)

  • Weak learners:

— P(correct) > 50%, but not necessarily much better.
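For reference, the original post does not write this final classifier out; in standard notation, for two classes labelled ±1, it is usually given as

    H(x) = \operatorname{sign}\Big(\sum_{t=1}^{T} \alpha_t \, h_t(x)\Big)

where h_t is the hypothesis learned at round t and alpha(t) is its strength.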

One of the basic algorithms is AdaBoost:

Let’s see the mechanism:

Calculation of Alpha(t):

Alpha(t) is calculated from the weighted error of the respective learner, and is chosen so as to minimize the training error of the combined classifier.

The error here is simply the weighted sum over the misclassified instances, i.e., the sum of the weights (probabilities) of the instances the learner got wrong.

Once Alpha(t) is calculated for all of the T different learners, we can perform the voting mechanism, weighting each learner by its alpha value, to get the desired output H(X).
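Below is a minimal from-scratch AdaBoost sketch along these lines. It assumes a two-class problem with labels -1/+1 and uses decision stumps (depth-1 trees) as the weak learners; the formulas for the weighted error, alpha(t) and the weight update are the standard AdaBoost ones, while the variable names and stopping rule are illustrative choices.

    # Minimal AdaBoost sketch (two classes labelled -1/+1, decision-stump weak learners).
    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def adaboost_fit(X, y, T=50):
        n = len(y)
        w = np.full(n, 1.0 / n)              # initially all N records weighted equally
        stumps, alphas = [], []
        for t in range(T):
            stump = DecisionTreeClassifier(max_depth=1)
            stump.fit(X, y, sample_weight=w)
            pred = stump.predict(X)
            err = w[pred != y].sum()         # weighted sum over misclassified instances
            if err >= 0.5:                   # no better than chance: stop
                break
            alpha = 0.5 * np.log((1 - err) / max(err, 1e-10))
            # Increase weights of misclassified records, decrease the rest, renormalize.
            w *= np.exp(-alpha * y * pred)
            w /= w.sum()
            stumps.append(stump)
            alphas.append(alpha)
        return stumps, alphas

    def adaboost_predict(stumps, alphas, X):
        # H(X): sign of the alpha-weighted vote of the T learners.
        scores = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
        return np.sign(scores)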

Applications: Ensemble learning has already been used in diverse applications such as optical character recognition, text categorization, face recognition, computer-aided medical diagnosis, gene expression analysis, etc. In fact, ensemble learning can be used wherever machine learning techniques can be used.

Summary: Ensemble learning is a powerful machine learning paradigm which has exhibited apparent advantages in many applications. By using multiple learners, the generalization ability of an ensemble can be much better than that of a single learner. A serious deficiency of current ensemble methods is the lack of comprehensibility, i.e., the knowledge learned by ensembles is not understandable to the user.

Thank You
