Ensemble Methods: Elegant Techniques to Produce Improved Machine Learning Results

Machine learning, in computing, is where art meets science. Perfecting a machine learning tool is largely a matter of understanding the data and choosing the right algorithm. But why choose one algorithm when you can choose many and make them all work toward one goal: improved results?

Last updated: May 1, 2026

Toptal authors are vetted experts in their fields and write on topics in which they have demonstrated experience. All of our content is peer reviewed and validated by Toptal experts in the same field.

Necati Demir, PhD
17 Years of Experience

Necati holds a PhD degree in Machine Learning and has 14 years of experience in software development.

Ensemble methods are techniques that train multiple models on the same problem and combine their predictions. The elegance is in the simplicity of the idea: any single model has blind spots and makes errors, but if you combine many models that fail in different ways, the errors tend to partly cancel out. You end up with a system that is more accurate and more robust than any of its components, built out of nothing more exotic than averaging, voting, or weighted sums. There is no special class of “ensemble algorithm” you have to learn; you take models you already know how to train, run them in parallel or in sequence, and let basic arithmetic do the rest.

This idea shows up everywhere in applied machine learning. Random Forest, which averages the predictions of many decision trees, is a standard first-pass model for tabular data. Gradient boosted trees (XGBoost, LightGBM, CatBoost) are sequential ensembles, and they power production systems for credit scoring at banks, fraud detection at payment networks, click-through rate prediction in online advertising, demand forecasting in retail, and risk stratification in healthcare. When small accuracy gains translate into millions of dollars or measurable differences in patient outcomes, the extra complexity of an ensemble is usually worth it.

In this post I will cover ensemble methods for classification and describe four widely used approaches: voting, stacking, bagging, and boosting. I will then close with a section on the gradient boosting frameworks (XGBoost, LightGBM, and CatBoost) that have become the workhorses of modern applied ML.

Before going further, a note on terminology. Throughout this article I use “model” to describe the output of an algorithm trained on data, which is then used to make predictions. The algorithm can be anything from logistic regression to a decision tree to a neural network. When models are used as inputs to an ensemble method, they are called “base models,” and the final combined predictor is the “ensemble model.”

Voting and Averaging Based Ensemble Methods

Voting and averaging are two of the easiest examples of ensemble learning in machine learning. They are both easy to understand and implement. Voting is used for classification and averaging is used for regression.

In both methods, the first step is to create multiple classification/regression models using some training dataset. Each base model can be created from different splits of the same training set with the same algorithm, from the same dataset with different algorithms, or by some other method. Modern scikit-learn ships a VotingClassifier (and VotingRegressor) that handles the bookkeeping for you:

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import VotingClassifier
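from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score

# Placeholder data so the snippet runs on its own; substitute your own X and y
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
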
clf1 = LogisticRegression(max_iter=1000)
clf2 = DecisionTreeClassifier(random_state=42)
clf3 = GaussianNB()


voting_clf = VotingClassifier(
    estimators=[('lr', clf1), ('dt', clf2), ('gnb', clf3)],
    voting='hard'  # use 'soft' to average predicted probabilities
)
voting_clf.fit(X_train, y_train)
print(cross_val_score(voting_clf, X_train, y_train, cv=5).mean())

Majority Voting

Every model casts a vote for each test instance, and the final prediction is whichever class receives more than half the votes. If no class clears that threshold, you can either declare the ensemble undecided for that instance or simply pick the class with the most votes (sometimes called "plurality voting"). Scikit-learn's voting='hard' follows the plurality rule: it always returns the class with the most votes and breaks ties in favor of the class that sorts first.
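To make the two rules concrete, here is a minimal NumPy sketch. The function name hard_vote, the require_majority flag, and the use of -1 for "undecided" are illustrative choices rather than part of any library, and the sketch assumes class labels are small non-negative integers.

```python
import numpy as np

def hard_vote(votes, require_majority=False):
    """Combine label votes of shape (n_models, n_samples) into one label per sample.

    With require_majority=True, a sample gets -1 (undecided) unless some
    class wins strictly more than half of the votes.
    """
    votes = np.asarray(votes)
    n_models, n_samples = votes.shape
    n_classes = votes.max() + 1
    result = np.empty(n_samples, dtype=int)
    for j in range(n_samples):
        counts = np.bincount(votes[:, j], minlength=n_classes)
        winner = counts.argmax()  # ties go to the lowest class label
        if require_majority and counts[winner] <= n_models / 2:
            winner = -1  # no class cleared the 50% threshold
        result[j] = winner
    return result

# Three models (rows) voting on four samples (columns), classes 0..2
votes = [[0, 1, 2, 1],
         [0, 1, 0, 2],
         [1, 1, 1, 2]]

plurality = hard_vote(votes)
strict = hard_vote(votes, require_majority=True)
print(plurality)  # [0 1 0 2]
print(strict)     # [ 0  1 -1  2]
```

The third sample is a three-way tie, so the plurality rule falls back to the lowest label while the strict rule reports it as undecided.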

Weighted Voting

Unlike majority voting, where every model has equal say, weighted voting lets you trust some models more than others. The vote of a stronger model is counted multiple times, or equivalently, multiplied by a larger weight. In VotingClassifier you pass an argument such as weights=[2, 1, 1] to do exactly this. Choosing reasonable weights is up to you; a common approach is to pick them by cross-validation or by scoring each candidate weighting on a held-out validation set.
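As a rough illustration of how a weight can flip the outcome, the sketch below adds each model's weight to the class it voted for. The helper weighted_vote is a hypothetical name, and the weights are made up for the example.

```python
import numpy as np

def weighted_vote(votes, weights, n_classes):
    """votes: (n_models, n_samples) integer labels; weights: one weight per model."""
    votes = np.asarray(votes)
    weights = np.asarray(weights, dtype=float)
    n_samples = votes.shape[1]
    totals = np.zeros((n_samples, n_classes))
    for model_votes, w in zip(votes, weights):
        # add this model's weight to the class it voted for, sample by sample
        totals[np.arange(n_samples), model_votes] += w
    return totals.argmax(axis=1)

votes = [[0, 1],   # model A, which we trust the most
         [1, 1],   # model B
         [1, 0]]   # model C

equal = weighted_vote(votes, weights=[1, 1, 1], n_classes=2)
trusted = weighted_vote(votes, weights=[3, 1, 1], n_classes=2)
print(equal)    # [1 1]
print(trusted)  # [0 1]
```

On the first sample, models B and C outvote A under equal weights, but A's weight of 3 outweighs their combined weight of 2.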

Soft Voting (often the better choice)

Hard voting throws away useful information by reducing each model’s output to a single label. Soft voting averages the predicted class probabilities across models and picks the argmax. When your base models are reasonably well calibrated, soft voting is usually preferable. Note that VotingClassifier still defaults to voting='hard', so you have to request voting='soft' explicitly, and all base estimators must implement predict_proba.
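A small worked example shows how the two rules can disagree. The probabilities below are invented for illustration: two models lean weakly toward class 0 and one is very confident in class 1.

```python
import numpy as np

# Predicted probabilities for ONE sample over classes [0, 1], one row per model
probas = np.array([
    [0.55, 0.45],  # model 1 weakly prefers class 0
    [0.60, 0.40],  # model 2 weakly prefers class 0
    [0.05, 0.95],  # model 3 strongly prefers class 1
])

# Hard voting: each model contributes only its top label
hard_labels = probas.argmax(axis=1)
hard_winner = int(np.bincount(hard_labels).argmax())

# Soft voting: average the probabilities, then take the argmax
mean_proba = probas.mean(axis=0)
soft_winner = int(mean_proba.argmax())

print(hard_winner)  # 0
print(mean_proba)   # approximately [0.4 0.6]
print(soft_winner)  # 1
```

Hard voting sides with the two lukewarm models, while soft voting lets the confident model carry the decision. Whether that is desirable depends on how trustworthy the probabilities are, which is why calibration matters.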

Simple and Weighted Averaging (regression)

For regression, the equivalent of voting is averaging. Simple averaging takes the mean of all model predictions for each instance, which often reduces overfitting and produces a smoother regressor:

import numpy as np
# predictions: array of shape (n_samples, n_models)
final_predictions = predictions.mean(axis=1)

Weighted averaging multiplies each model's predictions by a weight and then sums them; when the weights sum to 1, the result is a weighted mean:

weights = np.array([0.5, 0.3, 0.2])  # must sum to 1
final_predictions = (predictions * weights).sum(axis=1)

Scikit-learn’s VotingRegressor accepts the same weights argument as its classifier sibling.
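For completeness, here is a sketch of VotingRegressor on synthetic data. The dataset, the choice of base models, and the 0.7/0.3 weights are placeholders. The last lines recompute the weighted average by hand from the fitted base models to show what the ensemble is doing.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, VotingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Synthetic placeholder data
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

weights = [0.7, 0.3]
reg = VotingRegressor(
    estimators=[
        ("ridge", Ridge()),
        ("rf", RandomForestRegressor(n_estimators=50, random_state=0)),
    ],
    weights=weights,
)
reg.fit(X_tr, y_tr)
ensemble_pred = reg.predict(X_te)

# The same result by hand: weighted average of each fitted base model's predictions
per_model = np.column_stack([est.predict(X_te) for est in reg.estimators_])
manual_pred = np.average(per_model, axis=1, weights=weights)
print(np.allclose(ensemble_pred, manual_pred))
```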

Stacking Multiple Machine Learning Models

Stacking, also known as stacked generalization, is an ensemble method where the models are combined using another machine learning algorithm. The idea is to train base learners on the training data, generate a new dataset whose features are those base learners' predictions, and then train a second-level "meta learner" on that new dataset.

The naive version of stacking has a critical bug: if you train base models on the full training set and then ask them to predict on that same training set, the resulting predictions are overfit and the meta learner will learn from misleadingly optimistic features. The fix is to generate the meta features using out-of-fold (OOF) predictions via cross-validation. Scikit-learn’s StackingClassifier (and StackingRegressor) handles this for you automatically:

from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline


estimators = [
    ('rf', RandomForestClassifier(n_estimators=200, random_state=42)),
    ('svc', make_pipeline(StandardScaler(), SVC(probability=True, random_state=42))),
]


stacking_clf = StackingClassifier(
    estimators=estimators,
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,            # 5-fold OOF predictions for meta features
    passthrough=False  # set True to also pass original features to the meta learner
)
stacking_clf.fit(X_train, y_train)
print(stacking_clf.score(X_test, y_test))

If you want to roll your own stacking pipeline (for example, to mix in models from XGBoost, LightGBM, or CatBoost that you want to tune separately), the conceptual algorithm looks like this:

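import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

base_algorithms = [...]  # list of base estimators
combiner_algorithm = LogisticRegression(max_iter=1000)  # example meta learner; any estimator works

# X_train, y_train, and X_test are assumed to be NumPy arrays
stacking_train = np.zeros((len(y_train), len(base_algorithms)))
stacking_test = np.zeros((len(X_test), len(base_algorithms)))

kf = KFold(n_splits=10, shuffle=True, random_state=42)

for i, base_algorithm in enumerate(base_algorithms):
    test_preds_per_fold = []
    for train_ix, val_ix in kf.split(X_train):
        base_algorithm.fit(X_train[train_ix], y_train[train_ix])
        # out-of-fold predictions become the meta learner's training features
        # (for classifiers, predict_proba usually makes better meta features)
        stacking_train[val_ix, i] = base_algorithm.predict(X_train[val_ix])
        test_preds_per_fold.append(base_algorithm.predict(X_test))
    # average the test predictions across folds for stability
    stacking_test[:, i] = np.mean(test_preds_per_fold, axis=0)

final_predictions = combiner_algorithm.fit(stacking_train, y_train).predict(stacking_test)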
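If you do not need the per-fold test predictions, scikit-learn's cross_val_predict can produce the out-of-fold meta features in one call. The sketch below is one possible arrangement, not a canonical recipe: the synthetic data, the two base models, and the use of positive-class probabilities as meta features are all assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic placeholder data: 600 rows, split 450 train / 150 test
X, y = make_classification(n_samples=600, n_features=15, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

base_models = [
    RandomForestClassifier(n_estimators=50, random_state=0),
    GaussianNB(),
]

# Out-of-fold probabilities: every training row is scored by a model that never saw it
meta_train = np.column_stack([
    cross_val_predict(m, X_tr, y_tr, cv=5, method="predict_proba")[:, 1]
    for m in base_models
])

# Refit each base model on the full training set to build the test-time meta features
meta_test = np.column_stack([
    m.fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
    for m in base_models
])

meta_learner = LogisticRegression().fit(meta_train, y_tr)
stacked_accuracy = meta_learner.score(meta_test, y_te)
print(meta_train.shape, meta_test.shape, round(stacked_accuracy, 3))
```

Refitting on the full training set for test-time features is a different design choice from averaging per-fold test predictions; both are used in practice.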

In production use, stacking can be deepened to multiple levels, where the meta-learner's outputs themselves become inputs to a third-level model. The diminishing returns set in quickly, though, so two levels is the most common configuration in practice.

Bootstrap Aggregating (Bagging)

The name "bootstrap aggregating," shortened to "bagging," summarizes the strategy. The first step in bagging is to create multiple models, all using the same algorithm, but each trained on a different sub-sample of the data drawn with replacement (bootstrap sampling) from the original training set. Some original examples will appear more than once in a given sub-sample, and others will not appear at all. If you want a sub-dataset of m elements, you draw a random element from the original dataset m times. Generating n such datasets means repeating that procedure n times.

import numpy as np


def bootstrap_sample(X, y, m, rng):
    idx = rng.integers(0, len(X), size=m)
    return X[idx], y[idx]
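A quick numerical check of the "some examples repeat, others never appear" claim: when you draw n indices with replacement from n rows, roughly 1 - 1/e (about 63.2%) of the rows end up in the sample, and the rest are "out-of-bag." The sketch below uses an arbitrary n and seed.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# One bootstrap sample of row indices, the same size as the original data
idx = rng.integers(0, n, size=n)

in_bag_fraction = np.unique(idx).size / n

# Rows that were never drawn are "out-of-bag" for this sample
oob_mask = np.ones(n, dtype=bool)
oob_mask[idx] = False
oob_fraction = oob_mask.mean()

print(round(in_bag_fraction, 3), round(oob_fraction, 3))
print(round(1 - np.exp(-1), 3))  # theoretical in-bag fraction for large n
```

The out-of-bag rows act as a free validation set for the model trained on that sample, which is the basis of out-of-bag scoring in bagging and Random Forest implementations.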

The second step in bagging is to aggregate the predictions of the resulting models, typically with voting (for classification) or averaging (for regression). Scikit-learn provides BaggingClassifier and BaggingRegressor to handle the whole pipeline:

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier


bag = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=100,
    max_samples=0.8,
    bootstrap=True,
    n_jobs=-1,           # train base estimators in parallel
    random_state=42
)
bag.fit(X_train, y_train)

Because the bootstrap samples are independent, training is embarrassingly parallel, and n_jobs=-1 will use every available CPU core.

Random Forest is a famous extension of this idea: it bags decision trees while also considering only a random subset of features at each split. That extra randomness decorrelates the trees, so averaging them cancels more of their individual errors, and the forest usually outperforms plain bagged trees. In practice, RandomForestClassifier is what most practitioners reach for first.
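The comparison is easy to run yourself. The sketch below scores plain bagged trees against a random forest with cross-validation on synthetic placeholder data. Which one wins depends on the dataset, so no particular outcome is implied.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic placeholder data with a handful of informative features
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5, random_state=0
)

# Plain bagging: every tree may consider all features at every split
bagged_trees = BaggingClassifier(
    DecisionTreeClassifier(random_state=0), n_estimators=50, random_state=0
)

# Random forest: each split considers only a random subset of features
forest = RandomForestClassifier(
    n_estimators=50, max_features="sqrt", random_state=0
)

scores = {
    name: cross_val_score(model, X, y, cv=5).mean()
    for name, model in [("bagging", bagged_trees), ("random forest", forest)]
}
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```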

Boosting: Converting Weak Models to Strong Ones

"Boosting" is a family of algorithms that convert weak models (any model performing meaningfully better than chance, but still poor in absolute terms) into a strong combined model. Boosting builds the ensemble incrementally: each new model is trained to focus on the instances that the previous models got wrong. Unlike bagging, boosting is sequential by nature, so its base models cannot be trained in parallel (although individual base learners often can be parallelized internally).

The conceptual loop:

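import numpy as np
from sklearn.base import clone

def boost(base_algorithm, X, y, n_rounds):
    """Boosting loop using the discrete AdaBoost update; assumes y is in {-1, +1}."""
    models = []
    weights = np.ones(len(X)) / len(X)  # start uniform
    for _ in range(n_rounds):
        # clone so each round trains a fresh model instead of refitting one object
        model = clone(base_algorithm).fit(X, y, sample_weight=weights)
        preds = model.predict(X)
        # weighted error rate, clipped so the log below stays finite
        error = np.clip(weights[preds != y].sum(), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - error) / error)  # this model's vote weight
        # increase weight of misclassified examples, decrease for correct
        weights = weights * np.exp(-alpha * y * preds)
        weights /= weights.sum()
        models.append((model, alpha))
    return models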

AdaBoost

AdaBoost (Adaptive Boosting), introduced by Yoav Freund and Robert Schapire in 1995, is the canonical boosting algorithm and the work for which they received the 2003 Gödel Prize, one of the most prestigious awards in theoretical computer science. AdaBoost typically uses shallow decision trees (often "stumps" with a single split) as its base learners. At each round, training instances misclassified by previous models receive larger weights, forcing later models to focus on the harder cases. Final predictions are aggregated using weighted voting, where each model's vote is weighted according to its weighted training error: the lower the error, the larger its say.
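One round of the reweighting can be worked through by hand. The sketch below uses the standard discrete AdaBoost formulas with labels in {-1, +1}. The five toy labels and predictions are invented for the example.

```python
import numpy as np

y_true = np.array([1, 1, -1, -1, 1])    # labels in {-1, +1}
y_pred = np.array([1, 1, -1, -1, -1])   # the weak model misses the last example
weights = np.full(5, 0.2)               # uniform starting weights

# Weighted error: total weight of the misclassified examples
error = weights[y_pred != y_true].sum()

# Vote weight of this model: larger when the error is smaller
alpha = 0.5 * np.log((1 - error) / error)

# Misclassified examples are scaled up by e^alpha, correct ones down by e^-alpha
new_weights = weights * np.exp(-alpha * y_true * y_pred)
new_weights /= new_weights.sum()

print(round(float(error), 3))   # 0.2
print(round(float(alpha), 4))   # 0.6931
print(new_weights)              # approximately [0.125 0.125 0.125 0.125 0.5]
```

After the update, the single misclassified example carries half of the total weight, so the next weak learner cannot do well without getting it right.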

In scikit-learn:

from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier


ada = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # decision stumps
    n_estimators=200,
    learning_rate=1.0,
    random_state=42
)
ada.fit(X_train, y_train)

Note that the parameter is estimator (it was renamed from base_estimator in scikit-learn 1.2 and the old name was removed in 1.4).

Gradient Boosting and the Modern Frameworks

AdaBoost was the breakthrough, but in modern practice it has been largely supplanted by gradient boosting. Gradient boosting generalizes AdaBoost: instead of reweighting examples, each new tree is trained to predict the residual errors (more precisely, the negative gradient of a loss function) of the current ensemble. This formulation, due to Jerome Friedman in 1999, lets you optimize any differentiable loss, so the same machinery covers regression, classification, and ranking.
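The residual-fitting idea fits in a few lines for squared-error loss, where the negative gradient is simply y minus the current prediction. This is a bare-bones sketch for intuition, not how the production frameworks are implemented. The function names, the toy sine data, and the hyperparameters are all illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gradient_boosting(X, y, n_rounds=50, learning_rate=0.1, max_depth=2):
    """Gradient boosting for squared error: each tree fits the current residuals."""
    init = y.mean()                      # start from a constant prediction
    pred = np.full(len(y), init)
    trees = []
    for _ in range(n_rounds):
        residuals = y - pred             # negative gradient of 0.5 * (y - pred)**2
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)
        pred = pred + learning_rate * tree.predict(X)  # take a small step
        trees.append(tree)
    return init, trees

def predict_gradient_boosting(init, trees, X, learning_rate=0.1):
    """Sum the shrunken tree outputs on top of the constant start."""
    return init + learning_rate * sum(tree.predict(X) for tree in trees)

# Toy regression problem: a noisy sine wave
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

init, trees = fit_gradient_boosting(X, y)
mse_start = np.mean((y - init) ** 2)
mse_end = np.mean((y - predict_gradient_boosting(init, trees, X)) ** 2)
print(mse_start > mse_end)  # training error drops as trees are added
```

Swapping in a different loss only changes the line that computes the residuals, which is the flexibility the gradient formulation buys.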

Three open-source implementations have come to dominate applied ML for tabular data, and any modern overview of ensembles must cover them.

XGBoost (Extreme Gradient Boosting)

XGBoost, released by Tianqi Chen in 2014, was the framework that triggered the gradient boosting boom. It introduced a regularized objective, efficient sparsity-aware split finding, and excellent parallelization.

import xgboost as xgb
model = xgb.XGBClassifier(
    n_estimators=500,
    learning_rate=0.05,
    max_depth=6,
    subsample=0.8,
    colsample_bytree=0.8,
    eval_metric='logloss',
    tree_method='hist',     # fast histogram-based algorithm
    device='cuda',          # set 'cpu' if no GPU available
    random_state=42
)
model.fit(X_train, y_train, eval_set=[(X_test, y_test)], verbose=False)

LightGBM

LightGBM, open-sourced by Microsoft in 2016 and described in a 2017 paper, builds on the same gradient boosting foundation but introduces leaf-wise tree growth, histogram-based feature binning, gradient-based one-side sampling (GOSS), and exclusive feature bundling (EFB). The result is dramatically faster training and lower memory use, especially on large datasets. It is typically the fastest of the three on CPU.

import lightgbm as lgb
model = lgb.LGBMClassifier(
    n_estimators=500,
    learning_rate=0.05,
    num_leaves=63,
    subsample=0.8,
    subsample_freq=1,       # row subsampling is ignored unless subsample_freq > 0
    colsample_bytree=0.8,
    random_state=42
)
model.fit(X_train, y_train, eval_set=[(X_test, y_test)])

CatBoost

CatBoost, open-sourced by Yandex in July 2017, specializes in handling categorical features natively (no manual one-hot or target encoding required). It uses ordered boosting and ordered target statistics to mitigate the prediction-shift problem that plagues naive target encoding, and grows symmetric (oblivious) trees that act as built-in regularization. CatBoost is known for performing well out of the box with minimal tuning.

from catboost import CatBoostClassifier
model = CatBoostClassifier(
    iterations=500,
    learning_rate=0.05,
    depth=6,
    cat_features=cat_cols,        # list of categorical column names (X_train as a DataFrame) or column indices
    eval_metric='Logloss',
    verbose=False,
    random_seed=42
)
model.fit(X_train, y_train, eval_set=(X_test, y_test))

Which one should you choose? Rough rules of thumb from current practice:

  • CatBoost when you have meaningful categorical features or want strong defaults with minimal tuning.
  • LightGBM when speed and memory matter, especially on very large datasets.
  • XGBoost when you want fine-grained control and the most mature ecosystem (custom loss functions, deployment paths, hardware support).

In competitive and high-stakes applied work, strong solutions often train all three (frequently with several random seeds and hyperparameter configurations each) and stack their out-of-fold predictions with a simple meta learner such as ridge regression or logistic regression. Because the three frameworks grow trees differently, their errors are not perfectly correlated, and the combination tends to beat any single one.

Scikit-learn also includes its own HistGradientBoostingClassifier and HistGradientBoostingRegressor, histogram-based gradient boosters inspired by LightGBM. These are convenient when you want to stay in pure scikit-learn without an extra dependency, and they have improved substantially in recent versions.

Conclusion

Ensemble methods are one of the most reliably effective techniques in the applied machine learning toolkit. They routinely deliver the last few percentage points of accuracy that matter most in production systems, where small accuracy gains can translate into millions of dollars (in advertising or finance) or into measurable improvements in patient outcomes (in healthcare).

That accuracy comes at a cost. Ensembles are harder to interpret than single models, harder to deploy and maintain, and slower at inference time. In regulated industries where a model must be auditable and explainable to a customer, regulator, or doctor, a single well-tuned model (or a constrained interpretable ensemble like a small random forest) is sometimes the better choice. SHAP values and similar tools have helped close this interpretability gap, but they do not eliminate it.

The decision is not all-or-nothing. Many production systems use a single model in the live serving path while ensembles are used during model selection, in offline scoring, or as teacher models for distillation into smaller students. Knowing how to construct, evaluate, and combine ensembles remains a core competency for any practicing machine learning engineer, even when the final model that ships is a single tree.

