Understanding XGBoost: A Deep Dive into the Mathematics and From-Scratch Implementation
Note: This post was initially drafted with deep research and finalized by me.
In my own experience with forecasting and predictive modeling across the agriculture, automotive, and e-commerce industries, XGBoost (Extreme Gradient Boosting) has been one of the go-to algorithms for production-grade solutions, renowned for its performance on structured-data tasks and its competition-winning results. Yet despite its wide use, its inner mechanics, especially the interplay between second-order optimization, regularization, and tree construction, are often treated as a black box by many ML practitioners.
In this post, I deconstruct XGBoost from first principles, deriving its core objective function, explaining its structural and weight-based regularization, and implementing a simplified version of the exact greedy algorithm in pure Python using only the standard library. This is intended for readers with a strong mathematical background who seek not just intuition, but rigorous understanding.
1. What Is XGBoost? The Additive Model Framework
XGBoost builds predictions using an ensemble of \(K\) regression trees \(f_k\) , combined additively:
\[\hat{y}_i = \sum_{k=1}^K f_k(x_i),\]where:
- \(x_i \in \mathbb{R}^m\) is the \(i\) -th input instance (a data point, like a customer record or house listing),
- \(\hat{y}_i\) is the predicted output (e.g., price, probability),
- Each \(f_k\) is a decision tree that maps inputs to real-valued scores.
At each boosting round \(t\) , the model updates its prediction:
\[\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + f_t(x_i).\]The new tree \(f_t\) is trained to correct the errors (residuals) of the previous model.
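As a minimal sketch of this additive scheme (with `trees` standing in for already-fitted trees \(f_k\); the function name is mine, not an XGBoost API):

from typing import Callable, List

def predict_additive(x: List[float], trees: List[Callable[[List[float]], float]]) -> float:
    # y_hat(x) = sum over k of f_k(x): each tree contributes a real-valued score
    return sum(f(x) for f in trees)

# Boosting appends one tree per round, so the prediction after round t is
# the prediction after round t-1 plus f_t(x).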
🔍 What Is an Instance?
An instance is simply a single row in your dataset—a specific example.
For example, if you’re predicting house prices:
- Instance 1: [bedrooms=3, sqft=2000, year=2005] → price = $400,000
- Instance 2: [bedrooms=2, sqft=1200, year=1998] → price = $250,000
So if you have \(n\) rows, you have \(n\) instances: \(x_1, x_2, ..., x_n\) . Each \(x_i \in \mathbb{R}^m\) is a vector of \(m\) numerical or encoded features.
🔍 What Is a Leaf in a Decision Tree?
A decision tree works like a flowchart. It asks a series of yes/no questions (e.g., “Is size > 1500?”) and routes each instance down a path until it reaches a terminal node, called a leaf.
Each leaf acts like a “bucket” that collects similar instances and assigns them a prediction value—called the leaf weight, \(w_j\) .
If a tree has \(T\) leaves, then every instance ends up in exactly one leaf.
👉 So the tree doesn’t give a unique output for every instance—it groups similar instances and gives them the same score.
This is why decision trees are piecewise constant functions: they partition the input space and assign a constant value to each region.
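A toy illustration of this piecewise-constant behavior, with a single split and made-up leaf values:

def stump(sqft: float) -> float:
    # One split ("Is sqft > 1500?") creates two leaves; every instance routed to
    # the same leaf receives the same leaf weight.
    return -0.4 if sqft <= 1500.0 else 0.7

print(stump(1200.0), stump(2000.0))  # -0.4 0.7  (two regions, two constant scores)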
2. The XGBoost Objective: Balancing Fit and Complexity
The goal at step \(t\) is to learn \(f_t\) that minimizes the total objective:
\[\text{obj}^{(t)} = \sum_{i=1}^n l(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)) + \Omega(f_t),\]where:
- \(l(\cdot)\) : differentiable convex loss (e.g., squared error, logistic loss),
- \(\Omega(f_t)\) : regularization term penalizing tree complexity.
XGBoost uses:
\[\Omega(f_t) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^T w_j^2,\]with:
- \(T\) : number of leaves,
- \(w_j\) : score (leaf weight) at leaf \(j\) ,
- \(\gamma \geq 0\) : penalty per leaf (discourages adding leaves, limiting tree growth),
- \(\lambda \geq 0\) : L2 penalty on leaf weights (controls overfitting).
This dual penalty discourages both deep trees (via \(\gamma\) ) and large leaf outputs (via \(\lambda\) ).
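As a quick sketch, the penalty for a single tree can be computed directly from its leaf weights (the helper and its example values are illustrative, not a library API):

def omega(leaf_weights, gamma=0.0, reg_lambda=1.0):
    # Omega(f) = gamma * T + 0.5 * lambda * sum_j w_j^2
    T = len(leaf_weights)
    return gamma * T + 0.5 * reg_lambda * sum(w * w for w in leaf_weights)

# A 3-leaf tree with weights [0.5, -0.2, 0.1], gamma = 1, lambda = 1:
print(omega([0.5, -0.2, 0.1], gamma=1.0, reg_lambda=1.0))  # 1*3 + 0.5*0.30 = 3.15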
3. Second-Order Taylor Expansion: Making the Objective Tractable
Since \(f_t(x_i)\) is piecewise constant and non-differentiable, we can’t directly optimize the objective. Instead, XGBoost uses a second-order Taylor expansion of the loss \(l\) around the current prediction \(\hat{y}_i^{(t-1)}\) :
\[l(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)) \approx l(y_i, \hat{y}_i^{(t-1)}) + g_i f_t(x_i) + \frac{1}{2} h_i f_t(x_i)^2,\]where:
- \(g_i = \partial_{\hat{y}} l(y_i, \hat{y}_i^{(t-1)})\) : first derivative (gradient),
- \(h_i = \partial^2_{\hat{y}} l(y_i, \hat{y}_i^{(t-1)})\) : second derivative (Hessian).
Dropping constants, the working objective becomes:
\[\text{obj}^{(t)} \approx \sum_{i=1}^n \left[ g_i f_t(x_i) + \frac{1}{2} h_i f_t(x_i)^2 \right] + \Omega(f_t).\]This is now a quadratic function in \(f_t(x_i)\) , which we can optimize efficiently.
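As a concrete sketch, here are \(g_i\) and \(h_i\) for two common losses. The squared-error version uses the \(\tfrac{1}{2}(\hat{y} - y)^2\) convention, which is what the implementation in Section 7 assumes; the logistic version takes the raw (pre-sigmoid) score:

import math

def grad_hess_squared(y_true: float, y_pred: float):
    # l = 0.5 * (y_pred - y_true)^2  ->  g = y_pred - y_true,  h = 1
    return y_pred - y_true, 1.0

def grad_hess_logistic(y_true: float, raw_score: float):
    # logistic loss on the raw score  ->  g = p - y,  h = p * (1 - p)
    p = 1.0 / (1.0 + math.exp(-raw_score))
    return p - y_true, p * (1.0 - p)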
4. Tree Structure and Leaf-Wise Optimization
Let’s define:
- \(q: \mathbb{R}^m \to \{1,\dots,T\}\) : the tree structure, a function that assigns each input \(x_i\) to a leaf \(j\) .
- \(I_j = \{ i \mid q(x_i) = j \}\) : the set of instances that fall into leaf \(j\) .
Then, by definition: \(f_t(x_i) = w_j \quad \text{for all } i \in I_j.\)
That is, all instances in the same leaf get the same prediction from this tree: \(w_j\) .
🧮 Rewriting the Objective by Leaves
Instead of summing over instances one by one, we can group them by leaf:
\[\sum_{i=1}^n g_i f_t(x_i) = \sum_{j=1}^T \sum_{i \in I_j} g_i w_j = \sum_{j=1}^T \left( \sum_{i \in I_j} g_i \right) w_j\] \[\sum_{i=1}^n \frac{1}{2} h_i f_t(x_i)^2 = \sum_{j=1}^T \sum_{i \in I_j} \frac{1}{2} h_i w_j^2 = \sum_{j=1}^T \frac{1}{2} \left( \sum_{i \in I_j} h_i \right) w_j^2\]Now include the regularization \(\Omega(f_t) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^T w_j^2\) , and combine:
\[\text{obj}^{(t)} = \sum_{j=1}^T \left[ \left( \sum_{i \in I_j} g_i \right) w_j + \frac{1}{2} \left( \sum_{i \in I_j} h_i + \lambda \right) w_j^2 \right] + \gamma T\]Let:
- \(G_j = \sum_{i \in I_j} g_i\) : total gradient in leaf \(j\) ,
- \(H_j = \sum_{i \in I_j} h_i\) : total Hessian in leaf \(j\) .
Then: \(\text{obj}^{(t)} = \sum_{j=1}^T \left[ G_j w_j + \frac{1}{2} (H_j + \lambda) w_j^2 \right] + \gamma T\)
✅ Optimal Leaf Weight
This expression is a sum of independent quadratics in \(w_j\) , so we can minimize each term separately.
Take derivative w.r.t. \(w_j\) , set to zero:
\[G_j + (H_j + \lambda) w_j = 0 \quad \Rightarrow \quad w_j^* = -\frac{G_j}{H_j + \lambda}\]This is the best possible value for the leaf score \(w_j\) , given which instances are in that leaf.
📉 Optimal Objective Value
Substitute \(w_j^*\) back in:
\[\text{obj}^* = -\frac{1}{2} \sum_{j=1}^T \frac{G_j^2}{H_j + \lambda} + \gamma T\]This value depends only on how the tree partitions the data (i.e., which \(I_j\) sets are formed). It serves as a score for evaluating tree quality: lower is better.
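Both closed forms translate directly into code (a minimal sketch; the function names are mine, with per-leaf \(G_j\), \(H_j\) supplied by the caller):

def optimal_leaf_weight(G: float, H: float, reg_lambda: float = 1.0) -> float:
    # w_j* = -G_j / (H_j + lambda)
    return -G / (H + reg_lambda)

def structure_score(Gs, Hs, gamma=0.0, reg_lambda=1.0):
    # obj* = -0.5 * sum_j G_j^2 / (H_j + lambda) + gamma * T   (lower is better)
    return -0.5 * sum(G * G / (H + reg_lambda) for G, H in zip(Gs, Hs)) + gamma * len(Gs)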
5. Split Evaluation: The Gain Formula
To grow the tree, we ask: “Should we split this node?” and “How?”
Suppose a node contains instance set \(I\) , and we consider splitting it into left (\(I_L\) ) and right (\(I_R\) ) children.
Let:
- \(G = \sum_{i \in I} g_i\) , \(H = \sum_{i \in I} h_i\) ,
- \(G_L = \sum_{i \in I_L} g_i\) , \(H_L = \sum_{i \in I_L} h_i\) ,
- \(G_R = G - G_L\) , \(H_R = H - H_L\) .
The gain from this split is the reduction in objective:
\[\text{Gain} = \underbrace{\left[ -\frac{1}{2} \frac{G_L^2}{H_L + \lambda} -\frac{1}{2} \frac{G_R^2}{H_R + \lambda} \right]}_{\text{after split}} - \underbrace{\left[ -\frac{1}{2} \frac{G^2}{H + \lambda} \right]}_{\text{before split}} - \gamma\]Simplifying:
\[\text{Gain} = \frac{1}{2} \left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{G^2}{H + \lambda} \right] - \gamma\]
🧠 Interpretation
- The first two terms: improvement from having two specialized leaves.
- The third term: loss from the original (unsplit) leaf.
- \(-\gamma\) : penalty for adding a new leaf.
✅ A split is accepted only if \(\text{Gain} > 0\) . This acts as pre-pruning: only splits that improve the regularized objective are made.
🔥 Note: The original paper uses \(-\gamma\) , not \(-\gamma/2\) . Some implementations mistakenly halve the penalty, weakening regularization.
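A small helper expressing this gain (illustrative names, not XGBoost's API); the same expression reappears inside the split search in Section 7:

def split_gain(GL, HL, GR, HR, gamma=0.0, reg_lambda=1.0):
    # Gain = 0.5 * [GL^2/(HL+l) + GR^2/(HR+l) - (GL+GR)^2/(HL+HR+l)] - gamma
    G, H = GL + GR, HL + HR
    return 0.5 * (GL**2 / (HL + reg_lambda)
                  + GR**2 / (HR + reg_lambda)
                  - G**2 / (H + reg_lambda)) - gamma

# Accept the split only if split_gain(...) > 0.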
6. The Exact Greedy Algorithm: How Splits Are Found
The exact greedy algorithm finds the best split by:
- For each feature, sort instances by feature value.
- For each pair of adjacent distinct values \(v_i, v_{i+1}\) in that sorted order, consider a candidate split at the midpoint \(\frac{v_i + v_{i+1}}{2}\) .
- For each candidate, compute the gain.
- Choose the split with maximum gain (if positive).
This ensures all possible binary splits are evaluated.
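For one feature, the candidate thresholds can be sketched as the midpoints between adjacent distinct sorted values:

def candidate_thresholds(feature_values):
    # Sort the values and take midpoints between adjacent distinct pairs
    vals = sorted(feature_values)
    return [(a + b) / 2.0 for a, b in zip(vals, vals[1:]) if a != b]

print(candidate_thresholds([3.0, 1.0, 2.0, 2.0]))  # [1.5, 2.5]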
🌲 Handling Ordinal Features
XGBoost treats integer-encoded ordinal features (e.g., education level 1–8) as continuous. It generates splits at midpoints like 1.5, 2.5, etc., enforcing monotonic partitions. This is efficient but assumes monotonicity—non-monotonic relationships may be missed.
⏱️ Computational Complexity
Per node: \(O(d \cdot n \log n)\) , due to sorting \(d\) features. This is expensive for large datasets, motivating approximate methods (e.g., quantile sketching).
🛑 Split Validity: min_child_weight
A split is invalid if: \(H_L < \text{min\_child\_weight} \quad \text{or} \quad H_R < \text{min\_child\_weight}.\)
This ensures sufficient data and curvature in each child. In binary classification with logistic loss, \(h_i = p_i(1 - p_i) \to 0\) as the predicted probability approaches 0 or 1, so this prevents splits driven by small groups of instances the model is already confident about.
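A quick numerical illustration of why confidently predicted instances contribute almost nothing to \(H\) under the logistic loss:

# Hessian of the logistic loss: h = p * (1 - p)
for p in (0.5, 0.9, 0.99, 0.999):
    print(p, p * (1 - p))
# h ≈ 0.25, 0.09, 0.0099, 0.000999: as p approaches 0 or 1 the contribution to H
# vanishes, so a child full of confident instances can fail the min_child_weight check.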
7. From-Scratch Python Implementation (Standard Library Only)
Here’s a minimal, pure-Python implementation of the exact greedy algorithm.
from typing import List, Tuple, Optional


class TreeNode:
    def __init__(self):
        self.left: Optional['TreeNode'] = None
        self.right: Optional['TreeNode'] = None
        self.feature: Optional[int] = None      # split feature index (internal nodes)
        self.threshold: Optional[float] = None  # split threshold (internal nodes)
        self.value: Optional[float] = None      # leaf output w_j (leaves only)


class TreeBooster:
    def __init__(self, max_depth: int = 3, min_child_weight: float = 1.0,
                 gamma: float = 0.0, reg_lambda: float = 1.0):
        self.max_depth = max_depth
        self.min_child_weight = min_child_weight
        self.gamma = gamma
        self.reg_lambda = reg_lambda
        self.root: Optional[TreeNode] = None

    def _compute_gradients(self, y_true: float, y_pred: float) -> Tuple[float, float]:
        # Example: squared error loss (0.5 * (y_pred - y_true)^2)
        grad = y_pred - y_true
        hess = 1.0  # hessian = 1 for squared loss
        return grad, hess

    def _find_better_split(self, X: List[List[float]], y_true: List[float], y_pred: List[float],
                           idxs: List[int]) -> Tuple[Optional[int], Optional[float], Optional[float]]:
        best_gain = -float('inf')
        best_feature = None
        best_threshold = None
        best_value = None
        # Compute total G and H for the current node
        G = sum(self._compute_gradients(y_true[i], y_pred[i])[0] for i in idxs)
        H = sum(self._compute_gradients(y_true[i], y_pred[i])[1] for i in idxs)
        if H < self.min_child_weight:
            return None, None, None
        n_features = len(X[0])
        for feature_idx in range(n_features):
            # Sort indices by feature value (exact greedy: scan every candidate split)
            sorted_idxs = sorted(idxs, key=lambda i: X[i][feature_idx])
            values = [X[i][feature_idx] for i in sorted_idxs]
            GL, HL = 0.0, 0.0
            GR, HR = G, H
            for i in range(len(sorted_idxs) - 1):
                idx = sorted_idxs[i]
                g, h = self._compute_gradients(y_true[idx], y_pred[idx])
                GL += g; HL += h
                GR -= g; HR -= h
                # Skip duplicate values: no valid threshold between equal feature values
                if values[i] == values[i + 1]:
                    continue
                threshold = (values[i] + values[i + 1]) / 2.0
                # Enforce the min_child_weight constraint on both children
                if HL < self.min_child_weight or HR < self.min_child_weight:
                    continue
                # Gain formula from Section 5
                gain = 0.5 * (
                    (GL**2 / (HL + self.reg_lambda)) +
                    (GR**2 / (HR + self.reg_lambda)) -
                    ((GL + GR)**2 / (HL + HR + self.reg_lambda))
                ) - self.gamma
                if gain > best_gain:
                    best_gain = gain
                    best_feature = feature_idx
                    best_threshold = threshold
                    best_value = -GL / (HL + self.reg_lambda)  # left-child weight (informational)
        if best_gain <= 0:
            # Pre-pruning: only accept splits with strictly positive gain
            return None, None, None
        return best_feature, best_threshold, best_value

    def _build_tree(self, X: List[List[float]], y_true: List[float], y_pred: List[float],
                    idxs: List[int], depth: int) -> TreeNode:
        node = TreeNode()
        if depth >= self.max_depth:
            # Max depth reached: make a leaf with optimal weight w* = -G / (H + lambda)
            G = sum(self._compute_gradients(y_true[i], y_pred[i])[0] for i in idxs)
            H = sum(self._compute_gradients(y_true[i], y_pred[i])[1] for i in idxs)
            node.value = -G / (H + self.reg_lambda)
            return node
        feature, threshold, value = self._find_better_split(X, y_true, y_pred, idxs)
        if feature is None:
            # No valid split found: make a leaf
            G = sum(self._compute_gradients(y_true[i], y_pred[i])[0] for i in idxs)
            H = sum(self._compute_gradients(y_true[i], y_pred[i])[1] for i in idxs)
            node.value = -G / (H + self.reg_lambda)
            return node
        node.feature = feature
        node.threshold = threshold
        left_idxs = [i for i in idxs if X[i][feature] <= threshold]
        right_idxs = [i for i in idxs if X[i][feature] > threshold]
        node.left = self._build_tree(X, y_true, y_pred, left_idxs, depth + 1)
        node.right = self._build_tree(X, y_true, y_pred, right_idxs, depth + 1)
        return node

    def fit(self, X: List[List[float]], y: List[float], base_pred: List[float]):
        idxs = list(range(len(X)))
        self.root = self._build_tree(X, y, base_pred, idxs, 0)

    def _predict_row(self, row: List[float]) -> float:
        node = self.root
        while node.value is None:
            if row[node.feature] <= node.threshold:
                node = node.left
            else:
                node = node.right
        return node.value

    def predict(self, X: List[List[float]]) -> List[float]:
        return [self._predict_row(row) for row in X]
✅ Example Usage
X = [[1.0], [2.0], [3.0], [4.0]]
y = [2.0, 4.0, 6.0, 8.0]
base_pred = [0.0] * len(y)
model = TreeBooster(max_depth=2, reg_lambda=1.0)
model.fit(X, y, base_pred)
print(model.predict(X)) # Outputs tree-based corrections
⚠️ This is a pedagogical implementation. Real XGBoost uses histograms, sparsity, and parallelism for speed.
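To see how such trees would stack into a full booster, here is a rough driver on top of `TreeBooster` (the `boost` helper and its `learning_rate` shrinkage are my additions, not part of the class above):

def boost(X, y, n_rounds=10, learning_rate=0.3, **tree_params):
    # Each round fits one tree against the current predictions, then adds a
    # shrunken copy of its output: y_hat^(t) = y_hat^(t-1) + eta * f_t(x)
    preds = [0.0] * len(y)
    trees = []
    for _ in range(n_rounds):
        tree = TreeBooster(**tree_params)
        tree.fit(X, y, preds)
        update = tree.predict(X)
        preds = [p + learning_rate * u for p, u in zip(preds, update)]
        trees.append(tree)
    return trees, preds

trees, preds = boost([[1.0], [2.0], [3.0], [4.0]], [2.0, 4.0, 6.0, 8.0],
                     n_rounds=20, learning_rate=0.3, max_depth=2, reg_lambda=1.0)
print(preds)  # moves toward [2.0, 4.0, 6.0, 8.0] as rounds accumulate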
8. Key Insights
- Second-order optimization enables precise leaf weight estimation.
- Dual regularization (\(\gamma\) and \(\lambda\) ) controls structure and output magnitude.
- Gain-based splitting ensures only meaningful splits are made.
- Leaf grouping allows closed-form optimization of \(w_j\) .
- Hessian-aware pruning improves stability in uncertain regions.
9. Conclusion
XGBoost’s power lies in its principled fusion of:
- Gradient boosting,
- Second-order optimization,
- Structural and weight regularization.
By understanding how instances are grouped into leaves and how leaf weights are optimized, we move beyond treating XGBoost as a black box.
It’s not magic—it’s math, carefully engineered.
References
[1] Chen, T., & Guestrin, C. (2016). XGBoost: A Scalable Tree Boosting System. KDD.
[2] Friedman, J. H. (2001). Greedy Function Approximation: A Gradient Boosting Machine.
[3] Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning.