From the Partition Function ("God Equation") to Linear Regression, GLMs, and SVMs

We will explore a conceptual roadmap showing how one can connect the “god equation” of statistical mechanics (the partition function) to various common machine learning models (linear regression, generalized linear models, and SVMs). We will also highlight how linear regression and SVM can both be viewed as special cases that emerge from different choices of “energy” or “loss” functions within the partition function framework.

In recent years, these parallels have sparked significant interest in cross-pollination between physics and machine learning: techniques like variational inference, Markov chain Monte Carlo, and energy-based models reflect how the partition function framework can inform learning algorithms and vice versa.

By leveraging insights from statistical mechanics (such as free energy minimization and phase transitions), researchers have developed better theoretical understanding and new computational methods in AI. This post explores how this partition function underpins familiar machine learning models, showing a unifying perspective on linear regression, generalized linear models, and SVMs.


1. The Partition Function (“God Equation”)

The partition function \(Z\) is a central object in statistical mechanics that encapsulates how the probability of a system’s state depends on its energy and temperature. Historically, it dates back to the foundations of statistical physics in the 19th century and to the canonical ensemble theory developed by Ludwig Boltzmann and Josiah Willard Gibbs. In physics, the partition function is used to derive thermodynamic quantities (e.g., free energy, entropy, average energy) by summing or integrating over all possible states of the system, each weighted by an exponential factor of its energy.

In statistical mechanics, the partition function \(Z(\beta)\) for a system of states \(x\) (which might be discrete or continuous) is given by:

\[ Z(\beta) \;=\; \sum_x e^{-\beta\,E(x)} \quad\text{(discrete case)} \quad \text{or} \quad Z(\beta) \;=\; \int e^{-\beta\,E(x)}\,dx \quad\text{(continuous case)}, \]

where:

- \(E(x)\) is the energy of state \(x\).
- \(\beta = 1/(k_B T)\) is the inverse temperature (in natural units one often sets \(k_B = 1\)).

From \(Z(\beta)\), one defines the Boltzmann distribution:

\[ p(x) \;=\; \frac{1}{Z(\beta)}\, e^{-\beta\,E(x)}. \]

In a Bayesian or machine-learning context, this can be seen as a posterior distribution over parameters \(w\), if you identify \(\beta\,E(w)\) with the negative log posterior (loss + regularizer). Minimizing the free energy \(-\frac{1}{\beta}\ln Z(\beta)\) then becomes closely related to finding a maximum a posteriori (MAP) or maximum likelihood solution, depending on how you set up \(E(\cdot)\).
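To make this concrete, here is a minimal numerical sketch (Python, with a toy set of made-up energy levels) of how \(Z(\beta)\) and the Boltzmann probabilities behave as \(\beta\) grows:

```python
import numpy as np

# Toy discrete system with hypothetical energy levels (arbitrary units).
energies = np.array([0.0, 0.5, 1.0, 2.0])

def boltzmann(energies, beta):
    """Return Z(beta) and the probabilities p(x) = exp(-beta*E(x)) / Z."""
    weights = np.exp(-beta * energies)
    Z = weights.sum()
    return Z, weights / Z

for beta in (0.1, 1.0, 10.0):
    Z, p = boltzmann(energies, beta)
    print(f"beta={beta:5.1f}  Z={Z:8.4f}  p={np.round(p, 4)}")

# As beta grows (temperature drops), the probability mass concentrates
# on the lowest-energy state.
```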


2. Linking Statistical Mechanics to Machine Learning

In machine learning, especially in Bayesian approaches, there is a deep connection to the partition function concept: one can view the posterior distribution over model parameters through a similar lens, by interpreting the negative log-likelihood (plus regularization) as an “energy.” The exponential of the negative energy then becomes something akin to a Boltzmann factor, and the integral over all parameters can be regarded as a “partition function”.

Indeed, this connection highlights a fascinating parallel: just as physicists study how macroscopic observables emerge from microscopic states, machine learning practitioners examine how global predictions or decisions arise from underlying parameters trained on data.

For ML, we often want to find a parameter vector \(\mathbf{w}\) (for instance, the weights of a linear model). The common Bayesian viewpoint is:

1. Assign a prior \(p(\mathbf{w})\).
2. Write down a likelihood \(p(\mathbf{y} \mid \mathbf{x}, \mathbf{w})\) for the observed data \(\{(\mathbf{x}_i, y_i)\}_{i=1}^N\).
3. Obtain the posterior \(p(\mathbf{w} \mid \{\mathbf{x}_i, y_i\}) \propto p(\mathbf{y} \mid \mathbf{x}, \mathbf{w}) \,p(\mathbf{w})\).

Taking a negative log of this posterior leads to an energy function:

\[ E(\mathbf{w}) \;=\; -\log p(\mathbf{y} \mid \mathbf{x}, \mathbf{w}) \;-\log p(\mathbf{w}), \]

which is effectively a loss function plus a regularizer. Setting \(\beta = 1\) (or absorbing \(\beta\) into the overall scale of the energy), we identify

\[ \beta\,E(\mathbf{w}) \;=\; -\log p(\mathbf{y} \mid \mathbf{x}, \mathbf{w}) \;-\log p(\mathbf{w}), \]

and the partition function of \(\mathbf{w}\) under this energy is

\[ Z(\beta) \;=\; \int e^{-\beta E(\mathbf{w})} d\mathbf{w}. \]
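As an illustration of this correspondence, the following sketch (synthetic 1D data and \(\beta = 1\); all names and values are made up for the example) treats the negative log posterior of a one-parameter linear model as an energy, approximates \(Z(\beta)\) on a grid, and reads off the MAP estimate as the minimizer of \(E(w)\):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 1D regression data (made up for illustration).
x = rng.normal(size=50)
y = 2.0 * x + rng.normal(scale=0.5, size=50)

sigma2, tau2, beta = 0.25, 1.0, 1.0   # noise variance, prior variance, inverse temperature

def energy(w):
    """E(w) = -log likelihood - log prior, up to additive constants."""
    nll = np.sum((y - w * x) ** 2) / (2 * sigma2)
    neg_log_prior = w ** 2 / (2 * tau2)
    return nll + neg_log_prior

# Approximate Z(beta) = ∫ exp(-beta*E(w)) dw by a Riemann sum on a grid.
ws = np.linspace(-1.0, 5.0, 2001)
E = np.array([energy(w) for w in ws])
Z = np.sum(np.exp(-beta * (E - E.min()))) * (ws[1] - ws[0]) * np.exp(-beta * E.min())

w_map = ws[np.argmin(E)]   # mode of the Boltzmann density = MAP estimate
print("approximate Z:", Z)
print("MAP estimate of w:", round(w_map, 3))
```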


3. Deriving Various Models by Choosing Different Energies (Losses)

3.1 Linear Regression (with Gaussian Noise)

Model assumption:

\[ y_i \;=\; \mathbf{w}^\top \mathbf{x}_i \;+\; \epsilon_i,\quad \epsilon_i \sim \mathcal{N}(0,\sigma^2). \]

- Likelihood: The probability of observing \(y_i\) given \(\mathbf{x}_i\) and \(\mathbf{w}\) is: \[ p(y_i \mid \mathbf{x}_i,\mathbf{w}) \;=\; \frac{1}{\sqrt{2\pi}\sigma}\exp\!\Bigl(-\frac{(y_i - \mathbf{w}^\top\mathbf{x}_i)^2}{2\sigma^2}\Bigr). \]

- Negative log-likelihood: Summing over \(N\) data points, we get \[ -\log p(\mathbf{y}\mid\mathbf{X},\mathbf{w}) \;=\; \sum_{i=1}^N \frac{(y_i - \mathbf{w}^\top\mathbf{x}_i)^2}{2\sigma^2} \;+\; \text{const}, \] which is proportional to the sum of squared errors.

If we include a Gaussian prior on \(\mathbf{w}\) (say with variance \(\tau^2\)), then \[ E(\mathbf{w}) \;=\; \underbrace{\sum_{i=1}^N (y_i - \mathbf{w}^\top\mathbf{x}_i)^2}_\text{Squared error} \;+\; \lambda \|\mathbf{w}\|^2 \] up to constants and scale factors, with \(\lambda = \sigma^2/\tau^2\). Minimizing this energy yields Ridge Regression (L2 regularization).

Hence, from the viewpoint of a prior \[ p(\mathbf{w}) \;\propto\; \exp\bigl(-\beta\lambda \|\mathbf{w}\|^2\bigr), \] the partition function is (in principle) an integral over all \(\mathbf{w}\). However, the MAP or minimal-energy solution is precisely linear regression with L2 regularization.
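As a sketch (synthetic data and hypothetical parameter values), the minimal-energy solution can be computed in closed form, which is exactly the ridge estimator:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data (made up): y = X @ w_true + Gaussian noise.
N, d = 200, 3
X = rng.normal(size=(N, d))
w_true = np.array([1.5, -2.0, 0.5])
y = X @ w_true + rng.normal(scale=0.3, size=N)

lam = 1.0  # plays the role of lambda = sigma^2 / tau^2

# Minimizing E(w) = ||y - Xw||^2 + lam * ||w||^2 gives the closed form
# w* = (X^T X + lam * I)^{-1} X^T y  (the MAP / ridge estimate).
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
print("ridge (MAP) solution:", np.round(w_ridge, 3))
```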


3.2 Generalized Linear Models (GLMs)

A generalized linear model replaces the Gaussian likelihood assumption with an exponential-family distribution. The standard GLM has the form

\[ \mathbb{E}[y_i \mid \mathbf{x}_i,\mathbf{w}] = g^{-1}(\mathbf{w}^\top \mathbf{x}_i), \]
where \(g\) is the link function and \(g^{-1}\) its inverse (the mean function); for example, the logit link gives logistic regression and the log link gives Poisson regression.

- For logistic regression, one has \[ p(y_i \mid \mathbf{x}_i,\mathbf{w}) = \sigma(\mathbf{w}^\top \mathbf{x}_i)^{\,y_i}\, \bigl[1 - \sigma(\mathbf{w}^\top \mathbf{x}_i)\bigr]^{\,1 - y_i}, \] where \(\sigma(z) = 1/[1 + e^{-z}]\). Its negative log-likelihood is the cross-entropy (logistic loss).

- For Poisson regression, one has \[ p(y_i \mid \mathbf{x}_i,\mathbf{w}) = \frac{e^{-\lambda_i}\,\lambda_i^{\,y_i}}{y_i!}, \quad \lambda_i = \exp(\mathbf{w}^\top \mathbf{x}_i). \]

Again, each choice of likelihood leads to a specific form of energy (negative log-likelihood + prior), and from there the partition function can be written in principle as \[ Z(\beta) = \int e^{-\beta E(\mathbf{w})}\,d\mathbf{w}. \] Minimizing \(E(\mathbf{w})\) recovers the usual GLM with possibly an L2 regularizer (or other priors).
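For instance, a minimal sketch of logistic regression seen this way (synthetic data; plain gradient descent on the energy, here the average cross-entropy plus an L2 term) might look like:

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic binary data (made up) with labels y in {0, 1}.
N, d = 300, 2
X = rng.normal(size=(N, d))
w_true = np.array([2.0, -1.0])
y = (rng.uniform(size=N) < 1.0 / (1.0 + np.exp(-(X @ w_true)))).astype(float)

lam = 0.1  # L2 strength, i.e. a Gaussian prior on w

def energy_grad(w):
    """Gradient of E(w) = mean cross-entropy + lam * ||w||^2."""
    p = 1.0 / (1.0 + np.exp(-(X @ w)))       # sigmoid(Xw)
    return X.T @ (p - y) / N + 2.0 * lam * w

# Plain gradient descent on the energy (MAP estimation).
w = np.zeros(d)
for _ in range(5000):
    w -= 0.5 * energy_grad(w)
print("logistic-regression (MAP) weights:", np.round(w, 3))
```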


3.3 Support Vector Machine

An SVM (for linear classification) typically arises by directly defining the so-called hinge loss. For binary labels \(y_i \in \{-1,+1\}\):

\[ \ell(\mathbf{w};(\mathbf{x}_i,y_i)) \;=\; \max\bigl(0,\;1 - y_i\,(\mathbf{w}^\top \mathbf{x}_i)\bigr). \]

Summing over \(i\), plus adding a regularizer \(\frac{1}{2}\|\mathbf{w}\|^2\), defines the energy function:

\[ E(\mathbf{w}) \;=\; C \sum_{i=1}^N \max(0,\; 1 - y_i\,(\mathbf{w}^\top \mathbf{x}_i)) \;+\; \frac{1}{2}\|\mathbf{w}\|^2, \] where \(C\) is a hyperparameter balancing data misfit vs. margin. In principle, one might try to incorporate this into a Boltzmann-like form

\[ p(\mathbf{w}) = \frac{1}{Z(\beta)}\exp\bigl[-\beta\,E(\mathbf{w})\bigr], \]

but typically the SVM is obtained by directly minimizing the hinge-loss-based functional rather than from a strict probabilistic likelihood model. Still, the same structure (loss + regularizer) can be viewed as an energy function.
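As a rough sketch of this energy-minimization view (synthetic data and a crude primal subgradient descent, rather than the dual QP solver a production SVM would use):

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic, roughly linearly separable data with labels in {-1, +1} (made up).
N, d = 300, 2
X = rng.normal(size=(N, d))
y = np.sign(X @ np.array([1.0, -1.0]) + 0.1 * rng.normal(size=N))

C = 1.0  # trade-off between hinge loss and the margin term

def subgradient(w):
    """Subgradient of E(w) = C * sum_i max(0, 1 - y_i * w.x_i) + 0.5 * ||w||^2."""
    margins = y * (X @ w)
    active = margins < 1.0                    # points violating the margin
    return -C * (y[active, None] * X[active]).sum(axis=0) + w

# Crude subgradient descent on the primal objective.
w = np.zeros(d)
for _ in range(5000):
    w -= 1e-3 * subgradient(w)
print("linear-SVM weights (primal subgradient descent):", np.round(w, 3))
```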


3.4 SVM with Non-Linear Kernels

So far, we've discussed linear SVMs, which use a simple dot product \(\mathbf{w}^\top \mathbf{x}_i\). However, one of the key strengths of SVMs is the ability to lift the data into higher-dimensional (possibly infinite-dimensional) feature spaces via kernels. This allows SVMs to capture non-linear decision boundaries while keeping the same hinge-loss formulation, now applied in the feature space.

The Kernel Trick

A kernel \(k(\mathbf{x}, \mathbf{x}')\) acts like an inner product in some feature space \(\phi(\mathbf{x})\). That is, we replace \(\mathbf{w}^\top \mathbf{x}\) with:

\[ \mathbf{w}^\top \phi(\mathbf{x}) \;\;\; \longleftrightarrow \;\;\; \sum_{j=1}^N \alpha_j\, y_j \, k(\mathbf{x}_j, \mathbf{x}). \]

- \(\phi(\mathbf{x})\) is an (often high-dimensional) mapping.
- \(k(\mathbf{x}, \mathbf{x}') = \langle \phi(\mathbf{x}), \phi(\mathbf{x}') \rangle\) is the kernel function.

The SVM formulation using the hinge loss remains:

\[ \ell(\mathbf{w};(\mathbf{x}_i,y_i)) \;=\; \max\bigl(0,\;1 - y_i\,(\mathbf{w}^\top \phi(\mathbf{x}_i))\bigr), \]
but we never explicitly compute \(\phi(\mathbf{x})\). Instead, we rely on evaluations of the kernel \(k(\mathbf{x}_i, \mathbf{x}_j)\).

Example: Gaussian (RBF) Kernel

A popular kernel is the Gaussian Radial Basis Function (RBF):

\[ k(\mathbf{x}, \mathbf{x}') \;=\; \exp\!\Bigl(-\tfrac{\|\mathbf{x} - \mathbf{x}'\|^2}{2\sigma^2}\Bigr). \]

1. High-Dimensional Feature Space: The mapping \(\phi(\cdot)\) corresponding to this kernel is infinite-dimensional, but the kernel trick lets us handle it implicitly.
2. Hinge Loss with RBF Kernel: The energy (or objective) function in the primal still looks like

\[ E(\mathbf{w}) \;=\; C \sum_{i=1}^N \max\bigl(0,\;1 - y_i\,(\mathbf{w}^\top \phi(\mathbf{x}_i))\bigr) \;+\; \frac{1}{2}\|\mathbf{w}\|^2, \]

but in the dual formulation, you see that \(\mathbf{w}^\top \phi(\mathbf{x}_i)\) is replaced by

\[ \sum_{j=1}^N \alpha_j\, y_j \, k(\mathbf{x}_j, \mathbf{x}_i), \]

where \(\alpha_j\) are the dual variables; a short numerical check of this correspondence follows.
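Here is a small check (assuming scikit-learn is available; data and parameters are made up) that reconstructs a trained RBF-kernel SVM's decision function from its dual variables and support vectors:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(4)

# Made-up 2D data with a non-linear (circular) decision boundary, labels in {-1, +1}.
X = rng.normal(size=(200, 2))
y = np.sign(X[:, 0] ** 2 + X[:, 1] ** 2 - 1.0)

gamma = 0.5
clf = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X, y)

# Reconstruct the decision function from the dual variables:
# f(x) = sum_j (alpha_j * y_j) * k(x_j, x) + b, summed over the support vectors.
K = rbf_kernel(clf.support_vectors_, X, gamma=gamma)         # k(x_j, x)
f_manual = clf.dual_coef_ @ K + clf.intercept_                # shape (1, N)
f_sklearn = clf.decision_function(X)

print("max |manual - sklearn|:", np.abs(f_manual.ravel() - f_sklearn).max())
```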

In fact, many references use the terms “Gaussian kernel” and “RBF kernel” interchangeably. However, you may see slightly different parameterizations:

1. Gaussian/RBF Kernel (common in many textbooks): \[ k(\mathbf{x}, \mathbf{x}') \;=\; \exp\!\Bigl(-\tfrac{\|\mathbf{x} - \mathbf{x}'\|^2}{2\sigma^2}\Bigr). \]

2. RBF Kernel (common in implementations): \[ k(\mathbf{x}, \mathbf{x}') \;=\; \exp\!\bigl(-\gamma\,\|\mathbf{x} - \mathbf{x}'\|^2\bigr), \]
where the relationship between \(\gamma\) and \(\sigma^2\) is typically \(\gamma = 1/(2\sigma^2)\); the brief sketch below verifies that the two forms coincide.
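A tiny sketch (made-up vectors) confirming that the two parameterizations describe the same kernel when \(\gamma = 1/(2\sigma^2)\):

```python
import numpy as np

rng = np.random.default_rng(5)
x, xp = rng.normal(size=3), rng.normal(size=3)

sigma = 0.8
gamma = 1.0 / (2.0 * sigma ** 2)

sq_dist = np.sum((x - xp) ** 2)
k_sigma = np.exp(-sq_dist / (2.0 * sigma ** 2))   # textbook parameterization
k_gamma = np.exp(-gamma * sq_dist)                # implementation parameterization

print(k_sigma, k_gamma)   # identical values: the two forms are the same kernel
```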

Why is the RBF Kernel used?
- Universality: The RBF kernel is a “universal” kernel, capable of approximating any continuous decision function arbitrarily well on a compact domain, given sufficient data and proper tuning of \(\sigma\) or \(\gamma\).
- Smoothness: The exponential decay enforces smooth decision boundaries, which often helps in practice.
- Default Choice: Many standard SVM implementations default to the RBF kernel because it typically performs well across a wide range of problems, without requiring the user to hand-design features or transformations.

In either parameterization, the bandwidth must be tuned: a poorly chosen \(\sigma\) (or \(\gamma\)) yields decision boundaries that are too rigid or too broad/narrow for the data. The \(\gamma\) form is simply the convention adopted by most modern implementations. Ultimately, both forms represent the same family of kernels, differing only in notation and hyperparameter conventions.

Partition Function Perspective for Kernel SVM

From a partition function standpoint, if we wanted to view kernel SVMs in an energy-based manner, we could write:

\[ E(\mathbf{w}) \;=\; C \sum_{i=1}^N \max\bigl(0,\;1 - y_i\,(\mathbf{w}^\top \phi(\mathbf{x}_i))\bigr) \;+\; \frac{1}{2}\|\mathbf{w}\|^2, \]

and define

\[ p(\mathbf{w}) \;=\; \frac{1}{Z(\beta)}\,\exp\bigl[-\beta\,E(\mathbf{w})\bigr]. \]

While in practice we solve the dual optimization problem for kernel SVMs rather than integrating over \(\mathbf{w}\), the conceptual link remains the same. One can still imagine the “Boltzmann factor” over the parameter space induced by \(\phi(\mathbf{x})\). Ultimately, the kernel does not change the hinge-loss form; it only modifies (potentially in a very high-dimensional space) how \(\mathbf{w}\) interacts with the data.

Thus, non-linear SVMs are still part of the same “energy + partition function” framework, but they rely on a kernel function to express the interactions in a transformed feature space.


4. Relating the Models

4.1. Relation Between LR and SVM

From this statistical-mechanical (or Bayesian) viewpoint, both linear regression and SVM can be seen as minimizers of

\[ E(\mathbf{w}) \;=\; \sum_{i=1}^N \ell\bigl(y_i, \mathbf{w}^\top \mathbf{x}_i\bigr) \;+\; \lambda\,R(\mathbf{w}), \]

where:

- For linear regression (least squares), \[ \ell_\text{LS}\bigl(y_i, \mathbf{w}^\top\mathbf{x}_i\bigr) = (y_i - \mathbf{w}^\top\mathbf{x}_i)^2 \] and \(R(\mathbf{w}) = \|\mathbf{w}\|^2\).

- For SVM, \[ \ell_\text{hinge}\bigl(y_i, \mathbf{w}^\top\mathbf{x}_i\bigr) = \max\bigl(0,\;1 - y_i\,(\mathbf{w}^\top\mathbf{x}_i)\bigr) \] and \(R(\mathbf{w}) = \|\mathbf{w}\|^2\).

Hence the main difference is the shape of the loss term:

  • Linear regression uses a quadratic loss (squared errors).
  • SVM uses a hinge loss that forces data to be on the correct side of the margin (or pays a linear penalty if not); the two losses are compared numerically in the sketch below.
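A tiny numerical comparison of the two losses as a function of the margin \(m = y\,(\mathbf{w}^\top\mathbf{x})\) (the squared loss is evaluated here as \((1-m)^2\) purely to contrast its shape with the hinge):

```python
import numpy as np

# Compare the losses as a function of the margin m = y * (w.x).
margins = np.linspace(-2.0, 3.0, 11)
hinge = np.maximum(0.0, 1.0 - margins)        # SVM hinge loss
squared = (1.0 - margins) ** 2                # squared loss, written in margin form

for m, h, s in zip(margins, hinge, squared):
    print(f"margin={m:5.2f}   hinge={h:5.2f}   squared={s:6.2f}")

# The hinge is exactly zero once the margin exceeds 1, while the squared loss
# keeps penalizing points even when they are classified with a large margin.
```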

From a partition function perspective, one might imagine:

\[ Z(\beta) = \int \exp\!\bigl[ -\beta\,\bigl(\textstyle\sum_{i=1}^N \ell(y_i,\,\mathbf{w}^\top\mathbf{x}_i) \;+\; \lambda \|\mathbf{w}\|^2\bigr)\bigr] \, d\mathbf{w}. \]

- As \(\beta \to \infty\) (“low temperature”), the measure concentrates on the global minimum of \(E(\mathbf{w})\), reproducing the usual minimization problem; a small numerical sketch of this limit appears after this list.

- Different choices of \(\ell\) lead to different machine learning models, but the structure (an energy function whose Boltzmann factor is integrated over \(\mathbf{w}\)) is fundamentally the same.
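Here is that low-temperature limit in numbers (a made-up 1D least-squares energy, with the Boltzmann density evaluated on a grid):

```python
import numpy as np

rng = np.random.default_rng(6)

# Made-up 1D data and a quadratic energy: E(w) = sum_i (y_i - w*x_i)^2 + lam*w^2.
x = rng.normal(size=40)
y = 1.7 * x + rng.normal(scale=0.4, size=40)
lam = 1.0

ws = np.linspace(0.0, 3.5, 2001)
E = np.array([np.sum((y - w * x) ** 2) + lam * w ** 2 for w in ws])
w_star = ws[np.argmin(E)]   # the usual regularized least-squares minimizer

for beta in (0.01, 0.1, 1.0, 10.0):
    p = np.exp(-beta * (E - E.min()))
    p /= p.sum() * (ws[1] - ws[0])                          # normalize on the grid
    mean_w = np.sum(ws * p) * (ws[1] - ws[0])
    std_w = np.sqrt(np.sum((ws - mean_w) ** 2 * p) * (ws[1] - ws[0]))
    print(f"beta={beta:6.2f}  E[w]={mean_w:.3f}  std={std_w:.4f}  argmin={w_star:.3f}")

# As beta -> infinity, the Boltzmann measure concentrates on argmin E(w).
```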

Thus, from a unifying viewpoint:

  1. Start with a Boltzmann distribution (partition function).
  2. Define the “energy” to be your loss + regularizer.
  3. Identify the MAP (or zero-temperature) limit with standard machine learning algorithms.

4.2. Relation Between Nonlinear SVMs (RBF Kernel) and GLMs (Gaussian Kernel)

When we introduce the RBF (Gaussian) kernel into an SVM, we are effectively mapping inputs \(\mathbf{x}\) into a (potentially infinite) high-dimensional feature space via
\[ k(\mathbf{x}, \mathbf{x}') \;=\; \exp\!\Bigl(-\tfrac{\|\mathbf{x} - \mathbf{x}'\|^2}{2\sigma^2}\Bigr). \]

The SVM then minimizes the hinge loss in that feature space (plus a norm penalty on \(\mathbf{w}\)):

\[ E(\mathbf{w}) \;=\; C \sum_{i=1}^N \max\!\Bigl(0,\,1 - y_i\bigl(\mathbf{w}^\top \phi(\mathbf{x}_i)\bigr)\Bigr) \;+\; \frac12\|\mathbf{w}\|^2, \]
where \(\phi(\mathbf{x})\) is the high-dimensional embedding implied by the kernel.

From a partition function perspective, we can again define a Boltzmann-like distribution
\[ p(\mathbf{w}) \;=\; \frac{1}{Z(\beta)}\,\exp\bigl[-\beta\,E(\mathbf{w})\bigr], \]
where \(Z(\beta)\) integrates over all parameters \(\mathbf{w}\) in the (possibly infinite-dimensional) feature space. As \(\beta \to \infty\), the measure concentrates on the minimal-energy solution, recovering the usual SVM solution with the RBF kernel.

In generalized linear models, the energy (negative log-posterior) is driven by a negative log-likelihood (depending on the exponential-family form) plus a prior on parameters. If we also incorporate a Gaussian kernel, in the style of a kernelized logistic regression or kernel ridge regression, we effectively replace the simple linear predictor \(\mathbf{w}^\top \mathbf{x}\) with a kernel expansion: \[ f(\mathbf{x}) \;=\; \sum_{j=1}^N \alpha_j \, k(\mathbf{x}_j, \mathbf{x}). \]

- In kernel ridge regression (a GLM under Gaussian noise and a Gaussian prior), one ends up minimizing a sum of squared errors plus a norm in the RKHS (Reproducing Kernel Hilbert Space) implied by \(k(\cdot,\cdot)\); a minimal sketch appears after this list.
- In kernel logistic regression, one ends up minimizing the cross-entropy loss plus a similar norm penalty.
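As an example on the GLM side, here is a minimal kernel ridge regression sketch (synthetic 1D data, an explicitly computed RBF kernel matrix, hypothetical parameter values) that solves for the coefficients \(\alpha\) in closed form:

```python
import numpy as np

rng = np.random.default_rng(7)

# Made-up 1D regression data with a non-linear target.
X = rng.uniform(-3, 3, size=(100, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=100)

gamma, lam = 0.5, 0.1

def rbf(A, B):
    """RBF kernel matrix k(a, b) = exp(-gamma * ||a - b||^2)."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

# Kernel ridge regression: minimize ||y - K alpha||^2 + lam * alpha^T K alpha,
# whose closed-form solution is alpha = (K + lam*I)^{-1} y.
K = rbf(X, X)
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)

X_test = np.linspace(-3, 3, 7).reshape(-1, 1)
f_test = rbf(X_test, X) @ alpha                  # f(x) = sum_j alpha_j k(x_j, x)
print(np.round(f_test, 3))
print(np.round(np.sin(X_test[:, 0]), 3))         # compare to the noiseless target
```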

Just like the kernel SVM, a kernel GLM can also be interpreted via a partition function:

\[ Z(\beta) \;=\; \int \exp\!\Bigl(-\beta\,\underbrace{\bigl[\text{likelihood-loss} + \text{regularizer}\bigr]}_{\text{energy}} \Bigr) \,d\theta, \]
where \(\theta\) might be the vector of coefficients \(\{\alpha_j\}\) in the kernel expansion, or equivalently \(\mathbf{w}\) in some feature space.

Key Differences: Hinge Loss vs. Likelihood-Based Loss

The core structural difference remains the choice of loss:
- SVM (RBF Kernel) uses the hinge loss, which enforces margin-based classification.
- GLM (Gaussian Kernel) uses a negative log-likelihood (e.g., cross-entropy for logistic, squared error for Gaussian response), tied to an explicit probabilistic interpretation.

Both, however, rely on:
1. A high-dimensional embedding of \(\mathbf{x}\) (implicit via \(k(\mathbf{x},\mathbf{x}')\)).
2. A norm penalty in that feature space (regularizer).
3. Minimization of an energy that can be expressed in a Boltzmann-like framework.


5. Summary of the Connections

  1. Partition Function: \(Z(\beta) = \int e^{-\beta E(\mathbf{w})}\,d\mathbf{w}.\)
    This “god equation” from statistical mechanics can be interpreted in machine learning by relating the **energy** \(E(\mathbf{w})\) to a **loss function + regularizer**.
  2. Energy \(\rightarrow\) Loss + Regularizer: \[ E(\mathbf{w}) = \sum_{i=1}^N \ell\bigl(y_i, f_\mathbf{w}(\mathbf{x}_i)\bigr) + \lambda\, R(\mathbf{w}). \]
    In other words, \(\ell(\cdot)\) is your chosen loss function, and \(R(\mathbf{w})\) is a regularization term.
  3. Different Losses \(\rightarrow\) Different Models:
    - Squared error \(\to\) Linear Regression.
    - Logistic (cross-entropy) \(\to\) Logistic Regression (a GLM).
    - Hinge loss \(\to\) SVM.
  4. Relation of Linear Regression and SVM:
    - They differ only by the form of the loss function (squared loss vs. hinge loss).
    - Both can be recast (if one wishes) into a Boltzmann-like setup via \(e^{-\beta E(\mathbf{w})}\).
    - At “low temperature” (\(\beta\to\infty\)), finding the most probable parameters under the Boltzmann distribution is the same as **minimizing** their respective energy function.
  5. Incorporating Kernels:
    - For Kernel SVMs, the hinge loss is the same, but \(\mathbf{w}^\top \mathbf{x}\) is replaced by \(\mathbf{w}^\top \phi(\mathbf{x})\), where \(\phi\) is implied by a kernel \(k(\mathbf{x}, \mathbf{x}')\).
    - Similarly, for Kernel GLMs (e.g., Kernel Ridge Regression, Kernel Logistic Regression), one replaces the linear predictor \(\mathbf{w}^\top \mathbf{x}\) with \(\sum_j \alpha_j\,k(\mathbf{x}_j, \mathbf{x})\).
    - Both setups still conform to the same energy-based (loss + regularizer) perspective.

Hence, starting from the same statistical-mechanical partition function and simply changing the definition of “energy” (i.e., negative log-likelihood plus regularizer, or hinge loss plus regularizer) yields linear regression, generalized linear models, and SVMs (including kernelized versions) as special cases within a unifying theoretical framework.


Takeaway:
- Linear Regression and SVM can be viewed on the same footing via the idea of an energy-based model.
- Both have the same structural form (minimizing loss + regularizer), but differ in which loss function (and sometimes which regularizer) is used.
- All these models (LR, GLMs, SVM, and their kernel variants) can be traced to a statistical-mechanical partition function, where one interprets the Boltzmann factor as \[ \exp\bigl(-\beta \times \bigl[\text{loss} + \text{regularizer}\bigr]\bigr). \]
- In the large \(\beta\) (low-temperature) limit, the distribution collapses onto the global minimum of the energy, recovering standard machine learning via empirical risk minimization.

References & Further Reading

Below are a few resources and papers that explore deeper connections between physics, energy-based formulations, and modern machine learning:

A particularly interesting speculative perspective comes from the work of:

  • Vanchurin, V. (2020). The World as a Neural Network.
  • Alemi, A. A., & Fischer, I. (2018). TherML: Thermodynamics of Machine Learning.
  • Li, S., Du, Y., van de Ven, G. M., & Mordatch, I. (2020). Energy-Based Models for Continual Learning.
  • LeCun, Y., Chopra, S., Hadsell, R., Ranzato, M., & Huang, F. (2006). A Tutorial on Energy-Based Learning.
