
Saturday, April 22, 2017

Variational Inference - Part 1

Introduction

Practically intractable inference tasks emerge in many problems; for this reason the computation of the marginal or posterior probabilities has to be tackled in an approximate fashion.
Our objective is to approximate an intractable distribution $p$ with a simpler distribution $q$. The most natural choice for quantifying the dissimilarity between two distributions is the Kullback-Leibler divergence:
\begin{equation}
\mathbb{D}_{KL}(q||p) = \mathbb{E}_{{\bf x} \sim q} \Big( \log \frac{q({\bf x})}{p({\bf x})} \Big)
\end{equation}
We could have chosen the reverse divergence $\mathbb{D}_{KL}(p||q)$, but the expectation w.r.t. $p$ is assumed to be intractable.
We can observe that $\mathbb{D}_{KL}(q||p) \geq 0$ and that it is zero only if the two distributions are identical.
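As a quick numerical illustration of these properties (the distributions below are made up for the example, not taken from the post), the sketch computes both directions of the divergence and the self-divergence for two discrete distributions:

```python
import numpy as np

def kl(q, p):
    """D_KL(q || p) = sum_i q_i log(q_i / p_i) for discrete distributions."""
    q, p = np.asarray(q, dtype=float), np.asarray(p, dtype=float)
    return float(np.sum(q * np.log(q / p)))

p = np.array([0.1, 0.2, 0.7])   # made-up "target" distribution
q = np.array([0.2, 0.3, 0.5])   # made-up approximating distribution

print(kl(q, p))   # >= 0
print(kl(p, q))   # generally different from kl(q, p): the divergence is not symmetric
print(kl(p, p))   # exactly 0 only when the two distributions coincide
```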

Inference as Optimization

Suppose that the probabilistic model we are focusing on is composed of observed variables, globally denoted as ${\bf x}$, and latent variables, globally denoted as ${\bf z}$. Usually we want to compute the posterior distribution:
\begin{equation}
p({\bf z} | {\bf x}) = \frac{p({\bf x}, {\bf z})}{p({\bf x})} = \frac{p({\bf x}, {\bf z})}{\displaystyle \sum_{{\bf z} \in \mathcal{Z}} p({\bf x}, {\bf z})}
\end{equation}
where $\mathcal{Z}$ represents the alphabet containing the possible values of the hidden variables ${\bf z}$.
The presence of the evidence $p({\bf x})$ in the denominator, and the marginalization over ${\bf z}$ it requires, makes the exact computation of the posterior $p({\bf z} | {\bf x})$ very difficult.
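To get a sense of why this marginalization is problematic, consider a hypothetical model with $n$ binary latent variables (an assumption made only for illustration): the sum in the denominator then runs over $|\mathcal{Z}| = 2^n$ joint configurations.

```python
# Illustrative only: the number of joint configurations to sum over grows
# exponentially with the number n of binary latent variables.
for n in (10, 30, 100):
    print(f"n = {n:3d}: |Z| = 2^{n} = {2 ** n}")
```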

In Variational Inference, we seek a distribution $q({\bf z})$ that approximates the exact conditional $p({\bf z} | {\bf x})$. The inference problem is transformed into an optimization problem where we want to minimize the “distance” between the two distributions:
\begin{equation}
q^{*}({\bf z}) = argmin_{q} \Big\{ \mathbb{D}_{KL}\big(q({\bf z}) \,||\, p({\bf z} | {\bf x})\big) \Big\}
\end{equation}
This objective is again not directly computable; in fact, expanding it shows that it depends on the Observed Data Log Likelihood $\log p({\bf x})$:
\begin{equation}
\mathbb{D}_{KL}\big(q({\bf z}) \,||\, p({\bf z} | {\bf x})\big) = \mathbb{E}_{{\bf z} \sim q} \big( \log q({\bf z}) \big) - \mathbb{E}_{{\bf z} \sim q} \big( \log p({\bf x}, {\bf z}) \big) + \log p({\bf x})
\end{equation}
Instead of minimizing the KL divergence, we optimize another function that is linked to the original objective.

The Evidence Lower Bound

The Observed Data Log Likelihood $\log p({\bf x})$ can be rewritten using an arbitrary distribution $q({\bf z})$ over the hidden variables:
\begin{equation}
\log p({\bf x}) = \log \displaystyle \sum_{{\bf z} \in \mathcal{Z}} p({\bf x}, {\bf z}) = \log \displaystyle \sum_{{\bf z} \in \mathcal{Z}} q({\bf z}) \frac{p({\bf x}, {\bf z})}{q({\bf z})} = \log \mathbb{E}_{{\bf z} \sim q} \Big( \frac{p({\bf x}, {\bf z})}{q({\bf z})} \Big)
\end{equation}
Since $\log$ is a concave function, applying Jensen’s Inequality to the Observed Data Log Likelihood we obtain:
\begin{equation}
\log p({\bf x}) = \log \mathbb{E}_{{\bf z} \sim q} \Big( \frac{p({\bf x}, {\bf z})}{q({\bf z})} \Big) \geq \mathbb{E}_{{\bf z} \sim q} \Big( \log \frac{p({\bf x}, {\bf z})}{q({\bf z})} \Big) = \mathcal{L}(q)
\end{equation}
where $\mathcal{L}(q)$ is the Evidence Lower Bound defined as:
\begin{equation}
\mathcal{L}(q) = \mathbb{E}_{{\bf z} \sim q} \Big( \log \frac{p({\bf x}, {\bf z})}{q({\bf z})} \Big) = \mathbb{E}_{{\bf z} \sim q} \big( \log p({\bf x}, {\bf z}) \big) - \mathbb{E}_{{\bf z} \sim q} \big( \log q({\bf z}) \big)
\end{equation}
For any choice of $q$, $\mathcal{L}(q)$ is a lower bound for the Observed Data Log Likelihood.
We can observe that the Evidence Lower Bound is composed of two contributions, an Energy Term and an Entropic Term:
\begin{equation}
\mathcal{L}(q) = \underbrace{\mathbb{E}_{{\bf z} \sim q} \big( \log p({\bf x}, {\bf z}) \big)}_{\text{Energy}} + \underbrace{\mathbb{H}(q)}_{\text{Entropy}}
\end{equation}
where $\mathbb{H}(q) = - \mathbb{E}_{{\bf z} \sim q} \big( \log q({\bf z}) \big)$ is the entropy of $q$.
Difference between Likelihood and Evidence Lower Bound

An important observation is that the difference between the Observed Data Log Likelihood and the Evidence Lower Bound is precisely the KL divergence between the distribution $q({\bf z})$ and the posterior distribution $p({\bf z} | {\bf x})$ (combine the expansion of the KL divergence above with the definition of $\mathcal{L}(q)$):
\begin{equation}
\log p({\bf x}) - \mathcal{L}(q) = \mathbb{D}_{KL}\big(q({\bf z}) \,||\, p({\bf z} | {\bf x})\big)
\end{equation}
When $q({\bf z})$ is a good approximation of $p({\bf z} | {\bf x})$, the lower bound $\mathcal{L}(q)$ is close to $\log p({\bf x})$ and, in particular, when the approximation is perfect ($q({\bf z}) = p({\bf z} | {\bf x})$), $\mathcal{L}(q) = \log p({\bf x})$.

In this way, instead of reducing $\mathbb{D}_{KL}\big(q({\bf z}) \,||\, p({\bf z} | {\bf x})\big)$ directly, we can find the $q$ that maximizes $\mathcal{L}(q)$.

The maximization of $\mathcal{L}(q)$ reduces $\mathbb{D}_{KL}\big(q({\bf z}) \,||\, p({\bf z} | {\bf x})\big)$ and provides, at the same time, an approximation of the posterior and of the log evidence (because the divergence tends to zero).
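A minimal numerical sketch of these relations, assuming a toy model with a single observation and a latent variable $z$ taking three values (the joint probabilities and the choice of $q$ below are made up):

```python
import numpy as np

p_xz = np.array([0.10, 0.25, 0.05])          # p(x, z) for the observed x and z = 0, 1, 2
log_evidence = np.log(p_xz.sum())            # log p(x) = log sum_z p(x, z)
posterior = p_xz / p_xz.sum()                # exact p(z | x)

q = np.array([0.3, 0.5, 0.2])                # an arbitrary approximating q(z)

energy = np.sum(q * np.log(p_xz))            # E_q[ log p(x, z) ]
entropy = -np.sum(q * np.log(q))             # H(q)
elbo = energy + entropy                      # L(q) = Energy + Entropy

gap = np.sum(q * np.log(q / posterior))      # D_KL(q || p(z|x))

print(elbo <= log_evidence)                      # True: L(q) is a lower bound on log p(x)
print(np.isclose(log_evidence - elbo, gap))      # True: the gap is exactly the KL divergence

# With q equal to the exact posterior the bound becomes tight: L(q) = log p(x)
elbo_at_posterior = np.sum(posterior * np.log(p_xz)) - np.sum(posterior * np.log(posterior))
print(np.isclose(elbo_at_posterior, log_evidence))
```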

Now an important point is to find a family of distributions that contains $q$ and that makes the optimization problem simpler.

References

  • K. Murphy, Machine Learning: A Probabilistic Perspective (§21)
  • D. Barber, Bayesian Reasoning and Machine Learning (§28.3)
  • I. Goodfellow et al., Deep Learning (§19.1)
  • D. Blei et al., Variational Inference: A Review for Statisticians

Saturday, March 25, 2017

Maximum Likelihood Estimation and Empirical Distribution

The Maximum Likelihood Estimation (MLE) is a method that finds the set of parameters $\hat{\theta}$ of a parametric model $p_M({\bf x; \theta})$ that makes the data ${\bf X} = \{ {\bf x}^{(t)} \}_{t = 1}^{T}$, drawn from a true but unknown data generating distribution $p_D({\bf x})$, most likely. In other words, we are looking for the parametric model $p_M({\bf x; \hat{\theta}})$ that best describes the data.

The Data Likelihood is $p_M({\bf X} | \theta) = p_M({\bf x}^{(1)} ... {\bf x}^{(T)} | \theta)$ and, since the examples are Independent and Identically Distributed (i.i.d.), it can be rewritten as $\displaystyle \prod_{t=1}^{T} p_M({\bf x}^{(t)} | \theta)$. The Maximum Likelihood Estimation then becomes:
\begin{equation}
\begin{array}{rcl}
\hat{\theta} & = & argmax_{\theta} \bigg\{ p_M({\bf x}^{(1)} ... {\bf x}^{(T)} | \theta) \bigg\} \\
& = & argmax_{\theta} \bigg\{ \displaystyle \prod_{t=1}^{T} p_M({\bf x}^{(t)} | \theta) \bigg\} \\
& = & argmax_{\theta} \bigg\{ \displaystyle \sum_{t=1}^{T} \log p_M({\bf x}^{(t)} | \theta) \bigg\}
\end{array}
\end{equation}
The last equality holds because taking the logarithm of the likelihood does not change its argmax, but it conveniently transforms the product into a sum, reducing underflow problems in the computation.
We can rescale the objective by dividing by $T$, obtaining the final form of the MLE:

\begin{equation}
\hat{\theta} = argmax_{\theta} \bigg\{ \frac{1}{T} \displaystyle \sum_{t=1}^{T} \log p_M({\bf x}^{(t)} | \theta) \bigg\}
\end{equation}
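As a hedged illustration (a Bernoulli model and synthetic data chosen for the example, not anything from the post), a brute-force search over $\theta$ for the maximizer of the average log-likelihood recovers the empirical frequency of ones:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.binomial(1, 0.3, size=1000)          # synthetic samples from an assumed p_D

# average log-likelihood (1/T) sum_t log p_M(x^(t) | theta) for a Bernoulli model
thetas = np.linspace(0.01, 0.99, 99)
avg_loglik = np.array([np.mean(X * np.log(t) + (1 - X) * np.log(1 - t)) for t in thetas])

theta_hat = thetas[np.argmax(avg_loglik)]
print(theta_hat, X.mean())                   # both close to the empirical frequency of ones
```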

Recalling that the Empirical Distribution is the distribution that puts a probability mass of $\frac{1}{T}$ on each of the $T$ points $\{ {\bf x}^{(t)} \}_{t = 1}^{T}$:
\begin{equation}
\hat{p}_D({\bf x}) = \frac{1}{T} \displaystyle \sum_{t=1}^{T} \delta({\bf x - x}^{(t)})
\end{equation}

we can rewrite our estimation problem as:
\begin{equation}
\begin{array}{rcl}
\hat{\theta} & = & argmax_{\theta} \bigg\{ \frac{1}{T} \displaystyle \sum_{t=1}^{T} \log p_M({\bf x}^{(t)} | \theta) \bigg\} \\
& = & argmax_{\theta} \bigg\{ \frac{1}{T} \displaystyle \sum_{t=1}^{T} \int \log p_M({\bf x} | \theta) \, \delta({\bf x - x}^{(t)}) \, d{\bf x} \bigg\} \\
& = & argmax_{\theta} \bigg\{ \mathbb{E}_{{\bf x} \sim \hat{p}_D} \log p_M({\bf x} | \theta) \bigg\}
\end{array}
\end{equation}
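As a small sanity check of this rewriting (with made-up discrete data and a made-up categorical model), the sample average of the log-likelihood coincides with its expectation under the empirical distribution:

```python
import numpy as np

X = np.array([0, 1, 1, 2, 0, 0])                       # illustrative discrete data
p_model = np.array([0.5, 0.2, 0.3])                    # p_M(x | theta) for x = 0, 1, 2

avg_ll = np.mean(np.log(p_model[X]))                   # (1/T) sum_t log p_M(x^(t) | theta)

values, counts = np.unique(X, return_counts=True)
p_hat = counts / counts.sum()                          # empirical distribution
expected_ll = np.sum(p_hat * np.log(p_model[values]))  # E_{x ~ p_hat_D} log p_M(x | theta)

print(np.isclose(avg_ll, expected_ll))                 # True
```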
In the following we will show that MLE is equivalent to minimizing the KL Divergence between the empirical distribution and the model distribution.

The KL Divergence between the empirical distribution $\hat{p}_D({\bf x})$ and the model distribution $p_M({\bf x|\theta})$ is:
\begin{equation}
\begin{array}{rcl}
\mathbb{D}_{KL}(\hat{p}_D||p_M) & = & \displaystyle \mathbb{E}_{{\bf x} \sim \hat{p}_D} \Big (\log \frac{\hat{p}_D({\bf x})}{p_M({\bf x|\theta})} \Big) \\
& = &\displaystyle \mathbb{E}_{{\bf x} \sim \hat{p}_D} \Big (\log \hat{p}_D({\bf x}) \Big ) - \mathbb{E}_{{\bf x} \sim \hat{p}_D({\bf x})} \Big (\log p_M({\bf x|\theta}) \Big)\\
\end{array}
\end{equation}

Since the first term does not depend on $\theta$, when we train the model to minimize the KL divergence we only need to minimize the cross-entropy between the empirical distribution defined by the training set ($\hat{p}_D({\bf x})$) and the model ($p_M({\bf x|\theta})$):
\begin{equation}
\mathbb{H}(\hat{p}_D({\bf x}),p_M({\bf x|\theta})) = - \mathbb{E}_{{\bf x} \sim \hat{p}_D({\bf x})} \Big (\log p_M({\bf x|\theta}) \Big)
\end{equation}
which is the negative of the term we try to maximize in MLE.
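A quick numerical check of this decomposition (with made-up empirical and model distributions over three values): the KL divergence splits into the $\theta$-independent negative entropy of $\hat{p}_D$ plus the cross-entropy.

```python
import numpy as np

p_hat = np.array([0.5, 0.3, 0.2])            # made-up empirical distribution
p_model = np.array([0.4, 0.4, 0.2])          # made-up model distribution for some theta

entropy = -np.sum(p_hat * np.log(p_hat))             # H(p_hat), independent of theta
cross_entropy = -np.sum(p_hat * np.log(p_model))     # H(p_hat, p_M)
kl = np.sum(p_hat * np.log(p_hat / p_model))         # D_KL(p_hat || p_M)

print(np.isclose(kl, cross_entropy - entropy))       # True: only the cross-entropy depends on theta
```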

We can thus see Maximum Likelihood Estimation as an attempt to make the model distribution match the Empirical Distribution $\hat{p}_D({\bf x})$. Ideally we would like to match the true data generating distribution $p_D({\bf x})$, but we do not have direct access to it, and the Empirical Distribution is the best we can do.

Usually, instead of maximizing the Log-Likelihood, we minimize the Negative Log-Likelihood (equivalently, we minimize the Cross-Entropy), leading to the following equation:
\begin{equation}
\hat{\theta} = argmin_{\theta} \bigg\{ - \mathbb{E}_{{\bf x} \sim \hat{p}_D} \log p_M({\bf x} | \theta) \bigg\}
\end{equation}

We can conclude that the empirical distribution is the probability density that maximizes the likelihood of the training data.

Note:
The Dirac delta distribution is only necessary to define the empirical distribution over continuous variables. For discrete variables the empirical distribution is a categorical distribution, with the probability associated to each possible input value simply equal to the empirical frequency of that value in the training set.
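A short sketch of this note with made-up discrete data: the empirical frequencies define a categorical distribution, and by Gibbs' inequality no other categorical distribution assigns the data a higher average log-likelihood.

```python
import numpy as np

X = np.array(['a', 'b', 'a', 'c', 'a', 'b'])           # made-up training set
values, counts = np.unique(X, return_counts=True)
p_hat = counts / counts.sum()                          # empirical frequencies: a -> 1/2, b -> 1/3, c -> 1/6
print(dict(zip(values, p_hat)))

avg_ll_empirical = np.sum(p_hat * np.log(p_hat))       # average log-likelihood under p_hat itself
other = np.array([0.4, 0.4, 0.2])                      # any other categorical distribution
avg_ll_other = np.sum(p_hat * np.log(other))
print(avg_ll_empirical >= avg_ll_other)                # True
```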


This post has been inspired by I. Goodfellow et al., Deep Learning (§5.5).