Introduction
Practically intractable inference tasks arise in many problems, so the computation of marginal or posterior probabilities has to be tackled in an approximate fashion. Our objective is to approximate an intractable distribution $p$ using a simpler distribution $q$. The most obvious choice for quantifying the dissimilarity between two distributions is the Kullback-Leibler divergence:

$$\mathrm{KL}\big(q \,\|\, p\big) = \sum_{x} q(x)\,\log \frac{q(x)}{p(x)} = \mathbb{E}_{q}\!\left[\log \frac{q(x)}{p(x)}\right]$$
We could have chosen $\mathrm{KL}(p \,\|\, q)$ instead, but the expectation with respect to $p$ is assumed to be intractable.
We can observe that $\mathrm{KL}(q \,\|\, p) \geq 0$, and that it is zero only if the two distributions are identical.
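As a minimal numerical sketch of these two properties, the snippet below computes $\mathrm{KL}(q \,\|\, p)$ for two small discrete distributions (the vectors `q` and `p` are arbitrary illustrative choices, not taken from any particular model):

```python
import numpy as np

def kl_divergence(q, p):
    """KL(q || p) for two discrete distributions given as probability vectors."""
    q, p = np.asarray(q, dtype=float), np.asarray(p, dtype=float)
    # Terms with q = 0 contribute 0 by the convention 0 * log 0 = 0.
    mask = q > 0
    return float(np.sum(q[mask] * np.log(q[mask] / p[mask])))

q = np.array([0.5, 0.3, 0.2])
p = np.array([0.4, 0.4, 0.2])

print(kl_divergence(q, p))   # small positive number: KL(q || p) >= 0
print(kl_divergence(q, q))   # 0.0: the divergence vanishes only when q == p
print(kl_divergence(q, p) == kl_divergence(p, q))  # False: KL is not symmetric
```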
Inference as Optimization
Suppose that the probabilistic model we are focusing on is composed of observed variables, globally denoted as $x$, and latent variables, globally denoted as $z$. Usually we want to compute the posterior distribution:

$$p(z \mid x) = \frac{p(x, z)}{p(x)} = \frac{p(x, z)}{\sum_{z' \in \mathcal{Z}} p(x, z')}$$

where $\mathcal{Z}$ represents the alphabet containing the possible values of the hidden variables $z$.
The presence of the evidence $p(x)$ in the denominator, and the necessary marginalization over $\mathcal{Z}$, make the exact computation of $p(z \mid x)$ very difficult: with $n$ binary hidden variables, for example, the sum already runs over $2^n$ configurations.
In Variational Inference, we seek a distribution $q(z)$ that approximates the exact conditional $p(z \mid x)$. The inference problem is transformed into an optimization problem in which we want to minimize the “distance” between the two distributions:

$$q^{*}(z) = \arg\min_{q \in \mathcal{Q}} \mathrm{KL}\big(q(z) \,\|\, p(z \mid x)\big)$$
The objective is again not computable, since it depends on the Observed Data Log Likelihood $\log p(x)$:

$$\mathrm{KL}\big(q(z) \,\|\, p(z \mid x)\big) = \mathbb{E}_{q}\big[\log q(z)\big] - \mathbb{E}_{q}\big[\log p(x, z)\big] + \log p(x)$$
Instead of minimizing the KL divergence, we optimize another function that is linked to the original objective.
The Evidence Lower Bound
The Observed Data Log Likelihood can be rewritten using an arbitrary distribution $q(z)$ over the hidden variables:

$$\log p(x) = \log \sum_{z} p(x, z) = \log \sum_{z} q(z)\,\frac{p(x, z)}{q(z)}$$

Since $\log$ is a concave function, applying Jensen’s Inequality to the Observed Data Log Likelihood we obtain:

$$\log p(x) \geq \sum_{z} q(z) \log \frac{p(x, z)}{q(z)} = \mathcal{L}(q)$$
where $\mathcal{L}(q)$ is the Evidence Lower Bound, defined as:

$$\mathcal{L}(q) = \mathbb{E}_{q}\big[\log p(x, z)\big] - \mathbb{E}_{q}\big[\log q(z)\big]$$
For any choice of $q$, $\mathcal{L}(q)$ is a lower bound for the Observed Data Log Likelihood $\log p(x)$.
We can observe that the Evidence Lower Bound is composed of two contributions, an Energy Term and an Entropy Term:

$$\mathcal{L}(q) = \underbrace{\mathbb{E}_{q}\big[\log p(x, z)\big]}_{\text{Energy Term}} + \underbrace{H(q)}_{\text{Entropy Term}}, \qquad H(q) = -\mathbb{E}_{q}\big[\log q(z)\big]$$
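To make the bound concrete, here is a minimal sketch under an assumed toy model (a conjugate Gaussian pair, $z \sim \mathcal{N}(0,1)$, $x \mid z \sim \mathcal{N}(z,1)$, with an arbitrary observation $x = 1.5$), where the exact log evidence is available in closed form and a Monte Carlo estimate of $\mathcal{L}(q)$ can be compared against it:

```python
import numpy as np

rng = np.random.default_rng(0)

def log_normal_pdf(x, mean, var):
    """Log density of a univariate Gaussian."""
    return -0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

# Toy model (chosen only for illustration): z ~ N(0, 1), x | z ~ N(z, 1).
# For this conjugate pair the evidence and the posterior are known exactly.
x_obs = 1.5
log_evidence = log_normal_pdf(x_obs, 0.0, 2.0)     # p(x) = N(0, 2)
post_mean, post_var = x_obs / 2.0, 0.5             # p(z | x) = N(x/2, 1/2)

def elbo(q_mean, q_var, n_samples=100_000):
    """Monte Carlo estimate of E_q[log p(x, z)] - E_q[log q(z)]."""
    z = rng.normal(q_mean, np.sqrt(q_var), size=n_samples)
    log_joint = log_normal_pdf(z, 0.0, 1.0) + log_normal_pdf(x_obs, z, 1.0)
    log_q = log_normal_pdf(z, q_mean, q_var)
    return np.mean(log_joint - log_q)

print("log p(x)            :", log_evidence)
print("ELBO, q = posterior :", elbo(post_mean, post_var))  # ~ log p(x)
print("ELBO, q = N(0, 1)   :", elbo(0.0, 1.0))             # strictly smaller
```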
Difference between Likelihood and Evidence Lower Bound
An important observation is that the difference between the Observed Data Log Likelihood and the Evidence Lower Bound is precisely the KL-divergence between the distribution $q(z)$ and the posterior distribution $p(z \mid x)$ (the short derivation below spells this out):

$$\log p(x) - \mathcal{L}(q) = \mathrm{KL}\big(q(z) \,\|\, p(z \mid x)\big)$$

When $q(z)$ is a good approximation of $p(z \mid x)$, the lower bound $\mathcal{L}(q)$ is close to $\log p(x)$ and, in particular, when the approximation is perfect ($q(z) = p(z \mid x)$), $\mathcal{L}(q) = \log p(x)$.
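For completeness, the identity can be verified by expanding $\mathcal{L}(q)$ and using $\log p(z \mid x) = \log p(x, z) - \log p(x)$:

$$
\begin{aligned}
\log p(x) - \mathcal{L}(q)
&= \log p(x) - \mathbb{E}_{q}\big[\log p(x, z)\big] + \mathbb{E}_{q}\big[\log q(z)\big] \\
&= \mathbb{E}_{q}\big[\log q(z) - \big(\log p(x, z) - \log p(x)\big)\big] \\
&= \mathbb{E}_{q}\!\left[\log \frac{q(z)}{p(z \mid x)}\right]
= \mathrm{KL}\big(q(z) \,\|\, p(z \mid x)\big).
\end{aligned}
$$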
In this way, instead of reducing $\mathrm{KL}\big(q(z) \,\|\, p(z \mid x)\big)$ directly, we can find the $q(z)$ that maximizes $\mathcal{L}(q)$.
The maximization of $\mathcal{L}(q)$ reduces $\mathrm{KL}\big(q(z) \,\|\, p(z \mid x)\big)$ and provides, at the same time, an approximation of the posterior $p(z \mid x)$ and of the log evidence $\log p(x)$ (because the KL term tends to zero).
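As an illustration of this maximization, the sketch below restricts $q(z)$ to a Gaussian family $\mathcal{N}(m, v)$ in the same assumed toy model as above (still an illustrative choice, not part of the original text) and maximizes the closed-form ELBO by a coarse grid search; the maximizer recovers the exact posterior parameters $m = x/2$, $v = 1/2$, and the maximum matches $\log p(x)$:

```python
import numpy as np

x_obs = 1.5  # same illustrative observation as above

def elbo_gaussian_q(m, v):
    """Closed-form ELBO for the toy model z ~ N(0,1), x | z ~ N(z,1)
    with a Gaussian variational family q(z) = N(m, v)."""
    expected_log_lik   = -0.5 * np.log(2 * np.pi) - 0.5 * ((x_obs - m) ** 2 + v)
    expected_log_prior = -0.5 * np.log(2 * np.pi) - 0.5 * (m ** 2 + v)
    entropy            =  0.5 * np.log(2 * np.pi * np.e * v)
    return expected_log_lik + expected_log_prior + entropy

# Coarse grid search over the variational parameters (m, v).
ms = np.linspace(-2.0, 2.0, 401)
vs = np.linspace(0.05, 2.0, 391)
M, V = np.meshgrid(ms, vs)
values = elbo_gaussian_q(M, V)
i, j = np.unravel_index(np.argmax(values), values.shape)

print("best m:", M[i, j], " best v:", V[i, j])  # approx x/2 = 0.75 and 1/2
print("max ELBO:", values[i, j])                # approx log p(x)
```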
Now an important point is to find a family of distributions $\mathcal{Q}$ that contains $q(z)$ and that makes the optimization problem simpler.