Search This Blog

Saturday, March 25, 2017

Maximum Likelihood Estimation and Empirical Distribution

The Maximum Likelihood Estimation (MLE) is a method that finds the set of parameters $\hat{\theta}$ of a parametric model  $p_M({\bf x; \theta})$ that makes the data ${\bf X} = \{ {\bf x}^{(t)} \}_{t = 1}^{T}$, drawn from a true but unknown data generating distribution $p_D({\bf x})$, most likely. In other words we are looking for a parametric model $p_M({\bf x; \hat{\theta}})$ that can describe better the data.

The Data Likelihood is $p_M({\bf X} | \theta) = p_M({\bf x}^{(1)} ... {\bf x}^{(T)} | \theta)$ and, since the examples are Independent and Identically Distributed (i.i.d.), can be rewritten as $\displaystyle \prod_{t=1}^{T} p_M({\bf x}^{(t)} | \theta)$ and the Maximum Likelihood Estimation becomes:
\begin{equation}
\begin{array}{rcl}
\hat{\theta} & = & argmax_{\theta} \bigg\{ p_M({\bf x}^{(1)} ... {\bf x}^{(T)} | \theta) \bigg\} \\
& = & argmax_{\theta} \bigg\{ \displaystyle \prod_{t=1}^{T} p_M({\bf x}^{(t)} | \theta) \bigg\} \\
& = & argmax_{\theta} \bigg\{ \displaystyle \sum_{t=1}^{T} \log p_M({\bf x}^{(t)} | \theta) \bigg\}
\end{array}
\end{equation}
The last equation is due to observation that taking the logarithm of the likelihood does not change its argmax but does conveniently transform a product into a sum reducing underflow problems in the calculation.
We can rescale the cost function dividing by $T$ obtaining the final form of the MLE:

\begin{equation}
\hat{\theta} = argmax_{\theta} \bigg\{ \frac{1}{T} \displaystyle \sum_{t=1}^{T} \log p_M({\bf x}^{(t)} | \theta) \bigg\}
\end{equation}

Recalling that the Empirical Distribution is a distribution that puts probability mass of $\frac{1}{T}$ on each of the $T$ points $\{ {\bf x}^{(t)} \}_{t = 1}^{T}$:
\begin{equation}
\hat{p}_D({\bf x}) = \frac{1}{T} \displaystyle \sum_{t=1}^{T} \delta({\bf x - x}^{(t)})
\end{equation}

we can rewrite our estimation problem as:
\begin{equation}
\begin{array}{rcl}
\hat{\theta} & = & argmax_{\theta} \bigg\{ \frac{1}{T} \displaystyle \sum_{t=1}^{T} \log p_M({\bf x}^{(t)} | \theta) \bigg\} \\
& = & argmax_{\theta} \bigg\{ \frac{1}{T} \displaystyle \sum_{t=1}^{T} \log p_M({\bf x} | \theta) \delta({\bf x - x}^{(t)}) \bigg\} \\
& = & argmax_{\theta} \bigg\{ \mathbb{E}_{{\bf x} \sim \hat{p}_D} \log p_M({\bf x} | \theta) \bigg\}
\end{array}
\end{equation}
In the following we will demonstrate that MLE is the same that minimizing the KL Divergence between the empirical distribution and the model distribution.

The KL Divergence between the empirical distribution $\hat{p}_D({\bf x})$ and the model distribution $p_M({\bf x|\theta})$ is:
\begin{equation}
\begin{array}{rcl}
\mathbb{D}_{KL}(\hat{p}_D||p_M) & = & \displaystyle \mathbb{E}_{{\bf x} \sim \hat{p}_D} \Big (\log \frac{\hat{p}_D({\bf x})}{p_M({\bf x|\theta})} \Big) \\
& = &\displaystyle \mathbb{E}_{{\bf x} \sim \hat{p}_D} \Big (\log \hat{p}_D({\bf x}) \Big ) - \mathbb{E}_{{\bf x} \sim \hat{p}_D({\bf x})} \Big (\log p_M({\bf x|\theta}) \Big)\\
\end{array}
\end{equation}

Since the first term is not dependent from $\theta$, when we train the model to minimize KL divergence, we need only to minimize the cross-entropy between the empirical distribution defined by the training set ($\hat{p}_D({\bf x})$) and the model ($p_M({\bf x|\theta})$):
\begin{equation}
\mathbb{H}(\hat{p}_D({\bf x}),p_M({\bf x|\theta})) = - \mathbb{E}_{{\bf x} \sim \hat{p}_D({\bf x})} \Big (\log p_M({\bf x|\theta}) \Big)
\end{equation}
which is, the negative of the term we try to maximize in MLE.

We can thus see Maximum Likelihood Estimation as an attempt to make the model distribution match the Empirical Distribution $\hat{p}_D({\bf x})$. Ideally, we would like to match the true data generating distribution $p_D({\bf x})$ but we don have a direct access to this distribution and the Empirical Distribution is the best we can have.

Usually, instead of Maximizing the Log-Likelihood, we Minimize the Negative Log-Likelihood (equivalently Minimize the Cross-Entropy) leading to the following equation:
\begin{equation}
\begin{array}{rcl}
\hat{\theta} = argmin_{\theta} \bigg\{ - \mathbb{E}_{{\bf x} \sim \hat{p}_D} \log p_M({\bf x} | \theta) \bigg\} \\
\end{array}
\end{equation}

We can conclude that the empirical distribution is the probability density that maximizes the likelihood of the training data.

Note:
The Dirac delta distribution is only necessary to define the empirical distribution over continuous variables. For discrete variables an empirical distribution is categorical distribution, with a probability associated to each possible input value that is simply equal to the empirical frequency of that value in the training set.


This post has been inspired by I. Goodfellow et al. Deep Learning (5.5)

No comments:

Post a Comment