Maximum likelihood estimation

Looking for good \(\mu\)s? The most likely answer is the MLE!

In many real-world applications, a parametric distribution with known density \(f(x;\theta)\), where \(\theta\) denotes the parameter(s), is a good model for the variable of interest, \(x\). Classic examples include measurements modelled as normal, counts of events as Poisson, and waiting times as exponential.

If we know the density function \(f\), this is equivalent to knowing the likelihood at any point \(x_0\). If we have multiple independent observations \(\{x_i\}\), then the probability of observing all of them is the product of their individual probabilities (because they’re independent). This product is called the likelihood function, \(\mathcal{L}(\theta)\), so for \(n\) independent observations from any of the examples above we would have \(\mathcal{L}(\theta) = \prod_{i=1}^{n} f(x_i;\theta)\), with \(f\) the relevant density.
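For instance, here is a minimal sketch in Python, assuming (purely for illustration) a normal model with unknown mean \(\mu\) and known \(\sigma = 1\); the data values are made up:

```python
import numpy as np
from scipy.stats import norm

# Illustrative data: a handful of independent observations {x_i}
x = np.array([4.2, 5.1, 4.8, 5.5, 4.9])

def likelihood(mu, sigma=1.0):
    # The likelihood is the product of the individual densities f(x_i; theta)
    return np.prod(norm.pdf(x, loc=mu, scale=sigma))

print(likelihood(4.0), likelihood(5.0))  # L(theta) at two candidate values of mu
```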

The big idea of maximum likelihood estimation is to consider the likelihood as a function of \(\theta\), and to answer the question “which value of my parameter was most likely to result in the data I can see?” by maximising \(\mathcal{L}(\theta)\).
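To make that concrete, a crude but minimal sketch (reusing the illustrative normal model above) simply evaluates \(\mathcal{L}(\mu)\) over a grid of candidate values and picks the maximiser:

```python
import numpy as np
from scipy.stats import norm

x = np.array([4.2, 5.1, 4.8, 5.5, 4.9])   # illustrative sample
mu_grid = np.linspace(3.0, 7.0, 401)      # candidate values of the parameter mu

# Evaluate the likelihood at each candidate and keep the most likely one
lik = np.array([norm.pdf(x, loc=mu, scale=1.0).prod() for mu in mu_grid])
mu_hat = mu_grid[lik.argmax()]
print(mu_hat)  # approximately 4.9 (= x.mean()), the analytic MLE for a normal mean
```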

The MLE is intuitively sensible, but it also has a bewildering array of useful properties (consistency, asymptotic normality and asymptotic efficiency, among others), which get better and better as the sample size gets larger. If you’re interested in them, try this earworm ©2007 Larry Lesser for a fun summary:

Maximising the likelihood function

In practice this will be implemented (as the default setting!) by your favourite statistics program 😎. However, it is worth calling out that although the algebra looks scary at first, it’s often relatively straightforward starting from the known density \(f\).
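As an illustration (sticking with a normal model with known \(\sigma\), purely as an example), take logs and differentiate with respect to \(\mu\):

\[
\log\mathcal{L}(\mu) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i-\mu)^2
\]

\[
\frac{\partial}{\partial\mu}\log\mathcal{L}(\mu) = \frac{1}{\sigma^2}\sum_{i=1}^{n}(x_i-\mu) = 0 \quad\Longrightarrow\quad \hat{\mu} = \frac{1}{n}\sum_{i=1}^{n}x_i = \bar{x}
\]

That is, the maximum likelihood estimate of a normal mean is just the sample mean.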

A couple of noteworthy points:

  1. Normally we work with the log likelihood, \(\log(\mathcal{L}(\theta))\), because each individual likelihood \(f(x_i;\theta)\) is typically a very small number, and summing their logs rather than multiplying many small numbers avoids numerical underflow. Note that because \(\log\) is strictly increasing, a function \(g\) and \(\log(g)\) have their maximum at the same point. It’s also easy to see how taking logs simplifies the calculation: differentiating expressions containing \(\sum{}\) is usually algebraically easier than differentiating those containing \(\prod{}\).
  2. Maximising \(g\) is the same as minimising \(-g\), so either implementation is fine; similarly, multiplying a function by a positive constant doesn’t affect our estimate \(\hat{\theta}\): \(g\) and \(2g\) have their maximum at the same point. Both ideas appear in the numerical sketch after this list, which minimises the negative log-likelihood.
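Here is a minimal sketch of how such a numerical fit might look, again assuming the illustrative normal model (the data and starting values are made up); it hands the negative log-likelihood to a general-purpose optimiser:

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

x = np.array([4.2, 5.1, 4.8, 5.5, 4.9])  # illustrative sample

def neg_log_likelihood(theta):
    mu, log_sigma = theta                 # optimise log(sigma) so sigma stays positive
    sigma = np.exp(log_sigma)
    # Minimising -log L(theta) is equivalent to maximising L(theta)
    return -np.sum(norm.logpdf(x, loc=mu, scale=sigma))

result = minimize(neg_log_likelihood, x0=np.array([0.0, 0.0]))
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])
print(mu_hat, sigma_hat)  # mu_hat ≈ x.mean(); sigma_hat ≈ x.std() (the 1/n version)
```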