Maximum likelihood estimation

Looking for good μs? The most likely answer is the MLE!

In many real-world applications, a parametric distribution with known density f(x; θ), where θ denotes the parameter(s), is a good model for the variable of interest x. Examples:

If we know the density function f, this is equivalent to knowing the likelihood at any point x₀. If we have multiple independent observations xᵢ, then the probability of observing all of them is the product of their individual probabilities (because they're independent). This product is called the likelihood function, L(θ), so for each of the examples above the likelihood is the product of the chosen density evaluated at every observation.
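In symbols, for n independent observations x₁, …, xₙ and whichever density f we have chosen, this is

$$ L(\theta) = \prod_{i=1}^{n} f(x_i; \theta). $$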

The big idea of maximum likelihood estimation is to consider the likelihood as a function of θ, and to answer the question "which value of my parameter was most likely to result in the data I can see?" by maximising L(θ).
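In symbols, the maximum likelihood estimate is the value of θ at which the likelihood is largest:

$$ \hat{\theta} = \arg\max_{\theta} L(\theta). $$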

The MLE is intuitively sensible, but it also has a bewildering array of useful properties, which get better and better as the sample size gets larger - if you're interested in them, try this earworm (©2007 Larry Lesser) for a fun summary:

Maximising the likelihood function

In practice this will be implemented (as the default setting!) by your favourite statistics program 😎 However, it is worth calling out that, although the algebra looks scary at first, the maximisation is often relatively straightforward starting from the known density f.
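As a taste of what "the default setting" can look like, here is a minimal sketch in Python, assuming a Normal model fitted with SciPy; the data and random seed are simulated purely for illustration:

```python
# Minimal MLE sketch, assuming a Normal model and the SciPy library.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
x = rng.normal(loc=5.0, scale=2.0, size=1000)  # simulated data, for illustration only

# stats.norm.fit returns the maximum likelihood estimates of the Normal
# parameters: the sample mean and the "divide by n" standard deviation.
mu_hat, sigma_hat = stats.norm.fit(x)
print(mu_hat, sigma_hat)  # roughly 5 and 2
```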

A couple of noteworthy points:

  1. Normally we work with the log-likelihood, log(L(θ)), because the likelihood of each individual xᵢ is typically a very small number, and summing their logarithms rather than multiplying the raw values helps avoid issues with numerical precision. Note that a (positive) function g and log(g) have their maximum at the same point. It's also easy to see how taking logs simplifies the calculation: the log turns the product ∏ᵢ f(xᵢ; θ) into the sum ∑ᵢ log f(xᵢ; θ), and differentiating a sum is usually algebraically easier than differentiating a product.
  2. Maximising g is the same as minimising −g, so either implementation is fine; similarly, multiplying a function by a positive constant doesn't change our estimate θ̂: g and 2g have their maximum at the same point. Both points are put to work in the sketch after this list.
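To illustrate both points, here is a sketch of doing the maximisation "by hand" for the same hypothetical Normal model as above: we minimise the negative log-likelihood with SciPy's general-purpose optimiser instead of maximising the likelihood directly (the parametrisation and starting values are just illustrative choices):

```python
# Sketch: maximise the log-likelihood by minimising its negative.
import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(42)
x = rng.normal(loc=5.0, scale=2.0, size=1000)  # simulated data, for illustration only

def neg_log_lik(theta):
    """Negative log-likelihood of a Normal model; summing log-densities
    avoids multiplying many tiny numbers (point 1 above)."""
    mu, log_sigma = theta                      # log-parametrise sigma so it stays positive
    return -np.sum(stats.norm.logpdf(x, loc=mu, scale=np.exp(log_sigma)))

# Minimising -log L(θ) is the same as maximising L(θ) (point 2 above).
res = optimize.minimize(neg_log_lik, x0=[0.0, 0.0])
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])
print(mu_hat, sigma_hat)  # should match stats.norm.fit(x) closely
```

Working on the log scale and flipping the sign leaves the location of the optimum unchanged, which is exactly why off-the-shelf optimisers (which minimise by convention) can be used directly.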