Maximum likelihood estimation

Looking for good μs? The most likely answer is the MLE!

In many real-world applications, a parametric distribution with known density f(x; θ), where θ denotes the parameter(s), is a good model for the variable of interest x. Examples:

If we know the density function f, this is equivalent to knowing the likelihood at any point x₀. If we have multiple independent observations xᵢ, then the probability of observing all of them is the product of their individual probabilities (because they're independent). This product is called the likelihood function, L(θ), so for each of the examples above the likelihood is the product of the chosen density evaluated at every observation.
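In symbols, for n independent observations x₁, …, xₙ and whichever density f we have chosen, this is

$$ L(\theta) = \prod_{i=1}^{n} f(x_i; \theta). $$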

The big idea of maximum likelihood estimation is to consider the likelihood as a function of θ, and to answer the question "which value of my parameter was most likely to result in the data I can see?" by maximising L(θ).
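In symbols, the maximum likelihood estimate is the value of θ at which the likelihood is largest:

$$ \hat{\theta} = \arg\max_{\theta} L(\theta). $$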

The MLE is intuitively sensible, but it also has a bewildering array of useful properties, which get better and better as the sample size gets larger - if you're interested in them, try this earworm (©2007 Larry Lesser) for a fun summary:

Maximising the likelihood function

In practice this will be implemented (as the default setting!) by your favourite statistics program 😎 However, it is worth calling out that, although the algebra looks scary at first, the maximisation is often relatively straightforward starting from the known density f.
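As a taste of what "the default setting" can look like, here is a minimal sketch in Python, assuming a Normal model fitted with SciPy; the data and random seed are simulated purely for illustration:

```python
# Minimal MLE sketch, assuming a Normal model and the SciPy library.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
x = rng.normal(loc=5.0, scale=2.0, size=1000)  # simulated data, for illustration only

# stats.norm.fit returns the maximum likelihood estimates of the Normal
# parameters: the sample mean and the "divide by n" standard deviation.
mu_hat, sigma_hat = stats.norm.fit(x)
print(mu_hat, sigma_hat)  # roughly 5 and 2
```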

A couple of noteworthy points:

  1. Normally we work with the log-likelihood, log(L(θ)), because the likelihood of each individual xᵢ is typically a very small number, and summing their logarithms rather than multiplying the raw values helps avoid issues with numerical precision. Note that a (positive) function g and log(g) have their maximum at the same point. It's also easy to see how taking logs simplifies the calculation: the log turns the product ∏ᵢ f(xᵢ; θ) into the sum ∑ᵢ log f(xᵢ; θ), and differentiating a sum is usually algebraically easier than differentiating a product.
  2. Maximising g is the same as minimising −g, so either implementation is fine; similarly, multiplying a function by a positive constant doesn't change our estimate θ̂: g and 2g have their maximum at the same point. Both points are put to work in the sketch after this list.
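To illustrate both points, here is a sketch of doing the maximisation "by hand" for the same hypothetical Normal model as above: we minimise the negative log-likelihood with SciPy's general-purpose optimiser instead of maximising the likelihood directly (the parametrisation and starting values are just illustrative choices):

```python
# Sketch: maximise the log-likelihood by minimising its negative.
import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(42)
x = rng.normal(loc=5.0, scale=2.0, size=1000)  # simulated data, for illustration only

def neg_log_lik(theta):
    """Negative log-likelihood of a Normal model; summing log-densities
    avoids multiplying many tiny numbers (point 1 above)."""
    mu, log_sigma = theta                      # log-parametrise sigma so it stays positive
    return -np.sum(stats.norm.logpdf(x, loc=mu, scale=np.exp(log_sigma)))

# Minimising -log L(θ) is the same as maximising L(θ) (point 2 above).
res = optimize.minimize(neg_log_lik, x0=[0.0, 0.0])
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])
print(mu_hat, sigma_hat)  # should match stats.norm.fit(x) closely
```

Working on the log scale and flipping the sign leaves the location of the optimum unchanged, which is exactly why off-the-shelf optimisers (which minimise by convention) can be used directly.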