Looking for good $\hat{\theta}$s? The most likely answer is the MLE!
In many real-world applications, a parametric distribution with known density $f(x; \theta)$, where $f$ has parameter(s) $\theta$, is a good model for the variable of interest, $X$. Examples:
- If $X$ represents heights, then we might have $X \sim N(\mu, \sigma^2)$ with the vector $\theta = (\mu, \sigma^2)$.
- If $X$ represents a count, a Poisson distribution $X \sim \mathrm{Poisson}(\lambda)$ is often sensible; here $\theta = \lambda$.
If we know the density function $f(x; \theta)$, this is equivalent to knowing the likelihood at any point $x$. If we have multiple independent observations $x_1, \dots, x_n$ then the probability of all of these is the product of their individual probabilities (because they’re independent). This product is called the likelihood function, $L(\theta) = \prod_{i=1}^{n} f(x_i; \theta)$, so in the examples above we would have
- Normal: $L(\mu, \sigma^2) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right)$
- Poisson: $L(\lambda) = \prod_{i=1}^{n} \frac{\lambda^{x_i} e^{-\lambda}}{x_i!}$
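To make the product concrete, here is a minimal sketch (assuming Python with NumPy/SciPy and some made-up count data; the names `x` and `likelihood` are purely illustrative) that evaluates the Poisson likelihood at two candidate values of $\lambda$:

```python
import numpy as np
from scipy.stats import poisson

x = np.array([2, 4, 3, 5, 1, 3])            # hypothetical observed counts

def likelihood(lam, data):
    """Product of the individual Poisson probabilities f(x_i; lambda)."""
    return np.prod(poisson.pmf(data, mu=lam))

print(likelihood(2.0, x))                   # L(2)
print(likelihood(3.0, x))                   # L(3) is larger: lambda = 3 explains these counts better
```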
The big idea of maximum likelihood estimation is to consider the likelihood as a function of $\theta$, and answer the question “which value of my parameter was most likely to result in the data I can see?” by maximising $L(\theta)$.
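A crude way to act on this idea, continuing the hypothetical Poisson counts above, is simply to scan a grid of candidate $\lambda$ values and keep the one with the largest likelihood. This is a sketch of the principle, not how real software does it:

```python
import numpy as np
from scipy.stats import poisson

x = np.array([2, 4, 3, 5, 1, 3])                       # same hypothetical counts as above
lams = np.linspace(0.5, 8.0, 1000)                     # grid of candidate values for lambda
L = np.array([np.prod(poisson.pmf(x, mu=lam)) for lam in lams])
lam_hat = lams[np.argmax(L)]                           # the grid point maximising L(lambda)
print(lam_hat, x.mean())                               # both are (very nearly) 3.0
```

For the Poisson distribution the maximiser turns out to be the sample mean, which is exactly what the grid search recovers.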
The MLE is intuitively sensible, but also has a bewildering array of useful properties, which get better and better as the sample size gets larger - if you’re interested in them try this earworm ©2007 Larry Lesser for a fun summary:
Maximising the likelihood function
In practice this will be implemented (as the default setting!) by your favourite statistics program 😎. However, it’s worth calling out that although the algebra looks scary at first, it’s often relatively straightforward starting from the known density $f(x; \theta)$.
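If your favourite program happens to be Python, for example, SciPy’s distribution objects do this for you: `norm.fit` returns the maximum likelihood estimates of the mean and standard deviation by default. A minimal sketch with simulated (made-up) heights:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
heights = rng.normal(loc=170, scale=10, size=500)      # simulated heights in cm

mu_hat, sigma_hat = norm.fit(heights)                  # MLE of mu and sigma, by default
print(mu_hat, sigma_hat)                               # close to the true values 170 and 10
```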
A couple of noteworthy points:
- Normally we work with the log-likelihood, $\ell(\theta) = \log L(\theta)$, because each individual likelihood $f(x_i; \theta)$ is typically a very small number, and summing their logarithms rather than multiplying them helps avoid issues with numerical precision. Note that a function $g(\theta)$ and its logarithm $\log g(\theta)$ have their maximum at the same point, since $\log$ is strictly increasing. It’s also easy to see how taking logs simplifies calculation: differentiating an expression involving a sum $\sum$ is often algebraically easier than one involving a product $\prod$.
- Maximising $\ell(\theta)$ is the same as minimising $-\ell(\theta)$, so either implementation is fine, and similarly multiplying a function by a positive constant doesn’t change our estimate $\hat{\theta}$: $L(\theta)$ and $c\,L(\theta)$ (for $c > 0$) have their maximum at the same point.
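For the Poisson example the algebra really is short: $\ell(\lambda) = \sum_{i=1}^{n}\left(x_i \log\lambda - \lambda - \log x_i!\right)$, so setting $\ell'(\lambda) = \sum_i x_i/\lambda - n = 0$ gives $\hat{\lambda} = \bar{x}$. The sketch below (again assuming Python/SciPy and the same made-up counts as above) instead hands the negative log-likelihood to a numerical optimiser, illustrating the second point:

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import poisson

x = np.array([2, 4, 3, 5, 1, 3])                       # same hypothetical counts as above

def neg_log_lik(lam):
    """Negative log-likelihood: a sum of logs instead of a product of tiny numbers."""
    return -np.sum(poisson.logpmf(x, mu=lam))

result = minimize_scalar(neg_log_lik, bounds=(1e-6, 20), method="bounded")
print(result.x, x.mean())                              # the minimiser is the MLE, again about 3.0
```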