
Probability vs Likelihood

Probability and likelihood have different meanings, though they are often misinterpreted or used interchangeably. Let's break them down and understand them further.


What is Probability?

Probability is a branch of mathematics and statistics concerned with the numerical description of how likely events are to occur. The probability of an event is a number between 0 and 1 — the closer it is to 1, the more likely the event is to occur.

Example

Consider tossing a fair (unbiased) coin. There are two outcomes: heads and tails. Since the coin is fair:

\[ P(\text{Head}) = 0.5,\quad P(\text{Tail}) = 0.5 \]

Now, let's say you toss the same fair coin 10 times and observe 7 heads and 3 tails. Let's compute the probability of this outcome.

This is a Bernoulli experiment repeated n = 10 times, so the number of heads follows a binomial distribution. The probability of getting k successes (heads) is:

\[ P(k) = \binom{n}{k} p^k (1 - p)^{n-k} \]

Substituting in our values (n = 10, k = 7, p = 0.5):

\[ P(7 \text{ heads}) = \binom{10}{7} \cdot 0.5^7 \cdot 0.5^3 = 0.1172 \]

This is how you compute the probability given that the parameter (p = 0.5) is known.
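This calculation can be sketched in Python using only the standard library (the function name `binomial_pmf` is my own, not from the text):

```python
from math import comb

def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n Bernoulli trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Probability of 7 heads in 10 tosses of a fair coin
prob = binomial_pmf(7, 10, 0.5)
print(round(prob, 4))  # 0.1172
```

Note that the parameter p = 0.5 is given; the data (7 heads) is what varies.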


What if the Coin is Biased?

Now suppose you toss a biased coin 5 times and observe the outcomes:

  • 3 heads
  • 2 tails

But this time, you don’t know the coin’s probability p of landing heads. So instead of computing probability, we compute the likelihood of different values of p given the observed data.

  • Probability: Given a model (parameter), how likely is the observed data?
    Example:
    $$ P(\text{3 Heads} \mid p = 0.5) $$

  • Likelihood: Given the observed data, how likely is a specific value of the parameter?
    Example:
    $$ \mathcal{L}(p \mid \text{3 Heads}) $$

For our biased-coin data (3 heads in 5 tosses), the likelihood is:

\[ \mathcal{L}(p \mid \text{3 Heads}) = \binom{5}{3} \cdot p^3 \cdot (1 - p)^2 \]

We don't know \( p \) here. The value of \( p \) that maximizes this function is called the Maximum Likelihood Estimate (MLE).

In our case, the MLE is:

\[ \hat{p} = 0.6,\quad \mathcal{L}(\hat{p}) = 0.3456 \]
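One way to see this numerically is a simple grid search over candidate values of \( p \) (a sketch; the grid resolution and names are my own choices):

```python
from math import comb

def likelihood(p: float, k: int = 3, n: int = 5) -> float:
    """Binomial likelihood of p given k heads in n tosses."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Evaluate the likelihood on a fine grid of candidate p values
grid = [i / 1000 for i in range(1001)]
p_hat = max(grid, key=likelihood)
print(p_hat, round(likelihood(p_hat), 4))  # 0.6 0.3456
```

Here the data (3 heads in 5 tosses) is fixed, and it is the parameter \( p \) that varies — the reverse of the probability computation.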

Deriving the MLE

We are given the likelihood function (excluding the binomial coefficient since it does not affect the MLE):

\[ \mathcal{L}(p) = p^3 (1 - p)^2 \]

Let:

  • \( u = p^3 \)
  • \( v = (1 - p)^2 \)

Then, by the product rule, with \( \frac{du}{dp} = 3p^2 \) and \( \frac{dv}{dp} = -2(1 - p) \):

\[ \frac{d\mathcal{L}}{dp} = \frac{du}{dp} \cdot v + u \cdot \frac{dv}{dp} = 3p^2(1 - p)^2 - 2p^3(1 - p) \]

Factoring out \( p^2(1 - p) \), this can be written equivalently as:

\[ \frac{d\mathcal{L}}{dp} = p^2(1 - p)(3 - 5p) \]

Find the critical points by setting the derivative to zero:

\[ p^2(1 - p)(3 - 5p) = 0 \]

Solutions:

  • \( p = 0 \)
  • \( p = 1 \)
  • \( p = \frac{3}{5} \)

Select the maximum. Since \( p = 0 \) and \( p = 1 \) yield zero likelihood, the maximum occurs at:

\[ \hat{p} = \frac{3}{5} = 0.6 \]

This is the Maximum Likelihood Estimate (MLE) for the parameter \( p \).
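As a quick sanity check (a sketch, not part of the derivation), a central finite difference confirms that the derivative of \( \mathcal{L}(p) = p^3(1-p)^2 \) vanishes at \( p = 0.6 \) and that the point is indeed a maximum:

```python
def L(p: float) -> float:
    """Likelihood kernel, excluding the constant binomial coefficient."""
    return p**3 * (1 - p) ** 2

h = 1e-6
# Central finite-difference estimate of dL/dp at p = 0.6
slope = (L(0.6 + h) - L(0.6 - h)) / (2 * h)
print(abs(slope) < 1e-6)                    # True: derivative ~ 0 at the MLE
print(L(0.6) > L(0.55), L(0.6) > L(0.65))   # True True: a local maximum
```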


In simpler terms:

  • Probability evaluates data given the parameter
  • Likelihood evaluates parameter given the data

| Feature | Probability | Likelihood |
| --- | --- | --- |
| Viewpoint | Model → Data | Data → Model |
| Expression | \( P(X \mid \theta) \) | \( \mathcal{L}(\theta \mid X) \) |
| Normalization | Integrates to 1 over \( x \) | Not normalized over \( \theta \) |
| Use-case | Predictions, simulations | Parameter inference (e.g. MLE) |

Conclusion

  • Probability: Model → Data.
    “Given θ, how likely is X?”

  • Likelihood: Data → Model.
    “Given observed X, how plausible is θ?”

