
Probability vs Likelihood

Probability and likelihood have different meanings, though they are often misinterpreted or used interchangeably. Let's break them down and understand them further.


What is Probability?

Probability is a branch of mathematics and statistics concerned with the numerical description of how likely events are to occur. The probability of an event is a number between 0 and 1 — the closer it is to 1, the more likely the event is to occur.

Example

Consider tossing a fair (unbiased) coin. There are two outcomes: heads and tails. Since the coin is fair:

\[ P(\text{Head}) = 0.5,\quad P(\text{Tail}) = 0.5 \]

Now, let's say you toss the same fair coin 10 times and observe 7 heads and 3 tails. Let's compute the probability of this outcome.

This is a Bernoulli experiment repeated n = 10 times, so the number of heads follows a binomial distribution. The probability of getting k successes (heads) is:

\[ P(k) = \binom{n}{k} p^k (1 - p)^{n-k} \]

Substituting in our values (n = 10, k = 7, p = 0.5):

\[ P(7 \text{ heads}) = \binom{10}{7} \cdot 0.5^7 \cdot 0.5^3 = 0.1172 \]

This is how you compute the probability given that the parameter (p = 0.5) is known.
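This calculation can be sketched in Python using only the standard library (the function name `binomial_pmf` is my own, not from the text):

```python
from math import comb

def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n Bernoulli trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Probability of 7 heads in 10 tosses of a fair coin
prob = binomial_pmf(7, 10, 0.5)
print(round(prob, 4))  # 0.1172
```

Note that the parameter p = 0.5 is given; the data (7 heads) is what varies.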


What if the Coin is Biased?

Now suppose you toss a biased coin 5 times and observe the outcomes:

  • 3 heads
  • 2 tails

But this time, you don’t know the coin’s probability p of landing heads. So instead of computing probability, we compute the likelihood of different values of p given the observed data.

  • Probability: Given a model (parameter), how likely is the observed data?
    Example:
    $$ P(\text{3 Heads} \mid p = 0.5) $$

  • Likelihood: Given the observed data, how likely is a specific value of the parameter?
    Example:
    $$ \mathcal{L}(p \mid \text{3 Heads}) $$

For our biased-coin data (3 heads in 5 tosses), the likelihood is:

\[ \mathcal{L}(p \mid \text{3 Heads}) = \binom{5}{3} \cdot p^3 \cdot (1 - p)^2 \]

We don't know \( p \) here. The value of \( p \) that maximizes this function is called the Maximum Likelihood Estimate (MLE).

In our case, the MLE is:

\[ \hat{p} = 0.6,\quad \mathcal{L}(\hat{p}) = 0.3456 \]
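One way to see this numerically is a simple grid search over candidate values of \( p \) (a sketch; the grid resolution and names are my own choices):

```python
from math import comb

def likelihood(p: float, k: int = 3, n: int = 5) -> float:
    """Binomial likelihood of p given k heads in n tosses."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Evaluate the likelihood on a fine grid of candidate p values
grid = [i / 1000 for i in range(1001)]
p_hat = max(grid, key=likelihood)
print(p_hat, round(likelihood(p_hat), 4))  # 0.6 0.3456
```

Here the data (3 heads in 5 tosses) is fixed, and it is the parameter \( p \) that varies — the reverse of the probability computation.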

Deriving the MLE

We are given the likelihood function (excluding the binomial coefficient since it does not affect the MLE):

\[ \mathcal{L}(p) = p^3 (1 - p)^2 \]

Let:

  • \( u = p^3 \)
  • \( v = (1 - p)^2 \)

Then, by the product rule, with \( \frac{du}{dp} = 3p^2 \) and \( \frac{dv}{dp} = -2(1 - p) \):

\[ \frac{d\mathcal{L}}{dp} = \frac{du}{dp} \cdot v + u \cdot \frac{dv}{dp} = 3p^2(1 - p)^2 - 2p^3(1 - p) \]

Factoring out \( p^2(1 - p) \), this can be written equivalently as:

\[ \frac{d\mathcal{L}}{dp} = p^2(1 - p)(3 - 5p) \]

Find the critical points by setting the derivative to zero:

\[ p^2(1 - p)(3 - 5p) = 0 \]

Solutions:

  • \( p = 0 \)
  • \( p = 1 \)
  • \( p = \frac{3}{5} \)

Select the maximum. Since \( p = 0 \) and \( p = 1 \) yield zero likelihood, the maximum occurs at:

\[ \hat{p} = \frac{3}{5} = 0.6 \]

This is the Maximum Likelihood Estimate (MLE) for the parameter \( p \).
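As a quick sanity check (a sketch, not part of the derivation), a central finite difference confirms that the derivative of \( \mathcal{L}(p) = p^3(1-p)^2 \) vanishes at \( p = 0.6 \) and that the point is indeed a maximum:

```python
def L(p: float) -> float:
    """Likelihood kernel, excluding the constant binomial coefficient."""
    return p**3 * (1 - p) ** 2

h = 1e-6
# Central finite-difference estimate of dL/dp at p = 0.6
slope = (L(0.6 + h) - L(0.6 - h)) / (2 * h)
print(abs(slope) < 1e-6)                    # True: derivative ~ 0 at the MLE
print(L(0.6) > L(0.55), L(0.6) > L(0.65))   # True True: a local maximum
```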


In simpler terms:

  • Probability evaluates data given the parameter
  • Likelihood evaluates parameter given the data

| Feature | Probability | Likelihood |
| --- | --- | --- |
| Viewpoint | Model → Data | Data → Model |
| Expression | \( P(X \mid \theta) \) | \( \mathcal{L}(\theta \mid X) \) |
| Normalization | Integrates to 1 over \( x \) | Not normalized over \( \theta \) |
| Use-case | Predictions, simulations | Parameter inference (e.g. MLE) |

Conclusion

  • Probability: Model → Data.
    “Given θ, how likely is X?”

  • Likelihood: Data → Model.
    “Given observed X, how plausible is θ?”

