What is probability?

Probability quantifies how likely an event is to occur. It is the mathematical language of uncertainty, and understanding it is essential for statistics, data science, medicine, finance, and almost any field where decisions are made under incomplete information.

Definition

The probability of an event \(A\) is a number between 0 and 1:

  • \(P(A) = 0\): the event is impossible.
  • \(P(A) = 1\): the event is certain.
  • \(0 < P(A) < 1\): the event may or may not occur.

For a sample space with equally likely outcomes, the classical definition is:

\[P(A) = \frac{\text{number of outcomes favorable to } A}{\text{total number of possible outcomes}}\]

This formula only applies when all outcomes are equally likely. For other situations, see the empirical and subjective definitions below.

Types of probability

Classical probability

Based on equally likely outcomes, derived from the structure of the experiment rather than observation.

Classical probability: drawing from a standard deck

A standard deck has 52 cards. The probability of drawing a heart:

\[P(\text{heart}) = \frac{13}{52} = 0.25\]

The probability of drawing an ace:

\[P(\text{ace}) = \frac{4}{52} \approx 0.077\]

No experiment needed: the probabilities follow from the known structure of the deck.
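The same counting can be sketched in Python. This is a minimal illustration, assuming the deck is modelled as rank/suit pairs; the helper name `classical_probability` is hypothetical and is only valid when every outcome is equally likely.

```python
from fractions import Fraction
from itertools import product

RANKS = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
SUITS = ["hearts", "diamonds", "clubs", "spades"]
DECK = list(product(RANKS, SUITS))  # 52 equally likely (rank, suit) outcomes


def classical_probability(event, sample_space):
    """Favorable outcomes divided by total outcomes (equally likely case only)."""
    favorable = sum(1 for outcome in sample_space if event(outcome))
    return Fraction(favorable, len(sample_space))


p_heart = classical_probability(lambda card: card[1] == "hearts", DECK)
p_ace = classical_probability(lambda card: card[0] == "A", DECK)

print(p_heart, float(p_heart))        # 1/4 0.25
print(p_ace, round(float(p_ace), 3))  # 1/13 0.077
```

Using `Fraction` keeps the results exact, so the reduced forms 1/4 and 1/13 are visible alongside the decimals.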

Empirical probability

Based on observed data rather than theoretical assumptions. The probability of an event is estimated as the relative frequency with which it has occurred.

\[P(A) \approx \frac{\text{number of times } A \text{ occurred}}{\text{total number of trials}}\]

As the number of independent trials run under the same conditions increases, the empirical probability converges to the true probability (Law of Large Numbers).

Empirical probability: defective batches in a factory

A factory records 2,400 production runs. In 72 of them, a defective batch is produced.

\[P(\text{defective batch}) \approx \frac{72}{2400} = 0.03\]

This 3% estimate is based purely on historical data, with no theoretical model assumed.
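A small simulation can illustrate how a relative-frequency estimate settles down as the number of trials grows. This is a sketch under a stated assumption: the simulation pretends the true defect rate is exactly 0.03, which the factory itself would not know. The function name `empirical_probability` is hypothetical.

```python
import random


def empirical_probability(true_p, n_trials, rng):
    """Relative frequency of a 'defective' result in n_trials simulated runs."""
    defects = sum(1 for _ in range(n_trials) if rng.random() < true_p)
    return defects / n_trials


# The estimate from the recorded data in the example
observed_estimate = 72 / 2400

# ASSUMPTION for the simulation only: the true defect rate is 0.03
rng = random.Random(42)  # fixed seed so repeated runs give the same numbers
estimates = {
    n: empirical_probability(0.03, n, rng) for n in (100, 2_400, 100_000)
}

print(f"observed: {observed_estimate:.4f}")
for n, estimate in estimates.items():
    print(f"{n:>7} simulated runs: estimate = {estimate:.4f}")
```

Small samples can land noticeably away from 0.03, while the estimate from 100,000 runs should sit very close to it.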

Subjective probability

A personal degree of belief, not derived from symmetry or data. Different people may assign different probabilities to the same event based on their knowledge and experience. This is the foundation of Bayesian reasoning.

Subjective probability: a surgeon’s estimate

An experienced surgeon estimates a 90% chance of a successful outcome for a specific patient, based on the patient’s condition, prior similar cases, and clinical judgment. Another surgeon might estimate 85%. Neither is objectively wrong: both are informed beliefs.

Frequentist vs Bayesian probability

These two schools disagree about what probability means, and that disagreement leads to fundamentally different approaches to statistical inference.

Frequentist: probability is the long-run relative frequency of an event in infinitely many repetitions of an experiment. Probabilities are objective properties of the world. Parameters are fixed (unknown) constants, not random variables. You cannot assign a probability to a hypothesis.

Bayesian: probability is a degree of belief, updated as evidence accumulates. Parameters can have probability distributions. Prior beliefs are combined with data to produce posterior beliefs via Bayes’ theorem.

Frequentist vs Bayesian: same question, different answers

Question: what is the probability that this specific coin is fair?

Frequentist answer: the coin either is or is not fair. There is no probability to assign to a fixed property of an object. We can only test whether observed data is consistent with fairness.

Bayesian answer: before flipping, we have a prior belief (say, 90% that the coin is fair). After observing 10 heads in 10 flips, we update that belief using Bayes’ theorem to get a much lower posterior probability.

Neither approach is universally correct: they answer different questions.
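The Bayesian update can be made concrete, but only after specifying what "not fair" means, which the example leaves open. The sketch below assumes, purely for illustration, that the only alternative to a fair coin is a two-headed coin; with a different alternative the posterior would differ. The function name `posterior_fair` is hypothetical.

```python
from fractions import Fraction


def posterior_fair(prior_fair, heads, flips, p_heads_if_unfair):
    """Posterior P(fair | data) in a two-hypothesis model, via Bayes' theorem.

    The binomial coefficient is omitted because it is the same under both
    hypotheses and cancels in the ratio.
    """
    tails = flips - heads
    like_fair = Fraction(1, 2) ** flips
    like_unfair = p_heads_if_unfair ** heads * (1 - p_heads_if_unfair) ** tails
    evidence = like_fair * prior_fair + like_unfair * (1 - prior_fair)
    return like_fair * prior_fair / evidence


# Prior of 90% fair, then 10 heads in 10 flips (numbers from the example).
# ASSUMPTION: the unfair alternative is a two-headed coin, P(heads) = 1.
post = posterior_fair(
    Fraction(9, 10), heads=10, flips=10, p_heads_if_unfair=Fraction(1)
)
print(post, float(post))  # 9/1033, a little under 1%
```

Under this assumed alternative, the belief that the coin is fair drops from 90% to under 1%.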

Bayes’ theorem

Bayes’ theorem is the formal rule for updating probabilities when new evidence arrives:

\[P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B)}\]

where:

  • \(P(A \mid B)\) is the posterior: probability of \(A\) given that \(B\) occurred.
  • \(P(A)\) is the prior: probability of \(A\) before observing \(B\).
  • \(P(B \mid A)\) is the likelihood: probability of observing \(B\) if \(A\) is true.
  • \(P(B)\) is the marginal probability of observing \(B\) (under all hypotheses).

⚠️ P(A|B) ≠ P(B|A) - a critical asymmetry

Confusing \(P(A \mid B)\) with \(P(B \mid A)\) is one of the most common errors in probability, with serious real-world consequences. A classic example:

  • \(P(\text{positive test} \mid \text{has disease}) = 0.99\) (test sensitivity).
  • \(P(\text{has disease} \mid \text{positive test}) = ?\) (what the patient wants to know).

These are not the same number. The second depends critically on how common the disease is in the population. For a rare disease, even a very accurate test can produce mostly false positives. This is calculated via Bayes’ theorem, as shown in the example below.

Bayes’ theorem: medical test for a rare disease

A disease affects 1 in 1,000 people (\(P(\text{disease}) = 0.001\)). A test has:

  • Sensitivity: \(P(\text{positive} \mid \text{disease}) = 0.99\)
  • Specificity: \(P(\text{negative} \mid \text{no disease}) = 0.95\), so \(P(\text{positive} \mid \text{no disease}) = 0.05\)

A patient tests positive. What is the probability they actually have the disease?

Step 1: compute \(P(\text{positive})\) using the law of total probability:

\[\begin{aligned} P(\text{pos}) &= P(\text{pos} \mid \text{disease}) \cdot P(\text{disease}) + P(\text{pos} \mid \text{no disease}) \cdot P(\text{no disease}) \\ &= 0.99 \times 0.001 + 0.05 \times 0.999 = 0.00099 + 0.04995 = 0.05094 \end{aligned}\]

Step 2: apply Bayes’ theorem:

\[P(\text{disease} \mid \text{pos}) = \frac{0.99 \times 0.001}{0.05094} \approx 0.0194\]

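Only about 2% of people who test positive actually have the disease. The result is counterintuitive, but it follows from the base rate: the disease is so rare that even a small false positive rate generates many more false positives than true positives in the population. Out of 100,000 people, about 99 are true positives and about 4,995 are false positives. Ignoring the base rate and reading the 99% sensitivity as the answer is known as the base rate fallacy.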
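The two steps above can be wrapped in a short function. This is a minimal sketch; the function name `posterior_given_positive` and the second prevalence value of 10% are illustrative assumptions, not part of the example.

```python
def posterior_given_positive(prevalence, sensitivity, specificity):
    """P(disease | positive test) via Bayes' theorem."""
    false_positive_rate = 1 - specificity
    # Step 1: law of total probability for P(positive)
    p_positive = (
        sensitivity * prevalence + false_positive_rate * (1 - prevalence)
    )
    # Step 2: Bayes' theorem
    return sensitivity * prevalence / p_positive


# Numbers from the example: 1 in 1,000 prevalence, 99% sensitivity, 95% specificity
p_rare = posterior_given_positive(0.001, 0.99, 0.95)
print(f"{p_rare:.4f}")  # 0.0194

# Hypothetical comparison: the same test where 10% of people have the disease
p_common = posterior_given_positive(0.10, 0.99, 0.95)
print(f"{p_common:.4f}")  # 0.6875
```

Holding the test fixed and changing only the prevalence shows how strongly the answer depends on the prior.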

Key probability rules

Addition rule

For any two events \(A\) and \(B\):

\[P(A \cup B) = P(A) + P(B) - P(A \cap B)\]

The subtraction corrects for double-counting the overlap. If \(A\) and \(B\) are mutually exclusive (\(P(A \cap B) = 0\)):

\[P(A \cup B) = P(A) + P(B)\]

Addition rule: employees who speak French or German

In a group of 100 employees, 40 speak French, 30 speak German, and 10 speak both. What is the probability that a randomly chosen employee speaks French or German?

\[P(F \cup G) = 0.40 + 0.30 - 0.10 = 0.60\]
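The rule can be checked by counting with sets. In this sketch the employee IDs and the way the languages are assigned to them are arbitrary assumptions, chosen only so that the group sizes match the example (40 French, 30 German, 10 both).

```python
from fractions import Fraction

employees = set(range(100))
french = set(range(0, 40))   # 40 French speakers (IDs 0-39)
german = set(range(30, 60))  # 30 German speakers (IDs 30-59); IDs 30-39 speak both


def prob(event):
    """Probability that a uniformly chosen employee belongs to `event`."""
    return Fraction(len(event), len(employees))


direct = prob(french | german)  # count the union directly
by_rule = prob(french) + prob(german) - prob(french & german)

print(direct, by_rule)  # 3/5 3/5
```

Without the subtraction, the 10 bilingual employees would be counted twice and the result would be 0.70 instead of 0.60.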

Multiplication rule

For independent events (the occurrence of one does not affect the other):

\[P(A \cap B) = P(A) \times P(B)\]

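For dependent events, the second factor is conditioned on the first (this form holds for any two events with \(P(A) > 0\)):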

\[P(A \cap B) = P(A) \times P(B \mid A)\]
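As an illustration of the dependent case, consider drawing two cards from a standard 52-card deck without replacement and asking for the probability that both are aces. This scenario is an added example, not one from the text above; the brute-force enumeration is included only as a cross-check.

```python
from fractions import Fraction
from itertools import permutations

# Conditional form: P(both aces) = P(first ace) * P(second ace | first ace)
p_first_ace = Fraction(4, 52)
p_second_given_first = Fraction(3, 51)  # one ace and one card already removed
p_both = p_first_ace * p_second_given_first
print(p_both)  # 1/221

# Cross-check by enumerating every ordered pair of distinct cards
deck = ["ace"] * 4 + ["other"] * 48
pairs = list(permutations(range(52), 2))
both_aces = sum(1 for i, j in pairs if deck[i] == "ace" and deck[j] == "ace")
p_enumerated = Fraction(both_aces, len(pairs))
print(p_enumerated)  # 1/221

# Treating the draws as independent would give a different, incorrect value
p_if_independent = p_first_ace * p_first_ace
print(p_if_independent)  # 1/169
```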

Multiplication rule: server uptime

A server has a 99% uptime rate per day. What is the probability that it is up on both Monday and Tuesday (assuming independence)?

\[P(\text{up Mon} \cap \text{up Tue}) = 0.99 \times 0.99 = 0.9801\]

For a month of 30 independent days: \(0.99^{30} \approx 0.740\). That leaves only about a 74% chance of zero downtime in a month.
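The repeated multiplication generalizes to any number of days. A minimal sketch, assuming (as the example does) that days are independent and share the same uptime probability; the function name `prob_all_up` is hypothetical.

```python
def prob_all_up(daily_uptime, days):
    """P(up on every one of `days` independent days) = daily_uptime ** days."""
    return daily_uptime ** days


two_days = prob_all_up(0.99, 2)
thirty_days = prob_all_up(0.99, 30)

print(round(two_days, 4))     # 0.9801
print(round(thirty_days, 3))  # 0.74
```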

Complement rule

\[P(\bar{A}) = 1 - P(A)\]

Often the easiest way to compute \(P(A)\) is to compute \(1 - P(\bar{A})\).

Complement rule: at least one failure

A system has 5 independent components, each with a 1% failure probability. What is the probability that at least one fails?

Direct calculation requires summing many cases. Using the complement:

\[P(\text{at least one fails}) = 1 - P(\text{none fail}) = 1 - 0.99^5 \approx 1 - 0.951 = 0.049\]

About 5% chance of at least one failure.
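To see why the complement is the shorter route, the sketch below computes the answer both ways: once with the complement rule, and once by summing the probability of all 31 failure patterns that contain at least one failed component. Variable names are illustrative.

```python
from itertools import product

p_fail = 0.01
n_components = 5

# Complement rule: one line
via_complement = 1 - (1 - p_fail) ** n_components

# Direct route: enumerate every fail/ok pattern with at least one failure
direct = 0.0
patterns_counted = 0
for pattern in product([True, False], repeat=n_components):  # True = fails
    if any(pattern):
        pattern_prob = 1.0
        for failed in pattern:
            pattern_prob *= p_fail if failed else 1 - p_fail
        direct += pattern_prob
        patterns_counted += 1

print(patterns_counted)                             # 31
print(round(via_complement, 4), round(direct, 4))   # 0.049 0.049
```

Both routes agree, but the direct one grows as \(2^n - 1\) cases while the complement stays a single expression.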

💡 When to use each rule

  • Addition rule: “at least one of these events occurs”: use \(P(A \cup B)\).
  • Multiplication rule: “all of these events occur”: use \(P(A \cap B)\).
  • Complement rule: “at least one” problems across several independent events are usually easier via the complement: \(1 - P(\text{none})\).
  • Bayes’ theorem: “given that we observed this, what is the updated probability?”: any conditional inference problem.