Stratified sampling

Stratified sampling divides the population into homogeneous subgroups (strata) and draws a separate sample from each. By guaranteeing that every stratum is represented and by removing between-stratum variation from the sampling error, it is usually more efficient than simple random sampling (SRS) for heterogeneous populations.

Why stratify?

SRS can, by chance, over- or under-represent important subgroups. Stratification prevents this and reduces the standard error by exploiting the fact that units within each stratum are more similar to each other than to units in other strata. The gain in efficiency depends on how much the stratum means differ: the larger the differences between strata, and the more homogeneous the units within each stratum, the greater the advantage over SRS.

Procedure

Given a population of \(N\) units divided into \(H\) strata of sizes \(N_1, N_2, \ldots, N_H\):

  1. Define mutually exclusive and exhaustive strata (every unit belongs to exactly one stratum).
  2. Allocate the total sample size \(n\) across strata: \(n_1, n_2, \ldots, n_H\) with \(\sum n_h = n\).
  3. Draw a simple random sample of size \(n_h\) independently within each stratum.
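The three steps can be sketched in Python as follows. This is a minimal illustration using only the standard library; the function name `stratified_sample`, its parameters, and the toy population are assumptions made for the example, not part of any particular survey package.

```python
import random
from collections import defaultdict


def stratified_sample(units, stratum_of, allocation, seed=None):
    """Draw an independent simple random sample within each stratum.

    units      : iterable of population units
    stratum_of : function mapping a unit to its stratum label
    allocation : dict {stratum label: n_h}
    """
    rng = random.Random(seed)

    # Step 1: partition the population; each unit lands in exactly one stratum.
    strata = defaultdict(list)
    for unit in units:
        strata[stratum_of(unit)].append(unit)

    # Steps 2 and 3: for each allocated n_h, sample without replacement
    # inside that stratum only.
    sample = {}
    for label, n_h in allocation.items():
        members = strata[label]
        if n_h > len(members):
            raise ValueError(f"n_h={n_h} exceeds size of stratum {label!r}")
        sample[label] = rng.sample(members, n_h)
    return sample


# Hypothetical population of (unit id, stratum label) pairs.
population = [(i, "A") for i in range(60)] + [(i, "B") for i in range(60, 100)]
sample = stratified_sample(population, lambda u: u[1], {"A": 6, "B": 4}, seed=0)
```

Using a separate `random.Random` instance keeps the draw reproducible for a given seed without touching the global random state.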

The stratified estimator of the population mean is:

\[\bar{y}_{st} = \sum_{h=1}^{H} W_h \bar{y}_h, \qquad W_h = \frac{N_h}{N}\]

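where \(\bar{y}_h\) is the sample mean within stratum \(h\) and \(W_h\) is its population weight. Because the strata are sampled independently, its variance is the weighted sum of the per-stratum variances:

\[\text{Var}(\bar{y}_{st}) = \sum_{h=1}^{H} W_h^2 \cdot \frac{S_h^2}{n_h} \cdot \left(1 - \frac{n_h}{N_h}\right)\]

where \(S_h^2\) is the variance of the outcome within stratum \(h\) and \(1 - n_h/N_h\) is the finite population correction.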
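As a rough sketch, both formulas can be evaluated from sample data as below. The function name and the toy data are hypothetical, and the code substitutes the sample variance \(s_h^2\) for the unknown \(S_h^2\), so it returns an estimate of the variance and needs at least two observations per stratum.

```python
from statistics import mean, variance


def stratified_mean_and_variance(samples, stratum_sizes):
    """samples: {label: observed y values}; stratum_sizes: {label: N_h}."""
    N = sum(stratum_sizes.values())
    y_st = 0.0
    var_st = 0.0
    for label, ys in samples.items():
        N_h = stratum_sizes[label]
        n_h = len(ys)
        W_h = N_h / N
        s2_h = variance(ys)  # sample variance (divisor n_h - 1), stands in for S_h^2
        y_st += W_h * mean(ys)
        var_st += W_h ** 2 * (s2_h / n_h) * (1 - n_h / N_h)
    return y_st, var_st


# Hypothetical data: two strata of sizes 60 and 40.
y_st, var_st = stratified_mean_and_variance(
    {"A": [10, 12, 14], "B": [20, 22]},
    {"A": 60, "B": 40},
)
print(round(y_st, 3), round(var_st, 3))  # 15.6 0.608
```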

Allocation methods

Proportional allocation

Each stratum contributes to the sample in proportion to its size:

\[n_h = n \cdot \frac{N_h}{N} = n \cdot W_h\]

Proportional allocation is simple to implement and self-weighting: the unweighted overall sample mean \(\bar{y}\) coincides with \(\bar{y}_{st}\), so it is a valid estimator of the population mean without weights. It works well when within-stratum variances are similar across strata.

Optimal (Neyman) allocation

Allocates more observations to strata that are larger and more variable:

\[n_h = n \cdot \frac{N_h S_h}{\sum_{j=1}^{H} N_j S_j}\]

where \(S_h\) is the within-stratum standard deviation. Neyman allocation minimizes \(\text{Var}(\bar{y}_{st})\) for a fixed total \(n\), assuming the cost of sampling a unit is the same in every stratum. It requires a prior estimate of \(S_h\) for each stratum (from a pilot study or historical data).

When \(S_h\) is the same for all strata, Neyman allocation reduces to proportional allocation.
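The two allocation rules are short enough to sketch directly. The function names below are illustrative, and the largest-remainder rounding in `round_to_total` is an assumption added here to obtain integer sample sizes that still sum to \(n\); the formulas themselves do not prescribe a rounding rule.

```python
import math


def proportional_allocation(n, sizes):
    """n_h = n * N_h / N (unrounded)."""
    N = sum(sizes.values())
    return {h: n * N_h / N for h, N_h in sizes.items()}


def neyman_allocation(n, sizes, sds):
    """n_h = n * N_h * S_h / sum_j (N_j * S_j) (unrounded)."""
    total = sum(sizes[h] * sds[h] for h in sizes)
    return {h: n * sizes[h] * sds[h] / total for h in sizes}


def round_to_total(alloc, n):
    """Largest-remainder rounding so the integer n_h still sum to n."""
    rounded = {h: math.floor(x) for h, x in alloc.items()}
    shortfall = n - sum(rounded.values())
    # Hand the leftover units to the strata with the largest fractional parts.
    by_remainder = sorted(alloc, key=lambda h: alloc[h] - rounded[h], reverse=True)
    for h in by_remainder[:shortfall]:
        rounded[h] += 1
    return rounded


# Hypothetical strata with equal standard deviations:
# Neyman allocation then matches proportional allocation.
sizes = {"A": 600, "B": 400}
equal_sds = {"A": 5.0, "B": 5.0}
prop = proportional_allocation(100, sizes)
ney = neyman_allocation(100, sizes, equal_sds)
print(prop, ney)
```

A practical implementation would also cap each \(n_h\) at \(N_h\), since the Neyman formula can request more units than a small, highly variable stratum contains.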

Figure: dot plot of four strata of different sizes, with a proportional sample drawn from each stratum.

Complete example: employee survey

A company with 1,000 employees in three departments wants to survey 100 employees on job satisfaction. The departments are:

Department     \(N_h\)    \(W_h\)    Estimated \(S_h\)
Sales          500        0.50       8.2
Engineering    300        0.30       12.5
HR             200        0.20       5.1

Proportional allocation:

\[n_\text{Sales} = 100 \times 0.50 = 50, \quad n_\text{Eng} = 30, \quad n_\text{HR} = 20\]

Neyman allocation:

\[\sum N_h S_h = 500\times8.2 + 300\times12.5 + 200\times5.1 = 4100 + 3750 + 1020 = 8870\]

\[n_\text{Sales} = 100 \times \frac{4100}{8870} \approx 46, \quad n_\text{Eng} = 100 \times \frac{3750}{8870} \approx 42, \quad n_\text{HR} = 100 \times \frac{1020}{8870} \approx 12\]

Neyman allocation shifts observations from Sales and HR (lower variability) toward Engineering (highest \(S_h\)), where the additional information is most valuable. The HR figure is rounded up from about 11.5 so that the three allocations sum to 100.
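To see what the reallocation buys, the variance formula for \(\bar{y}_{st}\) can be evaluated under both allocations. The sketch below treats the estimated \(S_h\) values from the table as if they were the true within-stratum standard deviations, which is an assumption; the variable and function names are illustrative.

```python
sizes = {"Sales": 500, "Engineering": 300, "HR": 200}
sds = {"Sales": 8.2, "Engineering": 12.5, "HR": 5.1}
N = sum(sizes.values())


def variance_of_mean(alloc):
    """Var(y_st) = sum over strata of W_h^2 * S_h^2 / n_h * (1 - n_h / N_h)."""
    return sum(
        (sizes[h] / N) ** 2 * sds[h] ** 2 / alloc[h] * (1 - alloc[h] / sizes[h])
        for h in sizes
    )


proportional = {"Sales": 50, "Engineering": 30, "HR": 20}
neyman = {"Sales": 46, "Engineering": 42, "HR": 12}

var_prop = variance_of_mean(proportional)
var_neyman = variance_of_mean(neyman)
print(f"proportional: {var_prop:.3f}, Neyman: {var_neyman:.3f}")
# proportional: 0.771, Neyman: 0.701
```

Under these assumed \(S_h\) values, Neyman allocation lowers the variance of the estimated mean by roughly 9% for the same total of 100 interviews.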

Figure: bar chart comparing proportional and Neyman allocation across the three departments.

Efficiency gain over SRS

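Under proportional allocation, and provided the strata are large enough that terms of order \(1/N_h\) can be ignored, the variance of the stratified mean satisfies:

\[\text{Var}(\bar{y}_{st,\text{prop}}) \leq \text{Var}(\bar{y}_{SRS})\]

with equality only when all stratum means are identical (no benefit from stratification). The gain increases with the between-stratum variance relative to the total variance: the difference between the two variances is approximately \(\frac{1-f}{n}\sum_{h} W_h (\bar{Y}_h - \bar{Y})^2\), where \(f = n/N\) is the sampling fraction and \(\bar{Y}_h\) and \(\bar{Y}\) are the stratum and population means.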
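A small numerical check makes the comparison concrete. The population below is entirely hypothetical (three strata with clearly separated means), and because every value is known, both variances are computed exactly from the formulas rather than estimated by simulation.

```python
from statistics import variance

# Hypothetical population: three strata with clearly different means.
strata = {
    "low": [10 + (i % 5) for i in range(50)],
    "mid": [20 + (i % 5) for i in range(30)],
    "high": [40 + (i % 5) for i in range(20)],
}
n = 20
population = [y for ys in strata.values() for y in ys]
N = len(population)

# SRS without replacement: Var = (1 - n/N) * S^2 / n,
# with S^2 computed over the whole population (divisor N - 1).
var_srs = (1 - n / N) * variance(population) / n

# Proportional allocation n_h = n * N_h / N (here 10, 6 and 4),
# plugged into the stratified variance formula.
var_prop = 0.0
for ys in strata.values():
    N_h = len(ys)
    n_h = n * N_h / N
    var_prop += (N_h / N) ** 2 * variance(ys) / n_h * (1 - n_h / N_h)

# Ratios well below 1 mean stratification removed most of the sampling variance.
ratio = var_prop / var_srs
print(f"Var(prop) / Var(SRS) = {ratio:.3f}")
```

Because almost all of the variation in this toy population lies between strata, the ratio is far below 1; moving the stratum means closer together pushes it back toward 1.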

⚠️ Stratification can hurt if strata are poorly defined

Stratified sampling is only beneficial when:

  • Strata are internally homogeneous (low within-stratum variance).
  • Strata differ meaningfully from each other (high between-stratum variance).
  • The stratification variable is correlated with the outcome of interest.

If the strata are defined by a variable unrelated to the outcome (e.g., alphabetical order of surnames for a health outcome), stratification provides no benefit and wastes the effort of managing separate samples. In the worst case, poorly defined strata add administrative burden without reducing the standard error, and a badly chosen non-proportional allocation can even make the estimator less precise than SRS.

💡 Choosing stratification variables

Good stratification variables are those strongly correlated with the outcome. Common choices:

  • Demographic variables (age, gender, region) for population surveys.
  • Size (company revenue, hospital beds) for business or institutional surveys.
  • Historical outcome (previous satisfaction score, last year’s sales) when available.

A useful rule of thumb: if you can predict the outcome moderately well from the stratification variable, stratification will be efficient. If not, SRS or systematic sampling may be simpler and equally effective.