HOME

Shapiro-Wilk test

The Shapiro-Wilk test is the most powerful test for normality for small to medium samples. It compares the observed order statistics to what a normal distribution would predict, and summarizes the agreement in a single statistic \(W\) between 0 and 1.

Hypotheses

\(H_0\): the data come from a normally distributed population.

\(H_1\): the data do not come from a normally distributed population.

Reject \(H_0\) when \(p \leq \alpha\). Failing to reject \(H_0\) does not prove normality: it means the data are consistent with normality at the chosen significance level.

The W statistic

The test statistic is:

\[W = \frac{\left(\sum_{i=1}^{n} a_i x_{(i)}\right)^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}\]

where \(x_{(i)}\) are the ordered sample values and \(a_i\) are coefficients derived from the expected values of order statistics of a standard normal distribution. The denominator is proportional to the sample variance.

\(W\) ranges between 0 and 1. A value close to 1 means the observed quantiles match normal quantiles well. Small values of \(W\) indicate departures from normality.

The p-value is obtained by comparing \(W\) to its exact distribution (computed by Shapiro and Wilk via simulation), not from a standard table.

Examples

Example 1: normally distributed data

Q-Q plot and histogram for normally distributed data with Shapiro-Wilk test result

Points follow the diagonal closely and \(W \approx 1\): no evidence against normality.

Example 2: right-skewed data

Q-Q plot and histogram for right-skewed data showing departure from normality in the Shapiro-Wilk test

Points curve away from the diagonal in the upper tail (right skew). \(W\) is substantially below 1 and \(p < 0.05\): normality is rejected.

Interpreting W and the p-value

A useful way to read the result:

\(W > 0.95\) and \(p > 0.05\): strong consistency with normality.
\(W\) between 0.90 and 0.95 and \(p > 0.05\): mild departure, likely not problematic for robust parametric tests.
\(W < 0.90\) or \(p \leq 0.05\): meaningful departure from normality. Inspect the Q-Q plot to understand the nature of the departure (skewness, heavy tails, outliers).

The Q-Q plot always provides more information than the p-value alone: it shows where the data depart from normality, not just whether they do.

⚠️ Large samples: the test becomes too sensitive

For \(n > 100\), Shapiro-Wilk detects trivial departures from normality that have no practical consequence for parametric tests. A dataset with \(n = 500\) might give \(W = 0.993\) and \(p = 0.012\), rejecting normality despite the data being essentially normal for any practical purpose.

For large samples, rely on the Q-Q plot and consider the robustness of your downstream test. The \(t\)-test and ANOVA are quite robust to mild non-normality when \(n\) is large, thanks to the CLT.

⚠️ Small samples: the test has low power

For \(n < 15\), Shapiro-Wilk rarely rejects \(H_0\) even for clearly non-normal data. A non-significant result with small samples is weak evidence of normality. Again, the Q-Q plot is essential.

Running the test in R

The test is built into base R and requires no packages:

shapiro.test(x)

The output gives \(W\) and the p-value. For a complete normality assessment, complement with a Q-Q plot:

qqnorm(x)
qqline(x)

The Shapiro-Wilk test in R is limited to samples between 3 and 5,000 observations. For \(n > 5,000\), use the Anderson-Darling test from the nortest package: nortest::ad.test(x).

💡 When normality is rejected: what to do next

Rejecting \(H_0\) does not automatically mean parametric tests are invalid. Consider:

The downstream test’s robustness: \(t\)-tests and ANOVA are fairly robust to mild non-normality for \(n \geq 30\).
Transformations: log, square root, or Box-Cox transformations often normalize right-skewed data.
Non-parametric alternatives: Wilcoxon signed-rank instead of one-sample \(t\), Mann-Whitney instead of two-sample \(t\), Kruskal-Wallis instead of ANOVA.
Bootstrap methods: resample-based confidence intervals and tests require no normality assumption.

The goal is not to achieve a non-significant Shapiro-Wilk result but to use a method appropriate for the actual distribution of your data.