Statistical Inference: Hypothesis Testing

Null and alternative hypotheses, p-values, Type I/II errors, and t-tests.

Hypothesis testing is a formal framework for making decisions from data. You start with a default assumption (the null hypothesis $H_0$) and ask: is the evidence strong enough to reject it?

The $p$-value quantifies the strength of evidence against $H_0$: it is the probability of seeing results at least as extreme as yours, assuming $H_0$ is true. A small $p$-value means the data would be surprising under $H_0$.

Understanding the logic of hypothesis testing — including Type I and Type II errors — is essential for interpreting any scientific study that reports 'statistical significance.'

The logic of hypothesis testing

The null hypothesis $H_0$ is the default claim (no effect, no difference, status quo). The alternative $H_a$ is the claim the investigator wants to support. We begin by assuming $H_0$ is true and ask: how surprising is the observed data under this assumption?

The test statistic measures how many standard errors the sample result is from the value specified by $H_0$. For a mean: $z = (\bar{x}-\mu_0)/(\sigma/\sqrt{n})$ (if $\sigma$ is known) or $t = (\bar{x}-\mu_0)/(s/\sqrt{n})$ (if $\sigma$ is estimated by $s$).

The $p$-value is $P(X \geq x_\text{obs} \mid H_0)$ for a right-sided test: the probability of observing data at least as extreme as yours, assuming $H_0$ is true. Small $p$-values ($p \leq \alpha$) lead to rejection of $H_0$; large $p$-values do not.

One-sided vs. two-sided tests: $H_a: \mu > \mu_0$ is right-sided (one-tailed, $p = P(Z > z_\text{obs})$); $H_a: \mu < \mu_0$ is left-sided ($p = P(Z < z_\text{obs})$); $H_a: \mu \neq \mu_0$ is two-sided ($p = 2P(Z > |z_\text{obs}|)$). Choose the direction of $H_a$ before collecting data.
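As a rough illustration of the formulas above, the following Python sketch computes the $z$ statistic and the three kinds of $p$-value using only the standard library. The function name `z_test` and the sample numbers ($\bar{x} = 52$, $\mu_0 = 50$, $\sigma = 10$, $n = 100$) are made up for this example and are not part of the module.

```python
from math import sqrt
from statistics import NormalDist

def z_test(xbar, mu0, sigma, n):
    """Return the z statistic and the right-, left-, and two-sided p-values."""
    z = (xbar - mu0) / (sigma / sqrt(n))      # distance from mu0 in standard errors
    std_normal = NormalDist()                 # standard normal: mean 0, sd 1
    p_right = 1 - std_normal.cdf(z)           # H_a: mu > mu0
    p_left = std_normal.cdf(z)                # H_a: mu < mu0
    p_two = 2 * (1 - std_normal.cdf(abs(z)))  # H_a: mu != mu0
    return z, p_right, p_left, p_two

# Hypothetical numbers chosen for illustration only.
z, p_right, p_left, p_two = z_test(xbar=52, mu0=50, sigma=10, n=100)
print(f"z = {z:.2f}")
print(f"right-sided p = {p_right:.4f}")
print(f"left-sided p  = {p_left:.4f}")
print(f"two-sided p   = {p_two:.4f}")
```

With these inputs the sample mean sits two standard errors above $\mu_0$, so the two-sided $p$-value falls just under $0.05$ while the left-sided $p$-value is close to 1.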

Equivalence of tests and intervals: reject $H_0: \mu = \mu_0$ at level $\alpha$ (two-sided) if and only if the $(1-\alpha)\times 100\%$ confidence interval for $\mu$ does not contain $\mu_0$. Confidence intervals and tests are dual procedures.
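The duality can be checked numerically. The sketch below, which assumes a known $\sigma$ (a $z$-interval) and uses invented numbers, makes the decision twice for several candidate values of $\mu_0$: once from the two-sided $p$-value and once from the confidence interval. The helper name `two_sided_z_decisions` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def two_sided_z_decisions(xbar, mu0, sigma, n, alpha=0.05):
    """Return (reject_by_p, reject_by_ci) for a two-sided z-test of H0: mu = mu0."""
    std_normal = NormalDist()
    se = sigma / sqrt(n)
    z = (xbar - mu0) / se
    p_two = 2 * (1 - std_normal.cdf(abs(z)))
    z_crit = std_normal.inv_cdf(1 - alpha / 2)   # about 1.96 for alpha = 0.05
    lower, upper = xbar - z_crit * se, xbar + z_crit * se
    reject_by_p = p_two <= alpha
    reject_by_ci = not (lower <= mu0 <= upper)   # mu0 outside the interval
    return reject_by_p, reject_by_ci

# Illustrative data: xbar = 52, sigma = 10, n = 100, several null values.
decisions = {}
for mu0 in (50.0, 50.5, 52.0, 54.5):
    decisions[mu0] = two_sided_z_decisions(xbar=52, mu0=mu0, sigma=10, n=100)
    print(mu0, decisions[mu0])
```

For every null value tried here the two decisions agree, which is what the equivalence predicts.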

Type I and Type II errors

Type I error ($\alpha$): rejecting $H_0$ when it is true (false positive). By choosing $\alpha$ you directly control the probability of this error. Common choices are $\alpha = 0.05$, $0.01$, or $0.10$, depending on the consequences of false rejection.

Type II error ($\beta$): failing to reject $H_0$ when it is false (false negative). $\beta$ is not directly controlled by $\alpha$; it depends on $n$, the true effect size, and $\sigma$.
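Both error rates can be estimated by simulation. The following sketch repeatedly draws normal samples and applies a two-sided $z$-test, first with $H_0$ true and then with a shifted mean. The settings ($n = 20$, $\sigma = 1$, a true mean of $0.5$ under the alternative, 5000 trials, a fixed seed) are arbitrary choices for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist

def rejection_rate(true_mu, mu0=0.0, sigma=1.0, n=20, alpha=0.05,
                   trials=5000, seed=0):
    """Fraction of simulated samples in which a two-sided z-test rejects H0: mu = mu0."""
    rng = random.Random(seed)                    # fixed seed for reproducibility
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(trials):
        sample = [rng.gauss(true_mu, sigma) for _ in range(n)]
        xbar = sum(sample) / n
        z = (xbar - mu0) / (sigma / sqrt(n))
        if abs(z) >= z_crit:
            rejections += 1
    return rejections / trials

type_1_rate = rejection_rate(true_mu=0.0)   # H0 is true: every rejection is a false positive
power = rejection_rate(true_mu=0.5)         # H0 is false: rejections are correct detections
type_2_rate = 1 - power                     # missed effects

print(f"estimated Type I error rate:  {type_1_rate:.3f}")
print(f"estimated Type II error rate: {type_2_rate:.3f}")
```

The estimated Type I rate should land near the chosen $\alpha$, while the Type II rate depends on the effect size, $n$, and $\sigma$ passed to the simulation.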

Power $= 1 - \beta$: the probability of correctly detecting a real effect when it exists. Power increases with larger $n$, larger true effect size, larger $\alpha$, and smaller $\sigma$. The conventional target is $80\%$ power.

The $\alpha$-$\beta$ trade-off: for fixed $n$, lowering $\alpha$ (stricter rejection) increases $\beta$ (more missed effects). For a given effect size and $\sigma$, increasing $n$ is the only way to improve power without inflating $\alpha$.
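To make the trade-off concrete, here is a minimal sketch of the power of a right-sided one-sample $z$-test, using the standard closed form $1 - \Phi(z_\alpha - \delta\sqrt{n}/\sigma)$, where $\delta$ is the true shift from $\mu_0$. The function name and the numbers ($\delta = 5$, $\sigma = 10$) are assumptions made for the example.

```python
from math import sqrt
from statistics import NormalDist

def power_right_sided_z(delta, sigma, n, alpha):
    """Power of a right-sided one-sample z-test when the true mean is mu0 + delta."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha)   # rejection threshold on the z scale
    shift = delta * sqrt(n) / sigma          # mean of z under the alternative
    return 1 - std_normal.cdf(z_crit - shift)

# Fixed n: a stricter alpha lowers power (raises beta).
power_by_alpha = {
    alpha: power_right_sided_z(delta=5, sigma=10, n=25, alpha=alpha)
    for alpha in (0.10, 0.05, 0.01)
}
for alpha, pw in power_by_alpha.items():
    print(f"alpha = {alpha:.2f}: power = {pw:.3f}, beta = {1 - pw:.3f}")

# Fixed alpha: a larger n raises power.
power_n25 = power_right_sided_z(delta=5, sigma=10, n=25, alpha=0.05)
power_n50 = power_right_sided_z(delta=5, sigma=10, n=50, alpha=0.05)
print(f"n = 25: {power_n25:.3f}, n = 50: {power_n50:.3f}")
```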

Sample size calculation: for a two-sided one-sample $z$-test at significance $\alpha$ and power $1-\beta$, $n = (z_{\alpha/2}+z_\beta)^2\sigma^2/\delta^2$, where $\delta = \mu_a - \mu_0$ is the detectable difference and $z_q$ denotes the upper-$q$ quantile of the standard normal distribution. Round $n$ up to the next whole number.
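The sample size formula translates directly into code. This sketch treats $z_{\alpha/2}$ and $z_\beta$ as upper-tail standard normal quantiles and rounds the result up; the function name and the inputs ($\delta = 5$, $\sigma = 10$) are illustrative assumptions.

```python
from math import ceil
from statistics import NormalDist

def sample_size_two_sided_z(delta, sigma, alpha=0.05, power=0.80):
    """Smallest whole n satisfying n >= (z_{alpha/2} + z_beta)^2 * sigma^2 / delta^2."""
    std_normal = NormalDist()
    z_alpha_2 = std_normal.inv_cdf(1 - alpha / 2)  # upper alpha/2 quantile
    z_beta = std_normal.inv_cdf(power)             # upper beta quantile, since power = 1 - beta
    n_exact = (z_alpha_2 + z_beta) ** 2 * sigma ** 2 / delta ** 2
    return ceil(n_exact)

n_required = sample_size_two_sided_z(delta=5, sigma=10)
n_half_delta = sample_size_two_sided_z(delta=2.5, sigma=10)
print(f"n for delta = 5:   {n_required}")
print(f"n for delta = 2.5: {n_half_delta}")
```

Because $\delta$ enters the formula squared, halving the detectable difference roughly quadruples the required sample size.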

The t-test family

One-sample $t$-test: $H_0: \mu = \mu_0$, test statistic $t = (\bar{x}-\mu_0)/(s/\sqrt{n}) \sim t_{n-1}$ under $H_0$. Used when $\sigma$ is unknown, which is essentially always.

Two-sample (independent) $t$-test: compares $\mu_1$ and $\mu_2$. Test statistic $t = (\bar{x}_1-\bar{x}_2)/\sqrt{s_1^2/n_1+s_2^2/n_2}$ (Welch's version, which does not assume equal variances). The degrees of freedom are estimated by the Welch-Satterthwaite formula.

Paired $t$-test: for matched pairs $(x_i, y_i)$, compute differences $d_i = x_i - y_i$ and apply a one-sample $t$-test to the $d_i$. It is more powerful than the two-sample test when pairs are positively correlated, because it removes between-subject variability.

The $z$-test assumes $\sigma$ is known; the $t$-test estimates $\sigma$ with $s$. For $n \geq 30$ the difference is negligible in practice. For small $n$, the heavier tails of the $t$-distribution properly account for the additional uncertainty in estimating $\sigma$.
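The three $t$ statistics can be sketched with the standard library alone. The code below computes each statistic and its degrees of freedom but stops short of a $p$-value, because Python's standard library has no $t$-distribution CDF; a statistics package such as SciPy would be needed for that step. The helper names and the small before/after data set are invented for illustration.

```python
from math import sqrt
from statistics import mean, variance

def one_sample_t(xs, mu0):
    """t statistic and degrees of freedom for H0: mu = mu0."""
    n = len(xs)
    t = (mean(xs) - mu0) / sqrt(variance(xs) / n)   # variance() is the sample variance s^2
    return t, n - 1

def welch_t(xs, ys):
    """Welch t statistic and Welch-Satterthwaite degrees of freedom."""
    vx, vy = variance(xs) / len(xs), variance(ys) / len(ys)
    t = (mean(xs) - mean(ys)) / sqrt(vx + vy)
    df = (vx + vy) ** 2 / (vx ** 2 / (len(xs) - 1) + vy ** 2 / (len(ys) - 1))
    return t, df

def paired_t(xs, ys):
    """Paired t-test: a one-sample t-test on the differences d_i = x_i - y_i."""
    diffs = [x - y for x, y in zip(xs, ys)]
    return one_sample_t(diffs, 0.0)

# Hypothetical measurements on the same five subjects.
before = [10.0, 12.0, 9.0, 11.0, 13.0]
after = [11.0, 14.0, 10.0, 12.0, 15.0]

t_welch, df_welch = welch_t(after, before)
t_paired, df_paired = paired_t(after, before)
print(f"Welch:  t = {t_welch:.3f}, df = {df_welch:.2f}")
print(f"Paired: t = {t_paired:.3f}, df = {df_paired}")
```

On this data the paired statistic is much larger than the Welch statistic, because subtracting each subject's own baseline removes the between-subject spread that dominates the two-sample standard error.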

Chi-square goodness-of-fit and tests for proportions

The chi-square goodness-of-fit test checks whether observed category counts match a hypothesised distribution. The statistic is $\chi^2 = \sum_i (O_i - E_i)^2/E_i$, where $O_i$ and $E_i$ are the observed and expected counts in category $i$, with $k-1$ degrees of freedom ($k$ = number of categories).
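A short sketch of the goodness-of-fit statistic follows, using a made-up example of 60 die rolls tested against a fair die. The decision compares the statistic with the tabulated 5% critical value for 5 degrees of freedom (11.070), since the standard library has no chi-square CDF for computing a $p$-value.

```python
def chi_square_gof(observed, expected):
    """Chi-square goodness-of-fit statistic and its degrees of freedom (k - 1)."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return stat, len(observed) - 1

# Hypothetical counts for faces 1..6 over 60 rolls; a fair die expects 10 each.
observed = [8, 9, 12, 11, 10, 10]
expected = [10] * 6

stat, df = chi_square_gof(observed, expected)
CRITICAL_5PCT_DF5 = 11.070          # chi-square table value: upper 5% point, df = 5
reject = stat > CRITICAL_5PCT_DF5
print(f"chi-square = {stat:.2f}, df = {df}, reject H0: {reject}")
```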

The one-sample proportion $z$-test: $H_0: p = p_0$, test statistic $z = (\hat{p}-p_0)/\sqrt{p_0(1-p_0)/n}$. Valid when $np_0 \geq 10$ and $n(1-p_0) \geq 10$.

Two-proportion $z$-test: $z = (\hat{p}_1-\hat{p}_2)/SE$, where $SE = \sqrt{\hat{p}(1-\hat{p})(1/n_1+1/n_2)}$ uses the pooled estimate $\hat{p} = (x_1+x_2)/(n_1+n_2)$ under $H_0: p_1 = p_2$.
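Both proportion tests can be sketched in a few lines. The function names and the counts below (60 successes out of 100 versus 45 out of 100) are hypothetical, and the two-sided $p$-value comes from the normal approximation described above.

```python
from math import sqrt
from statistics import NormalDist

def one_proportion_z(x, n, p0):
    """z statistic for H0: p = p0, using the null standard error."""
    p_hat = x / n
    return (p_hat - p0) / sqrt(p0 * (1 - p0) / n)

def two_proportion_z(x1, n1, x2, n2):
    """z statistic for H0: p1 = p2, using the pooled proportion in the SE."""
    p1_hat, p2_hat = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    return (p1_hat - p2_hat) / se

def two_sided_p(z):
    """Two-sided p-value from a standard normal reference distribution."""
    return 2 * (1 - NormalDist().cdf(abs(z)))

z_one = one_proportion_z(x=60, n=100, p0=0.5)
z_two = two_proportion_z(x1=60, n1=100, x2=45, n2=100)
print(f"one-sample: z = {z_one:.3f}, p = {two_sided_p(z_one):.4f}")
print(f"two-sample: z = {z_two:.3f}, p = {two_sided_p(z_two):.4f}")
```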

Multiple testing and p-hacking

If you run $m$ independent tests each at level $\alpha$ and every null hypothesis is true, the probability of at least one false positive is $1 - (1-\alpha)^m$. For $m=20$ and $\alpha=0.05$, this is $1 - 0.95^{20} \approx 0.64$: a $64\%$ chance of at least one spurious significant result.
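The growth of the family-wise error rate is easy to tabulate. This small sketch evaluates $1 - (1-\alpha)^m$ for a few values of $m$; the function name is an assumption for the example.

```python
def family_wise_error_rate(m, alpha=0.05):
    """P(at least one false positive) across m independent tests when all nulls are true."""
    return 1 - (1 - alpha) ** m

fwer_by_m = {m: family_wise_error_rate(m) for m in (1, 5, 20, 100)}
for m, rate in fwer_by_m.items():
    print(f"m = {m:3d}: FWER = {rate:.3f}")
```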

The Bonferroni correction: to keep the family-wise error rate at $\alpha$, use $\alpha/m$ for each individual test. It is conservative (too strict when tests are positively correlated).

The False Discovery Rate (FDR): the Benjamini-Hochberg procedure controls the expected fraction of false rejections among all rejections. Less conservative than Bonferroni, preferred for exploratory work.
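The two corrections can be compared on the same list of $p$-values. The sketch below implements Bonferroni and the Benjamini-Hochberg step-up rule (reject the $k$ smallest $p$-values, where $k$ is the largest rank with $p_{(k)} \leq k\alpha/m$). The ten $p$-values are invented for illustration.

```python
def bonferroni(pvals, alpha=0.05):
    """Reject H0_i when p_i <= alpha / m."""
    m = len(pvals)
    return [p <= alpha / m for p in pvals]

def benjamini_hochberg(pvals, alpha=0.05):
    """Benjamini-Hochberg step-up procedure; returns a reject flag per input p-value."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])   # indices from smallest to largest p
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank * alpha / m:
            k_max = rank                               # largest rank meeting its threshold
    reject = [False] * m
    for i in order[:k_max]:
        reject[i] = True
    return reject

# Hypothetical p-values from ten tests.
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]

n_bonferroni = sum(bonferroni(pvals))
n_bh = sum(benjamini_hochberg(pvals))
print(f"Bonferroni rejections: {n_bonferroni}")
print(f"Benjamini-Hochberg rejections: {n_bh}")
```

On this list Benjamini-Hochberg rejects one more hypothesis than Bonferroni, consistent with its being the less conservative of the two.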

p-hacking: running many analyses and selectively reporting significant ones. It inflates the Type I error rate and is a major cause of the replication crisis in science.
