Hypothesis testing is a formal framework for making decisions from data. You start with a default assumption (the null hypothesis $H_0$) and ask: is the evidence strong enough to reject it?
The $p$-value quantifies the strength of evidence against $H_0$: the probability of seeing results at least as extreme as yours, assuming $H_0$ is true. Small $p$-values mean the data would be surprising under $H_0$.
Understanding the logic of hypothesis testing — including Type I and Type II errors — is essential for interpreting any scientific study that reports 'statistical significance.'
The logic of hypothesis testing
The null hypothesis $H_0$ is the default claim (no effect, no difference, status quo). The alternative $H_1$ is the claim the investigator wants to support. We begin by assuming $H_0$ is true and ask: how surprising is the observed data under this assumption?
The test statistic measures how many standard errors the sample result is from the value specified by $H_0$. For a mean: $z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$ (if $\sigma$ is known) or $t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$ (if $\sigma$ is estimated by $s$).
The $p$-value is $P(\text{data at least as extreme as observed} \mid H_0)$: the probability of observing data at least as extreme as yours, assuming $H_0$ is true. Small $p$-values ($p < \alpha$) lead to rejection of $H_0$; large $p$-values do not.
One-sided vs. two-sided tests: $H_1: \mu > \mu_0$ is right-sided (one-tailed, $p = P(Z \ge z)$); $H_1: \mu < \mu_0$ is left-sided; $H_1: \mu \ne \mu_0$ is two-sided ($p = 2P(Z \ge |z|)$). Choose the direction of $H_1$ before collecting data.
Equivalence of tests and intervals: reject $H_0: \mu = \mu_0$ at level $\alpha$ (two-sided) if and only if the $(1-\alpha)$ confidence interval for $\mu$ does not contain $\mu_0$. Confidence intervals and tests are dual procedures.
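As a rough illustration of how the direction of the alternative changes the $p$-value, here is a minimal Python sketch that uses only the standard library. The function name `z_p_value`, the labels `"greater"`, `"less"`, and `"two-sided"`, and the value $z = 1.8$ are assumptions made for this example.

```python
from statistics import NormalDist


def z_p_value(z: float, alternative: str = "two-sided") -> float:
    """p-value of a z statistic under the standard normal null distribution."""
    std_normal = NormalDist()  # mean 0, standard deviation 1
    if alternative == "greater":    # right-sided: large z counts as extreme
        return 1.0 - std_normal.cdf(z)
    if alternative == "less":       # left-sided: small z counts as extreme
        return std_normal.cdf(z)
    if alternative == "two-sided":  # both tails count as extreme
        return 2.0 * (1.0 - std_normal.cdf(abs(z)))
    raise ValueError(f"unknown alternative: {alternative!r}")


# Illustrative value only: the same z gives three different p-values.
for alt in ("greater", "less", "two-sided"):
    print(alt, round(z_p_value(1.8, alt), 4))
```

The same statistic is borderline in one direction and unremarkable in the other, which is why the direction has to be fixed before the data are seen.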
💡Explain it simply
Hypothesis testing is like a criminal trial. $H_0$ is 'innocent until proven guilty.' The $p$-value is the probability of seeing evidence this strong (or stronger) if the defendant really is innocent. If that probability is very small (below $\alpha$), you convict (reject $H_0$). If not, you acquit, but that doesn't mean the defendant is innocent, just that the evidence wasn't strong enough.
One-sample z-test (two-sided)
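- A machine fills bottles. Claimed mean: $\mu_0$ ml, known standard deviation $\sigma$ ml. A sample of $n$ bottles gives mean $\bar{x}$ ml. Test $H_0: \mu = \mu_0$ vs $H_1: \mu \ne \mu_0$ at level $\alpha$.
- $z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$.
- $p$-value $= 2P(Z \ge |z|)$.
- If $p < \alpha$, reject $H_0$: the evidence suggests the machine is not filling to $\mu_0$ ml.
- CI check: $\bar{x} \pm z_{\alpha/2}\,\sigma/\sqrt{n}$. If the interval does not contain $\mu_0$, that is consistent with rejection.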
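The steps of this example can be traced in a short Python sketch. All numbers below (a claimed mean of 500 ml, $\sigma = 4$ ml, $n = 64$, $\bar{x} = 498.8$ ml, $\alpha = 0.05$) are hypothetical values chosen for illustration and are not taken from the example itself.

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical inputs, chosen only to illustrate the procedure.
mu0, sigma, n, xbar, alpha = 500.0, 4.0, 64, 498.8, 0.05

std_normal = NormalDist()
se = sigma / sqrt(n)                      # standard error of the mean
z = (xbar - mu0) / se                     # test statistic
p_value = 2.0 * (1.0 - std_normal.cdf(abs(z)))  # two-sided p-value

# (1 - alpha) confidence interval for mu with sigma known.
z_crit = std_normal.inv_cdf(1.0 - alpha / 2.0)
ci = (xbar - z_crit * se, xbar + z_crit * se)

reject = p_value < alpha
mu0_outside_ci = not (ci[0] <= mu0 <= ci[1])

print(f"z = {z:.2f}, p = {p_value:.4f}, reject H0: {reject}")
print(f"CI = ({ci[0]:.2f}, {ci[1]:.2f}), mu0 outside CI: {mu0_outside_ci}")
```

With these inputs the test and the interval agree, as the duality between two-sided tests and confidence intervals requires.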
Type I and Type II errors
Type I error ($\alpha$): rejecting $H_0$ when it is true (false positive). By choosing $\alpha$ you directly control the probability of this error. Common choices are $0.10$, $0.05$, or $0.01$, depending on the consequences of false rejection.
Type II error ($\beta$): failing to reject $H_0$ when it is false (false negative). $\beta$ is not directly controlled by $\alpha$; it depends on $\alpha$, the true effect size, and $n$.
Power $= 1 - \beta$: the probability of correctly detecting a real effect when it exists. Power increases with larger $n$, larger true effect size, larger $\alpha$, and smaller $\sigma$. The conventional target is $80\%$ power.
The $\alpha$-$\beta$ trade-off: for fixed $n$, lowering $\alpha$ (stricter rejection) increases $\beta$ (more missed effects). For a given effect size and $\sigma$, increasing $n$ is the only way to improve power without inflating $\alpha$.
Sample size calculation: for a two-sided one-sample $z$-test at significance $\alpha$ and power $1 - \beta$, $n = \left(\frac{(z_{\alpha/2} + z_{\beta})\,\sigma}{\delta}\right)^2$, where $\delta$ is the detectable difference and $z_q$ denotes the upper-$q$ quantile of the standard normal.
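A minimal Python sketch of the sample size calculation, assuming the two-sided form with $z_{\alpha/2}$ and rounding up to the next whole observation. The function names `sample_size` and `approx_power`, and the inputs $\delta = 2$ and $\sigma = 5$, are hypothetical. The power function ignores the negligible probability of rejecting in the wrong tail.

```python
from math import ceil, sqrt
from statistics import NormalDist

std_normal = NormalDist()


def sample_size(delta: float, sigma: float, alpha: float = 0.05,
                power: float = 0.80) -> int:
    """Smallest n for a two-sided one-sample z-test to detect a shift delta."""
    z_alpha = std_normal.inv_cdf(1.0 - alpha / 2.0)  # upper alpha/2 quantile
    z_beta = std_normal.inv_cdf(power)               # upper beta quantile
    return ceil(((z_alpha + z_beta) * sigma / delta) ** 2)


def approx_power(n: int, delta: float, sigma: float,
                 alpha: float = 0.05) -> float:
    """Approximate power at sample size n (far rejection tail ignored)."""
    z_alpha = std_normal.inv_cdf(1.0 - alpha / 2.0)
    shift = delta * sqrt(n) / sigma  # true effect measured in standard errors
    return 1.0 - std_normal.cdf(z_alpha - shift)


# Hypothetical inputs: detect a 2-unit shift when sigma = 5.
n_needed = sample_size(delta=2.0, sigma=5.0)
print(n_needed, round(approx_power(n_needed, 2.0, 5.0), 3))
```

Tightening $\alpha$ or raising the target power both increase the required $n$, which is the trade-off described above expressed in numbers.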
💡Explain it simply
Think of a smoke alarm. Type I error: it goes off when there is no fire (annoying false alarm). Type II error: there is a fire but it stays silent (catastrophic miss). You want both errors to be rare. Making the alarm more sensitive ($\alpha$ larger) catches more fires but also triggers more false alarms. The only way to get both right is a better alarm with more sensors ($n$ larger).
The t-test family
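One-sample $t$-test: $H_0: \mu = \mu_0$, test statistic $t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} \sim t_{n-1}$ under $H_0$. Used when $\sigma$ is unknown, which is essentially always.
Two-sample (independent) $t$-test: compares $\mu_1$ and $\mu_2$. Test statistic $t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s_1^2/n_1 + s_2^2/n_2}}$ (Welch's version, does not assume equal variances). Degrees of freedom estimated by the Welch-Satterthwaite formula.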
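To make the Welch statistic and the Welch-Satterthwaite degrees of freedom concrete, here is a small standard-library sketch. The function name `welch_t` and the two samples are hypothetical. Turning the statistic into a $p$-value needs a $t$-distribution CDF, which the Python standard library does not provide, so that step is left out.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample1, sample2):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    n1, n2 = len(sample1), len(sample2)
    # Squared standard error contributed by each group (s_i^2 / n_i).
    v1 = variance(sample1) / n1
    v2 = variance(sample2) / n2
    t = (mean(sample1) - mean(sample2)) / sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df


# Hypothetical measurements for two independent groups.
group_a = [5.1, 4.9, 5.6, 5.8, 5.3]
group_b = [4.2, 4.8, 4.5, 4.1, 4.9, 4.4]
t_stat, df = welch_t(group_a, group_b)
print(f"t = {t_stat:.3f}, df = {df:.2f}")
```

The estimated degrees of freedom are generally not an integer and always fall between $\min(n_1, n_2) - 1$ and $n_1 + n_2 - 2$.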
Paired $t$-test: for matched pairs $(x_i, y_i)$, compute differences $d_i = x_i - y_i$ and apply a one-sample $t$-test on the $d_i$. More powerful than the two-sample test when pairs are positively correlated, because it removes between-subject variability.
The $z$-test uses $\sigma$ known; the $t$-test estimates $\sigma$ with $s$. For $n > 30$ the difference is negligible in practice. For small $n$, the heavier tails of the $t$-distribution properly account for the additional uncertainty in estimating $\sigma$.
💡Explain it simply
One-sample $t$: 'Is my sample's mean different from a specific value?' Two-sample $t$: 'Do these two groups have different means?' Paired $t$: 'Are before/after measurements different?' Use paired when you can link observations one-to-one, since it gives more statistical power.
Paired t-test
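- Blood pressure before and after treatment for $n$ patients. Differences $d_i = \text{before}_i - \text{after}_i$, $i = 1, \dots, n$.
- $\bar{d} = \frac{1}{n}\sum_{i=1}^{n} d_i$. $s_d = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(d_i - \bar{d})^2}$.
- $t = \frac{\bar{d}}{s_d/\sqrt{n}}$, $df = n - 1$.
- $p = P(t_{n-1} \ge t)$ for the one-sided alternative $H_1: \mu_d > 0$. A small $p$-value is strong evidence that the treatment lowers blood pressure.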
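The paired procedure can be sketched with the Python standard library. The six before/after readings below are invented for illustration and are not the data of the example above. The critical value 2.015 is the tabulated upper 5% point of the $t$-distribution with 5 degrees of freedom, hard-coded here because the standard library has no $t$ CDF.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical systolic readings (mmHg) for six patients.
before = [150, 142, 160, 155, 148, 152]
after = [144, 140, 151, 150, 147, 145]

# Positive differences mean the reading went down after treatment.
diffs = [b - a for b, a in zip(before, after)]
n = len(diffs)
d_bar = mean(diffs)             # mean difference
s_d = stdev(diffs)              # sample standard deviation (n - 1 divisor)
t_stat = d_bar / (s_d / sqrt(n))
df = n - 1

# Tabulated one-sided critical value for alpha = 0.05 and df = 5.
T_CRIT = 2.015
print(f"d_bar = {d_bar}, s_d = {s_d:.3f}, t = {t_stat:.3f}, df = {df}")
print("reject H0 (one-sided):", t_stat > T_CRIT)
```

Because each patient acts as their own control, only the spread of the differences enters the standard error, not the spread between patients.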
Chi-square test for a proportion and goodness-of-fit
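The chi-square goodness-of-fit test checks whether observed category counts match a hypothesised distribution. $\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}$ with $k - 1$ degrees of freedom ($k$ = number of categories), where $O_i$ and $E_i$ are the observed and expected counts in category $i$.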
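A short sketch of the goodness-of-fit statistic, assuming a hypothetical experiment in which a die is rolled 120 times and tested against a uniform distribution. The function name `chi_square_gof` and the counts are made up for illustration. The critical value 11.07 is the tabulated 95th percentile of the chi-square distribution with 5 degrees of freedom.

```python
def chi_square_gof(observed, expected_probs):
    """Chi-square goodness-of-fit statistic and its degrees of freedom."""
    n = sum(observed)
    expected = [n * p for p in expected_probs]  # expected counts under H0
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return stat, len(observed) - 1


# Hypothetical counts for faces 1 to 6 over 120 rolls.
observed = [25, 17, 15, 23, 24, 16]
stat, df = chi_square_gof(observed, [1 / 6] * 6)

# Tabulated critical value for alpha = 0.05 and df = 5.
CHI2_CRIT = 11.07
print(f"chi2 = {stat:.2f}, df = {df}, reject H0: {stat > CHI2_CRIT}")
```

Here the deviations from 20 per face are small relative to what chance produces, so the fair-die hypothesis is not rejected.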
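The one-sample proportion $z$-test: $H_0: p = p_0$, test statistic $z = \frac{\hat{p} - p_0}{\sqrt{p_0(1 - p_0)/n}}$. Valid when $np_0$ and $n(1 - p_0)$ are both large enough for the normal approximation (a common rule of thumb is at least 10).
Two-proportion $z$-test: $H_0: p_1 = p_2$, $z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1 - \hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$, where $\hat{p} = \frac{x_1 + x_2}{n_1 + n_2}$ is the pooled estimate under $H_0$ and $x_1$, $x_2$ are the success counts.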
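Both proportion tests can be sketched in a few lines of standard-library Python. The function names and all counts below are hypothetical, and the two-sided $p$-value relies on the normal approximation.

```python
from math import sqrt
from statistics import NormalDist


def one_proportion_z(x: int, n: int, p0: float) -> float:
    """z statistic for H0: p = p0, using the null standard error."""
    p_hat = x / n
    return (p_hat - p0) / sqrt(p0 * (1.0 - p0) / n)


def two_proportion_z(x1: int, n1: int, x2: int, n2: int) -> float:
    """z statistic for H0: p1 = p2, using the pooled estimate."""
    p1_hat, p2_hat = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1.0 - pooled) * (1.0 / n1 + 1.0 / n2))
    return (p1_hat - p2_hat) / se


def two_sided_p(z: float) -> float:
    """Two-sided p-value under the standard normal."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


# Hypothetical counts: 40 successes in 100 trials against p0 = 0.5.
z_one = one_proportion_z(40, 100, 0.5)
# Hypothetical counts: 45/200 successes versus 30/200 successes.
z_two = two_proportion_z(45, 200, 30, 200)
print(round(z_one, 3), round(two_sided_p(z_one), 4))
print(round(z_two, 3), round(two_sided_p(z_two), 4))
```

Note that the one-sample test uses $p_0$ in the standard error while the two-sample test uses the pooled $\hat{p}$, since in each case the standard error is computed under $H_0$.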
💡Explain it simply
The proportion $z$-test asks: 'Is the fraction of successes in my sample far enough from my hypothesised value to be surprising?' A negative $z$-score means the observed proportion is $|z|$ standard errors below $p_0$; under a normal approximation, the larger $|z|$ is, the more rarely a result that far below $p_0$ happens by chance.
Multiple testing and p-hacking
If you run $m$ independent tests each at level $\alpha$, the probability of at least one false positive is $1 - (1 - \alpha)^m$. For $m = 20$ and $\alpha = 0.05$, this is $1 - 0.95^{20} \approx 0.64$: a $64\%$ chance of at least one spurious significant result.
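How quickly the chance of at least one false positive grows with the number of tests can be tabulated with a few lines of Python. The function name `familywise_error_rate` and the chosen test counts are illustrative, and the calculation assumes the tests are independent and every null hypothesis is true.

```python
def familywise_error_rate(alpha: float, m: int) -> float:
    """P(at least one false positive) across m independent tests at level alpha."""
    return 1.0 - (1.0 - alpha) ** m


# Illustrative numbers of tests, each run at alpha = 0.05.
for m in (1, 5, 10, 20, 50):
    print(m, round(familywise_error_rate(0.05, m), 3))
```

A single test keeps the error rate at the nominal level, but the rate climbs steeply as more uncorrected tests are added.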
The Bonferroni correction: to keep the family-wise error rate at $\alpha$, use $\alpha/m$ for each individual test. It is conservative (too strict when tests are positively correlated).
The False Discovery Rate (FDR): the Benjamini-Hochberg procedure controls the expected fraction of false rejections among all rejections. Less conservative than Bonferroni, preferred for exploratory work.
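The difference between the two corrections is easiest to see side by side. Below is a minimal sketch of Bonferroni and of the Benjamini-Hochberg step-up rule (reject the $k$ smallest $p$-values, where $k$ is the largest rank with $p_{(k)} \le \frac{k}{m}\alpha$). The function names and the ten $p$-values are hypothetical.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject each hypothesis whose p-value is at most alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure; returns reject flags in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        # Track the largest rank whose p-value clears its own threshold.
        if p_values[idx] <= rank / m * alpha:
            max_rank = rank
    reject = [False] * m
    for idx in order[:max_rank]:
        reject[idx] = True
    return reject


# Hypothetical p-values from ten tests.
p_vals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]
print("Bonferroni rejections:", sum(bonferroni(p_vals)))
print("BH rejections:        ", sum(benjamini_hochberg(p_vals)))
```

On these values Benjamini-Hochberg rejects one more hypothesis than Bonferroni, reflecting that it controls a less stringent error criterion.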
p-hacking: running many analyses and selectively reporting significant ones. It inflates the Type I error rate and is a major cause of the replication crisis in science.
💡Explain it simply
If you test $20$ hypotheses at $\alpha = 0.05$, on average one will be 'significant' just by chance even if none of the effects is real. Multiple testing correction is the fix: you demand stronger evidence for each test to keep the overall false-alarm rate at $\alpha$.
Common Mistakes to Avoid
- Interpreting 'fail to reject $H_0$' as evidence that $H_0$ is true. Absence of evidence is not evidence of absence.
- Reporting only the $p$-value without an effect size or confidence interval. A small $p$-value tells you the data are hard to reconcile with $H_0$; a confidence interval tells you how large the effect plausibly is.
- Using a $z$-test when $\sigma$ is unknown. Use the $t$-distribution whenever $\sigma$ is estimated from the data.
- Choosing the direction of $H_1$ after seeing the data ('data snooping'). The alternative hypothesis must be specified before data collection.
- Running many tests without correction and cherry-picking significant results (p-hacking). Apply Bonferroni or FDR correction for multiple comparisons.