Non-Parametric Statistics

Non-parametric tests make few or no assumptions about the shape of the population distribution. They are the tool of choice when the data are ordinal or heavily skewed, or when normality cannot be assumed.

The chi-square tests handle categorical data: the goodness-of-fit test checks whether observed frequencies match a theoretical distribution; the test for independence checks whether two categorical variables are associated.

Rank-based tests (Mann-Whitney U, Wilcoxon signed-rank) are non-parametric alternatives to $t$-tests. Because they convert data to ranks, they are robust to outliers and do not require normality.

Chi-square goodness-of-fit test

The goodness-of-fit test checks whether observed category frequencies match a hypothesised distribution. For $k$ categories with observed counts $O_i$ and expected counts $E_i = n\cdot p_i$ (where $p_i$ are the hypothesised probabilities), the test statistic is $\chi^2 = \sum_{i=1}^k \frac{(O_i-E_i)^2}{E_i}$ .

Under $H_0$, $\chi^2$ approximately follows a $\chi^2_{k-1}$ distribution when $n$ is large (degrees of freedom $= k-1$; subtract one more for each parameter estimated from the data, e.g., two more if $\mu$ and $\sigma$ are estimated for a normal fit).

The test is always right-tailed: a large $\chi^2$ means the observed counts deviate substantially from expected. A small $\chi^2$ (not significant) means the data is consistent with $H_0$ .

Validity condition: all expected counts $E_i \geq 5$ . If some are too small, merge adjacent categories or use the exact multinomial test.

Applications: testing whether a die is fair, whether genetic ratios follow Mendelian predictions, or whether a sample comes from a specified distribution.

💡Explain it simply

Roll a die $600$ times. If it's fair, you expect $100$ of each face. The $\chi^2$ statistic adds up the squared gap between each actual count and $100$, each divided by $100$. A large sum means the die is probably loaded; a small sum means the counts are consistent with fairness.

Goodness-of-fit for a fair die

Roll a die $120$ times. Observed counts: $\{25, 17, 15, 23, 19, 21\}$ . $H_0$ : fair die ( $p_i=1/6$ for all $i$ ).
Expected: $E_i = 120/6 = 20$ for each face.
$\chi^2 = \frac{(25-20)^2}{20}+\frac{(17-20)^2}{20}+\frac{(15-20)^2}{20}+\frac{(23-20)^2}{20}+\frac{(19-20)^2}{20}+\frac{(21-20)^2}{20}$
$= \frac{25+9+25+9+1+1}{20} = \frac{70}{20} = 3.5$ . $df=5$ .
$\chi^2_{0.05,5}=11.07$ . Since $3.5 < 11.07$ , fail to reject $H_0$ . The data is consistent with a fair die.
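
The arithmetic in this example can be reproduced with a few lines of Python. This is a minimal sketch using only the standard library; the function name `chi_square_gof` is made up for illustration, and the critical value is simply the tabulated $11.07$ quoted above rather than something the code computes.

```python
def chi_square_gof(observed, probs):
    """Return (statistic, degrees of freedom, expected counts)."""
    n = sum(observed)
    expected = [n * p for p in probs]
    # Squared deviations, each scaled by its expected count.
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    df = len(observed) - 1  # assumes no parameters were estimated from the data
    return stat, df, expected

observed = [25, 17, 15, 23, 19, 21]      # die counts from the example
stat, df, expected = chi_square_gof(observed, [1 / 6] * 6)

CRITICAL_05_DF5 = 11.07                  # tabulated value quoted in the example
reject = stat > CRITICAL_05_DF5          # right-tailed decision rule
print(round(stat, 2), df, reject)
```

The same function also makes the validity condition easy to check: `min(expected) >= 5` should hold before the result is trusted.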

Chi-square test for independence

A two-way contingency table displays the joint frequencies of two categorical variables. The test of independence asks $H_0$ : the two variables are independent (knowing one gives no information about the other).

Under independence, the expected count in cell $(i,j)$ is $E_{ij} = (\text{row}_i \text{ total}) \times (\text{col}_j \text{ total})/n$ . The test statistic is $\chi^2 = \sum_{i,j} (O_{ij}-E_{ij})^2/E_{ij}$ with $(r-1)(c-1)$ degrees of freedom.

The chi-square test detects association but not direction or causation. For a $2\times 2$ table, the odds ratio or relative risk quantifies the strength of association beyond just significance.

Fisher's exact test: for small samples where $\chi^2$ is unreliable (expected counts $< 5$), compute the exact probability of the observed table and of all more extreme tables with the same row and column totals. Because it does not rely on a large-sample approximation, its p-value is valid for any sample size.
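
As a rough illustration of what "the observed table and all more extreme ones" means for a $2\times 2$ table, the sketch below sums hypergeometric probabilities with the margins held fixed. It assumes one common two-sided convention (tables no more probable than the observed one count as "more extreme"); the function name and the small example table are hypothetical.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def prob(x):
        # Hypergeometric probability that the top-left cell equals x,
        # given fixed row and column totals.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Small tolerance so that equally probable tables are included.
    return sum(prob(x) for x in range(lo, hi + 1)
               if prob(x) <= p_obs * (1 + 1e-9))

# Hypothetical small table: every expected count is 2, too small for chi-square.
p_value = fisher_exact_two_sided(3, 1, 1, 3)
print(p_value)
```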

For a $2\times 2$ table, the $\chi^2$ test for independence (without continuity correction) is equivalent to the two-sided pooled $z$-test for comparing two proportions: $z^2 = \chi^2$.
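
The identity $z^2 = \chi^2$ can be checked numerically. The sketch below assumes the pooled two-proportion $z$ statistic and the uncorrected Pearson statistic; the function names are illustrative, and the counts ($30$ of $40$ versus $20$ of $60$) are just example inputs.

```python
import math

def two_proportion_z(x1, n1, x2, n2):
    """Pooled two-proportion z statistic (no continuity correction)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square for [[a, b], [c, d]] via the 2x2 shortcut formula."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# Group 1: 30 successes out of 40; group 2: 20 successes out of 60.
z = two_proportion_z(30, 40, 20, 60)
chi2 = chi_square_2x2(30, 10, 20, 40)   # rows are the two groups
print(z ** 2, chi2)
```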

💡Explain it simply

You survey men and women about their drink preference (tea or coffee). A contingency table records the counts. The test asks: does gender have anything to do with drink preference? If men and women prefer coffee at the same rate, the variables are independent and $\chi^2$ will be small.

Chi-square test for independence

Contingency table (Gender vs. Preference): Coffee: Men $30$ , Women $20$ ; Tea: Men $10$ , Women $40$ . $n=100$ .
Row totals: Coffee $50$ , Tea $50$ . Column totals: Men $40$ , Women $60$ .
Expected counts: $E_{\text{Men,Coffee}} = 50(40)/100=20$ ; $E_{\text{Women,Coffee}} = 50(60)/100=30$ ; $E_{\text{Men,Tea}} = 20$ ; $E_{\text{Women,Tea}} = 30$ .
$\chi^2 = (30-20)^2/20 + (20-30)^2/30 + (10-20)^2/20 + (40-30)^2/30 = 5+3.33+5+3.33=16.67$ .
$df=(2-1)(2-1)=1$ . $\chi^2_{0.05,1}=3.84$ . Since $16.67>3.84$ , reject $H_0$ . Gender and drink preference are associated.
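
The same calculation can be sketched in Python for a table of any size. The function name is illustrative; the p-value line relies on the fact that a $\chi^2_1$ variable is a squared standard normal, so it is valid only when $df=1$.

```python
import math

def chi_square_independence(table):
    """Return (statistic, degrees of freedom, expected counts) for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # Expected count under independence: row total * column total / n.
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    stat = sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(table, expected)
               for o, e in zip(obs_row, exp_row))
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, df, expected

table = [[30, 20],   # Coffee: Men, Women
         [10, 40]]   # Tea:    Men, Women
stat, df, expected = chi_square_independence(table)

# Right-tail p-value, valid only for df = 1: P(Z^2 > stat) = erfc(sqrt(stat / 2)).
p_value = math.erfc(math.sqrt(stat / 2))
print(round(stat, 2), df, p_value)
```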

Rank-based tests: Mann-Whitney and Wilcoxon

The Mann-Whitney U test (Wilcoxon rank-sum test) is the non-parametric alternative to the two-sample $t$ -test. It tests whether one group tends to produce larger values than the other — formally, whether $P(X_1 > X_2) = 0.5$ .

Procedure: combine both groups, rank all observations, and sum the ranks for each group. The test statistic is $U = R_1 - n_1(n_1+1)/2$, where $R_1$ is the rank sum for group $1$ and $n_1$ is its sample size.
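
The procedure can be sketched as follows. This is an illustrative implementation, not a full test: it computes $U$ for each group but no p-value, it assigns tied values their average rank (an assumption the text does not spell out), and the scores are made up.

```python
def midranks(values):
    """Ranks from 1 (smallest) upward; tied values share their average rank."""
    s = sorted(values)
    return [s.index(v) + 1 + (s.count(v) - 1) / 2 for v in values]

def mann_whitney_u(group1, group2):
    """Return (U1, U2) computed from the pooled ranks."""
    n1, n2 = len(group1), len(group2)
    ranks = midranks(list(group1) + list(group2))
    r1 = sum(ranks[:n1])                 # rank sum for group 1
    u1 = r1 - n1 * (n1 + 1) / 2
    u2 = n1 * n2 - u1                    # the two U values always sum to n1 * n2
    return u1, u2

# Hypothetical scores for two teaching methods.
method_a = [78, 85, 90]
method_b = [60, 72, 81]
u1, u2 = mann_whitney_u(method_a, method_b)
print(u1, u2)
```

Here $U_1$ equals the number of (A, B) pairs in which the A score is larger, which is why it relates directly to $P(X_1 > X_2)$.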

The Wilcoxon signed-rank test is the paired version. Compute differences $d_i = x_i - y_i$ , rank their absolute values, and compare the sum of positive ranks to the sum of negative ranks. It is the non-parametric alternative to the paired $t$ -test.
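
A minimal sketch of the signed-rank computation is shown below. It assumes two conventions the text leaves open: zero differences are dropped, and tied absolute differences share their average rank. The paired values are hypothetical, and no p-value is computed.

```python
def wilcoxon_signed_rank(x, y):
    """Return (W_plus, W_minus): rank sums of positive and negative differences."""
    diffs = [a - b for a, b in zip(x, y) if a != b]   # drop zero differences
    abs_sorted = sorted(abs(d) for d in diffs)

    def rank_of(v):
        # Average rank of v among the absolute differences.
        return abs_sorted.index(v) + 1 + (abs_sorted.count(v) - 1) / 2

    w_plus = sum(rank_of(abs(d)) for d in diffs if d > 0)
    w_minus = sum(rank_of(abs(d)) for d in diffs if d < 0)
    return w_plus, w_minus

# Hypothetical before/after measurements on four subjects.
before = [10, 12, 9, 15]
after = [8, 13, 9, 11]
w_plus, w_minus = wilcoxon_signed_rank(before, after)
print(w_plus, w_minus)
```

With $m$ non-zero differences the two sums always add up to $m(m+1)/2$, so a strongly lopsided split is evidence against $H_0$.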

Rank-based tests are resistant to outliers because they use only the ordering of observations, not their actual values. However extreme an outlier is, it still receives only the largest (or smallest) rank.

Efficiency: when the normal distribution holds, the Mann-Whitney U test has about $95.5\%$ efficiency relative to the $t$ -test (i.e., it needs about $5\%$ more observations to achieve the same power). For non-normal data, it can be more powerful than the $t$ -test.
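
The $95.5\%$ figure is the asymptotic relative efficiency (ARE) of the Mann-Whitney U test against the $t$-test under normality, a standard large-sample result. Writing $n_{\text{MW}}$ and $n_t$ for the sample sizes the two tests need to reach the same power (symbols introduced here only for illustration):

$\text{ARE} = \frac{3}{\pi} \approx 0.955, \qquad \frac{n_{\text{MW}}}{n_t} \approx \frac{1}{0.955} \approx 1.047$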

💡Explain it simply

Instead of comparing actual test scores between two teaching methods, rank all students from $1$ (lowest score) to $n$ (highest score). Then ask: do students from Method A tend to have higher ranks? By using ranks instead of raw values, you don't need to assume anything about the distribution of scores.

Kruskal-Wallis and Spearman correlation

The Kruskal-Wallis test is the non-parametric equivalent of one-way ANOVA. It tests whether $k$ independent groups come from the same distribution; it can be read as a test of equal medians only if the group distributions are assumed to have the same shape and spread. It ranks all $N$ observations and compares rank sums across groups.

Test statistic: $H = \frac{12}{N(N+1)}\sum_{j=1}^k \frac{R_j^2}{n_j} - 3(N+1)$ , where $R_j$ is the rank sum for group $j$ . Under $H_0$ , $H \approx \chi^2_{k-1}$ .
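
The formula for $H$ translates almost directly into code. The sketch below is illustrative only: it uses average ranks for ties but applies no tie correction to $H$, and the three small groups are made-up data (far too small for the $\chi^2$ approximation to be reliable in practice).

```python
def kruskal_wallis_h(groups):
    """Return (H, degrees of freedom) using the rank-sum formula."""
    pooled = sorted(v for g in groups for v in g)
    n_total = len(pooled)
    # Average rank for each distinct value in the pooled sample.
    rank = {v: pooled.index(v) + 1 + (pooled.count(v) - 1) / 2
            for v in set(pooled)}
    # Sum over groups of R_j^2 / n_j, where R_j is the group's rank sum.
    total = sum(sum(rank[v] for v in g) ** 2 / len(g) for g in groups)
    h = 12 / (n_total * (n_total + 1)) * total - 3 * (n_total + 1)
    return h, len(groups) - 1

# Hypothetical data: three groups of three observations each.
groups = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
h, df = kruskal_wallis_h(groups)
print(round(h, 2), df)
```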

If Kruskal-Wallis is significant, use pairwise Mann-Whitney tests with Bonferroni correction for post-hoc comparisons.
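
To make the Bonferroni step concrete, the sketch below lists every pair of groups and the per-comparison significance level, assuming a family-wise level of $0.05$. The group labels and the function name are hypothetical.

```python
from itertools import combinations

def bonferroni_pairs(group_names, alpha=0.05):
    """Return all pairwise comparisons and the Bonferroni-adjusted alpha."""
    pairs = list(combinations(group_names, 2))
    # Divide the family-wise alpha by the number of comparisons.
    return pairs, alpha / len(pairs)

pairs, alpha_per_test = bonferroni_pairs(["A", "B", "C"])
print(pairs, alpha_per_test)
```

Each pairwise Mann-Whitney p-value would then be compared with `alpha_per_test` rather than with $0.05$.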

Spearman's rank correlation $r_s$ : compute ranks of $x$ and $y$ separately, then apply the Pearson correlation formula to the ranks. It measures monotone (not just linear) relationships and is robust to outliers and non-normality.
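
The "Pearson on ranks" recipe can be sketched as follows. The helper names are illustrative and the data are made up: $y = x^2$ is monotone but not linear, so $r_s$ should be $1$ while Pearson's $r$ falls short of $1$.

```python
import math

def midranks(values):
    """Ranks from 1 (smallest) upward; tied values share their average rank."""
    s = sorted(values)
    return [s.index(v) + 1 + (s.count(v) - 1) / 2 for v in values]

def pearson(u, v):
    """Pearson correlation coefficient of two equal-length sequences."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    var_u = sum((a - mu) ** 2 for a in u)
    var_v = sum((b - mv) ** 2 for b in v)
    return cov / math.sqrt(var_u * var_v)

def spearman(x, y):
    """Spearman's r_s: Pearson correlation applied to the ranks."""
    return pearson(midranks(x), midranks(y))

x = [1, 2, 3, 4, 5]
y = [v ** 2 for v in x]          # monotone, non-linear relationship
r_s = spearman(x, y)
r = pearson(x, y)
print(r_s, r)
```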

💡Explain it simply

Kruskal-Wallis is ANOVA with ranks: instead of comparing the actual group means, it compares the average rank positions of each group. Spearman correlation is Pearson correlation with ranks: instead of asking 'do $x$ and $y$ increase together linearly?', it asks 'do the rankings of $x$ and $y$ increase together?'

When to use non-parametric tests

Use non-parametric tests when: the data is ordinal (e.g., satisfaction ratings on a $1$ – $5$ scale); the sample size is small and normality cannot be assumed; there are severe outliers that resist transformation; or the outcome is inherently rank-based.

With large samples and continuous data, parametric tests are usually robust to non-normality thanks to the central limit theorem (CLT). Non-parametric tests are most valuable for small $n$ with non-normal data.

Power trade-off: when parametric assumptions hold, non-parametric tests have somewhat lower power (need larger $n$ to detect the same effect). When assumptions are violated, non-parametric tests are often more powerful.

Parametric vs. non-parametric summary: $t$ -test $\leftrightarrow$ Mann-Whitney U; paired $t$ -test $\leftrightarrow$ Wilcoxon signed-rank; one-way ANOVA $\leftrightarrow$ Kruskal-Wallis; Pearson $r$ $\leftrightarrow$ Spearman $r_s$ .

💡Explain it simply

Non-parametric tests are your fall-back when you can't trust the bell-curve assumption. They're less picky — they work with ranks and counts rather than raw values — but they pay a small price in power when the normal assumption actually holds.

Common Mistakes to Avoid

Using chi-square when expected cell counts are below $5$ . Merge adjacent categories or use Fisher's exact test.
Confusing the chi-square test for independence with the Pearson correlation. Chi-square tests association between categorical variables; correlation measures linear association between quantitative ones.
Thinking the Mann-Whitney U test has the same null hypothesis as the $t$-test. It tests whether one group tends to produce larger values than the other (stochastic dominance), not equality of means.
Over-using non-parametric tests. When normality holds and $n$ is large, parametric tests are more powerful. Reserve non-parametric tests for situations where parametric assumptions clearly fail.
Applying the Kruskal-Wallis test without post-hoc comparisons after a significant result. Like ANOVA, a significant $H$ only says the groups differ — you need pairwise tests to say which ones.