Sampling and Data Distributions

Population vs. sample, law of large numbers, central limit theorem, and sampling distributions.

Statistics is fundamentally about inferring properties of a large population from a smaller sample. Understanding how sample statistics vary — their sampling distributions — is what makes that inference valid.

The Law of Large Numbers explains why averages stabilize. The Central Limit Theorem explains why the fluctuations of those averages around the true mean are approximately normal, one of the most remarkable facts in all of mathematics.

These results are the engine behind confidence intervals and hypothesis tests: once you know the sampling distribution of a statistic, you can quantify exactly how confident you should be in your conclusions.

Population vs. sample

A population is the complete set of individuals or observations of interest. A parameter (e.g., $\mu$, $\sigma^2$, $p$) is a fixed numerical characteristic of the population, unknown and usually unknowable without a census.

A sample is a subset of the population. A statistic (e.g., $\bar{x}$, $s^2$, $\hat{p}$) is computed from sample data. Statistics are observable but vary from sample to sample: they are random variables.

A statistic $\hat{\theta}$ is an unbiased estimator of parameter $\theta$ if $E(\hat{\theta}) = \theta$. The sample mean $\bar{X}$ is unbiased for $\mu$. The sample variance $s^2 = \frac{1}{n-1}\sum(x_i-\bar{x})^2$ is unbiased for $\sigma^2$.
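
The role of the $n-1$ divisor can be checked by simulation. The sketch below is illustrative only: the population (standard normal, so $\sigma^2 = 1$), the sample size, the trial count, and the function name `compare_variance_estimators` are all assumptions chosen for the demonstration, not part of the module.

```python
import random
import statistics

def compare_variance_estimators(n=5, trials=20000, seed=0):
    """Average the (n - 1)-divisor and n-divisor variance estimates over many samples."""
    rng = random.Random(seed)
    unbiased_total = 0.0
    biased_total = 0.0
    for _ in range(trials):
        # Assumed population: standard normal, so the true variance is 1.
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        unbiased_total += statistics.variance(sample)   # divides by n - 1
        biased_total += statistics.pvariance(sample)    # divides by n
    return unbiased_total / trials, biased_total / trials

unbiased_avg, biased_avg = compare_variance_estimators()
print(f"divide by n-1: {unbiased_avg:.3f}   divide by n: {biased_avg:.3f}")
```

With $n = 5$, the $n$-divisor version should average close to $(n-1)/n \cdot \sigma^2 = 0.8$, while the $n-1$ version should average close to $1$.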

Sampling design: simple random sampling (SRS) gives every sample of size $n$ equal probability. Stratified sampling divides the population into groups (strata) and draws an SRS from each, which improves precision when the strata are internally homogeneous. Cluster sampling and systematic sampling are used when SRS is impractical.
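
To see why stratification can improve precision, the following sketch compares the spread of the estimated mean under SRS and under stratified sampling with proportional allocation. The two-stratum population, its sizes and means, and the helper names `srs_mean` and `stratified_mean` are hypothetical, made up for illustration.

```python
import random
import statistics

def srs_mean(population, n, rng):
    """Estimate the population mean from one simple random sample."""
    return statistics.fmean(rng.sample(population, n))

def stratified_mean(strata, n, rng):
    """Estimate the mean from an SRS within each stratum (proportional allocation)."""
    total = sum(len(stratum) for stratum in strata)
    estimate = 0.0
    for stratum in strata:
        weight = len(stratum) / total
        n_stratum = round(n * weight)
        estimate += weight * statistics.fmean(rng.sample(stratum, n_stratum))
    return estimate

rng = random.Random(0)
# Hypothetical population: two strata with very different means.
stratum_a = [rng.gauss(50, 5) for _ in range(600)]
stratum_b = [rng.gauss(80, 5) for _ in range(400)]
population = stratum_a + stratum_b

srs_estimates = [srs_mean(population, 50, rng) for _ in range(2000)]
strat_estimates = [stratified_mean([stratum_a, stratum_b], 50, rng) for _ in range(2000)]

srs_sd = statistics.stdev(srs_estimates)
strat_sd = statistics.stdev(strat_estimates)
print(f"SD of SRS estimates:        {srs_sd:.2f}")
print(f"SD of stratified estimates: {strat_sd:.2f}")
```

Because most of the variation in this made-up population lies between the strata rather than within them, the stratified estimates should cluster noticeably more tightly around the true mean.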

Bias in sampling: a convenience sample (whoever is available) or voluntary response sample systematically misrepresents the population. No amount of statistical analysis can fix a biased design — always randomise.

The Law of Large Numbers

Weak Law of Large Numbers: for i.i.d. random variables $X_1, X_2, \ldots$ with mean $\mu$, the sample mean $\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i$ converges in probability to $\mu$: for any $\varepsilon > 0$, $P(|\bar{X}_n - \mu| > \varepsilon) \to 0$ as $n \to \infty$.

The Strong Law of Large Numbers guarantees almost sure convergence: $P(\bar{X}_n \to \mu) = 1$. Almost every sequence of i.i.d. observations has a time-average that converges to the true mean.

The LLN justifies using observed frequencies as probability estimates, trusting large studies over small ones, and the general principle that more data gives better estimates.
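
A quick simulation makes the stabilising average visible. This is a minimal sketch assuming a fair six-sided die (true mean $3.5$); the function name `running_means`, the checkpoints, and the seed are arbitrary choices for illustration.

```python
import random

def running_means(n_rolls, checkpoints, seed=1):
    """Roll a fair die n_rolls times and record the running mean at each checkpoint."""
    rng = random.Random(seed)
    total = 0
    result = {}
    for i in range(1, n_rolls + 1):
        total += rng.randint(1, 6)
        if i in checkpoints:
            result[i] = total / i
    return result

means = running_means(100_000, {10, 100, 1_000, 10_000, 100_000})
for n_so_far, mean in sorted(means.items()):
    print(f"n = {n_so_far:>7}: running mean = {mean:.4f}")
```

The early running means can sit well away from $3.5$, while the later ones should settle close to it.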

The LLN does not imply the 'gambler's fallacy.' After $10$ coin flips all landing heads, the next flip is still $50/50$. The LLN talks about the eventual average, not compensation of past outcomes.
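
The absence of any "compensation" can be checked empirically. The sketch below assumes a fair coin and, to get enough qualifying events in a modest simulation, looks at flips that follow a streak of at least 5 heads rather than 10; the function name `heads_after_streak` and all parameters are illustrative assumptions.

```python
import random

def heads_after_streak(n_flips=200_000, streak=5, seed=2):
    """Return the fraction of heads among flips that follow `streak` or more heads."""
    rng = random.Random(seed)
    run = 0               # current run of consecutive heads
    followups = 0         # flips that came right after a long enough streak
    heads_followups = 0   # how many of those flips were heads
    for _ in range(n_flips):
        is_heads = rng.random() < 0.5
        if run >= streak:
            followups += 1
            heads_followups += is_heads
        run = run + 1 if is_heads else 0
    return heads_followups / followups, followups

fraction, count = heads_after_streak()
print(f"{count} flips followed a streak; fraction heads = {fraction:.3f}")
```

The fraction should stay near $0.5$: a streak of heads does not make tails any more likely on the next flip.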

The Central Limit Theorem

Central Limit Theorem (CLT): if $X_1, \ldots, X_n$ are i.i.d. with mean $\mu$ and finite variance $\sigma^2$, then the standardised mean $(\bar{X}-\mu)/(\sigma/\sqrt{n}) \to N(0,1)$ in distribution as $n\to\infty$. Equivalently, $\bar{X} \approx N(\mu, \sigma^2/n)$ for large $n$.

This is remarkable because the result holds regardless of the population's shape (skewed, bimodal, or uniform), provided the variance is finite. Averaging progressively washes out non-normality as $n$ grows.
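
The theorem can be illustrated with a strongly skewed population. The sketch below assumes an Exponential(1) population (mean $1$, standard deviation $1$) and samples of size $n = 50$; these choices, the function name `simulate_sample_means`, and the seed are assumptions made for the demonstration.

```python
import math
import random
import statistics

def simulate_sample_means(n=50, reps=5000, seed=3):
    """Draw `reps` samples of size n from Exponential(1) and return their means."""
    rng = random.Random(seed)
    return [statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
            for _ in range(reps)]

n = 50
means = simulate_sample_means(n=n)
mu, sigma = 1.0, 1.0            # mean and SD of the assumed Exponential(1) population
se = sigma / math.sqrt(n)       # CLT prediction for the SD of the sample mean

# Under N(mu, se^2), about 95% of sample means fall within 1.96 * se of mu.
within = sum(abs(m - mu) <= 1.96 * se for m in means) / len(means)

print(f"mean of sample means: {statistics.fmean(means):.3f} (CLT predicts {mu})")
print(f"SD of sample means:   {statistics.stdev(means):.3f} (CLT predicts {se:.3f})")
print(f"fraction within 1.96 SE of mu: {within:.3f} (normal theory: about 0.95)")
```

Even though individual exponential draws are far from normal, the simulated sample means should match the normal predictions reasonably well at this sample size.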

The standard error of the mean is $SE = \sigma/\sqrt{n}$. Doubling $n$ reduces $SE$ by a factor of $\sqrt{2}$, not $2$. To halve the standard error you need to quadruple the sample size.
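
The square-root scaling is easy to verify numerically. In this small sketch, $\sigma = 10$ is a hypothetical population standard deviation and `standard_error` is an illustrative helper name.

```python
import math

def standard_error(sigma, n):
    """Standard error of the mean for population SD sigma and sample size n."""
    return sigma / math.sqrt(n)

sigma = 10.0  # hypothetical population standard deviation
se_100 = standard_error(sigma, 100)
se_200 = standard_error(sigma, 200)   # doubled sample size
se_400 = standard_error(sigma, 400)   # quadrupled sample size

print(f"n=100: SE={se_100:.4f}")
print(f"n=200: SE={se_200:.4f}  (ratio to n=100: {se_100 / se_200:.4f})")
print(f"n=400: SE={se_400:.4f}  (ratio to n=100: {se_100 / se_400:.4f})")
```

Going from $n = 100$ to $n = 200$ shrinks the standard error by a factor of $\sqrt{2} \approx 1.414$; only at $n = 400$ is it halved.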

Rule of thumb: $n \geq 30$ is often sufficient for the normal approximation to be good, but heavily skewed populations may require $n \geq 100$ or more. Always check with a histogram of the data.

The CLT also applies to sums: $S_n = \sum X_i \approx N(n\mu, n\sigma^2)$. And to proportions: $\hat{p} = X/n \approx N(p, p(1-p)/n)$ when $np \geq 10$ and $n(1-p) \geq 10$.
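
As a rough check on the proportion approximation, one can compare an exact binomial probability with its normal counterpart. The sketch below uses hypothetical values $n = 100$, $p = 0.3$, and the event $\hat{p} \leq 0.35$; the helper names are illustrative, and no continuity correction is applied, so a small gap between the two numbers is expected.

```python
import math
from statistics import NormalDist

def binomial_cdf(k, n, p):
    """Exact P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def normal_approx_cdf(k, n, p):
    """P(p_hat <= k/n) using p_hat ~ N(p, p(1-p)/n), without continuity correction."""
    return NormalDist(mu=p, sigma=math.sqrt(p * (1 - p) / n)).cdf(k / n)

def approximation_conditions_met(n, p):
    """Check the rule of thumb np >= 10 and n(1 - p) >= 10."""
    return n * p >= 10 and n * (1 - p) >= 10

n, p, k = 100, 0.3, 35   # hypothetical values for illustration
exact = binomial_cdf(k, n, p)
approx = normal_approx_cdf(k, n, p)

print(f"conditions met: {approximation_conditions_met(n, p)}")
print(f"exact binomial:       {exact:.4f}")
print(f"normal approximation: {approx:.4f}")
```

When the conditions fail (for example $n = 20$, $p = 0.1$), the binomial distribution is visibly skewed and the normal approximation should not be trusted.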

Sampling distribution of the proportion

The sample proportion $\hat{p} = X/n$ (where $X$ counts successes in $n$ Bernoulli trials) estimates the population proportion $p$.

Mean: $E(\hat{p}) = p$ (unbiased). Standard error: $SE(\hat{p}) = \sqrt{p(1-p)/n}$.
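
Both formulas can be checked by simulating many samples. This sketch assumes $p = 0.3$ and $n = 200$, values chosen only for illustration; `simulate_p_hat` is a hypothetical helper name.

```python
import math
import random
import statistics

def simulate_p_hat(n=200, p=0.3, reps=5000, seed=4):
    """Return `reps` simulated sample proportions from n Bernoulli(p) trials each."""
    rng = random.Random(seed)
    return [sum(rng.random() < p for _ in range(n)) / n for _ in range(reps)]

n, p = 200, 0.3
p_hats = simulate_p_hat(n=n, p=p)
theoretical_se = math.sqrt(p * (1 - p) / n)

print(f"mean of p_hat: {statistics.fmean(p_hats):.4f} (theory: {p})")
print(f"SD of p_hat:   {statistics.stdev(p_hats):.4f} (theory: {theoretical_se:.4f})")
```

The simulated mean should sit close to $p$ and the simulated standard deviation close to $\sqrt{p(1-p)/n}$.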

By the CLT: $\hat{p} \approx N(p, p(1-p)/n)$ when $np \geq 10$ and $n(1-p) \geq 10$.

The sampling distribution of the difference $\hat{p}_1 - \hat{p}_2$ (from two independent samples) is approximately $N(p_1-p_2, p_1(1-p_1)/n_1 + p_2(1-p_2)/n_2)$. This is used for two-proportion $z$-tests and $z$-intervals.
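
A two-proportion $z$-interval built on this distribution might be sketched as follows. Since $p_1$ and $p_2$ are unknown in practice, the sketch substitutes the sample proportions into the variance formula, which is a common convention but an assumption here; the counts (120 of 200 versus 90 of 200) and the function name `two_proportion_z_interval` are hypothetical.

```python
import math
from statistics import NormalDist

def two_proportion_z_interval(x1, n1, x2, n2, confidence=0.95):
    """Approximate confidence interval for p1 - p2 from two independent samples."""
    p1_hat, p2_hat = x1 / n1, x2 / n2
    # Plug-in standard error: sample proportions stand in for the unknown p1, p2.
    se = math.sqrt(p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    diff = p1_hat - p2_hat
    return diff - z * se, diff + z * se

# Hypothetical data: 120 successes out of 200 versus 90 out of 200.
low, high = two_proportion_z_interval(120, 200, 90, 200)
print(f"95% interval for p1 - p2: ({low:.4f}, {high:.4f})")
```

If the interval excludes $0$, as it should for these made-up counts, the data are inconsistent with equal proportions at the corresponding significance level.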
