Regression & Correlation

Pearson correlation, OLS regression, residuals, and inference for coefficients.

Regression and correlation are tools for understanding relationships between variables. Correlation measures the strength and direction of a linear relationship; regression builds a predictive model.

Simple linear regression fits a line $\hat{y} = b_0 + b_1 x$ to data, minimizing the sum of squared residuals. It is the workhorse of statistical modeling.

These tools are the starting point for more complex models (multiple regression, logistic regression) that power modern data science.

Correlation

The Pearson correlation coefficient $r$ measures the strength and direction of the linear association between two quantitative variables: $r = \frac{\sum(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum(x_i-\bar{x})^2\sum(y_i-\bar{y})^2}}$. It is dimensionless and always lies in $[-1,1]$.
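As a minimal sketch of the formula above, the following pure-Python function computes $r$ from the sums of cross-products and squares. The name `pearson_r` and the five data points are illustrative assumptions, not part of the module.

```python
import math

def pearson_r(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    return s_xy / math.sqrt(s_xx * s_yy)

# Made-up data, for illustration only
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
r = pearson_r(x, y)
print(round(r, 4))  # 0.7746
```

Because both numerator and denominator scale with the units of $x$ and $y$, rescaling either variable by a positive constant leaves `r` unchanged.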

$r = +1$: perfect positive linear relationship; $r = -1$: perfect negative linear relationship; $r = 0$: no linear relationship (a nonlinear one may still exist). Guidelines: $|r| \geq 0.8$ is typically considered strong; $0.5 \leq |r| < 0.8$ moderate; $|r| < 0.5$ weak.

The test of $H_0: \rho = 0$ uses the test statistic $t = r\sqrt{n-2}/\sqrt{1-r^2}$ with $n-2$ degrees of freedom. A statistically significant correlation does not imply a practically important one.
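The test statistic can be sketched in a few lines. The function name `correlation_t_stat` and the sample values ($r = 0.7746$, $n = 5$) are hypothetical; the sketch assumes $|r| < 1$ and $n > 2$, and it leaves the p-value to a $t$ table or a statistics library.

```python
import math

def correlation_t_stat(r, n):
    """Return (t, df) for testing H0: rho = 0; assumes |r| < 1 and n > 2."""
    df = n - 2
    t = r * math.sqrt(df) / math.sqrt(1 - r ** 2)
    return t, df

# Hypothetical sample: r = 0.7746 observed on n = 5 pairs
t_stat, df = correlation_t_stat(0.7746, 5)
print(round(t_stat, 2), df)  # 2.12 3
```

With 3 degrees of freedom the two-sided 5% critical value is about 3.182, so this fairly large $r$ would not be declared significant: with few observations, even a sizeable sample correlation is compatible with $\rho = 0$.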

Correlation does not imply causation. A positive $r$ between ice cream sales and drownings does not mean ice cream causes drownings; both are driven by hot weather (a confounding variable).

Spearman's rank correlation $r_s$ is the Pearson correlation of the ranks of $x$ and $y$. It is robust to outliers and appropriate for ordinal data or monotone (but not necessarily linear) relationships.
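The definition translates directly into code: rank each variable, then apply the Pearson formula to the ranks. The sketch below assumes ties receive the average of the ranks they span (a common convention, not stated in the text); the helper names and the cubic example data are illustrative.

```python
import math

def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson_r(x, y):
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    s_xx = sum((a - x_bar) ** 2 for a in x)
    s_yy = sum((b - y_bar) ** 2 for b in y)
    return s_xy / math.sqrt(s_xx * s_yy)

def spearman_r(x, y):
    """Pearson correlation of the ranks of x and y."""
    return pearson_r(average_ranks(x), average_ranks(y))

# Made-up monotone but nonlinear data: y = x^3
x = [1, 2, 3, 4, 5]
y = [xi ** 3 for xi in x]
r_pearson = pearson_r(x, y)
r_spearman = spearman_r(x, y)
```

Here the ranks of $x$ and $y$ agree exactly, so `r_spearman` is 1 even though `r_pearson` falls below 1 for the curved relationship.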

Simple linear regression

Simple linear regression fits a line $\hat{y} = b_0 + b_1 x$ to data by minimising the sum of squared residuals $\text{SSE} = \sum(y_i-\hat{y}_i)^2$, the ordinary least-squares (OLS) criterion.

OLS estimators: slope $b_1 = r\cdot(s_y/s_x) = \sum(x_i-\bar{x})(y_i-\bar{y})/\sum(x_i-\bar{x})^2$; intercept $b_0 = \bar{y} - b_1\bar{x}$. The line always passes through $(\bar{x}, \bar{y})$.
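A minimal sketch of these estimators in plain Python follows. The function name `fit_line` and the data are illustrative assumptions.

```python
def fit_line(x, y):
    """Return (b0, b1) for the least-squares line y_hat = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = s_xy / s_xx          # slope
    b0 = y_bar - b1 * x_bar   # intercept
    return b0, b1

# Made-up data, for illustration only
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b0, b1 = fit_line(x, y)
print(round(b0, 2), round(b1, 2))  # 2.2 0.6
```

Evaluating the fitted line at $\bar{x} = 3$ gives $2.2 + 0.6 \cdot 3 = 4 = \bar{y}$, which illustrates that the line passes through the point of means.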

Interpretation: $b_1$ is the estimated change in the predicted $y$ for a one-unit increase in $x$ (in simple regression there are no other predictors to hold constant). $b_0$ is the predicted $y$ when $x = 0$; it may lack practical meaning if $x = 0$ is outside the observed range.

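Inference on $b_1$: under the standard assumptions, $b_1 \sim N(\beta_1, \sigma^2/\text{SS}_x)$, where $\text{SS}_x = \sum(x_i-\bar{x})^2$. Estimating $\sigma$ by the standard error of the regression $s$ gives $SE_{b_1} = s/\sqrt{\text{SS}_x}$. Under $H_0: \beta_1 = 0$, the test statistic $t = (b_1-0)/SE_{b_1}$ follows $t_{n-2}$. The $95\%$ CI for $\beta_1$ is $b_1 \pm t^*_{n-2}\cdot SE_{b_1}$.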
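The slope inference can be sketched as below. It assumes $SE_{b_1} = s/\sqrt{\text{SS}_x}$ with $s = \sqrt{\text{SSE}/(n-2)}$, and it takes the critical value $t^*_{n-2}$ as an argument (here 3.182 for 3 degrees of freedom, read from a $t$ table) because the Python standard library has no $t$ distribution. The function name and data are illustrative.

```python
import math

def slope_inference(x, y, t_crit):
    """Return (b1, se_b1, t_stat, ci) for the slope of a simple regression."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_x = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / ss_x
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(sse / (n - 2))      # standard error of the regression
    se_b1 = s / math.sqrt(ss_x)       # standard error of the slope
    t_stat = b1 / se_b1               # tests H0: beta1 = 0
    ci = (b1 - t_crit * se_b1, b1 + t_crit * se_b1)
    return b1, se_b1, t_stat, ci

# Made-up data; 3.182 is the two-sided 5% critical value for 3 df
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b1, se_b1, t_stat, ci = slope_inference(x, y, t_crit=3.182)
```

For these data the slope is 0.6 with a standard error of roughly 0.283, giving $t \approx 2.12$ and an interval of roughly $(-0.30, 1.50)$. The interval contains 0, consistent with $|t|$ falling below the critical value.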

Multiple linear regression extends the model to several predictors: $\hat{y} = b_0 + b_1 x_1 + \cdots + b_k x_k$. Each $b_j$ is the estimated effect of $x_j$ holding the other predictors constant. The OLS estimates are given in matrix form by $\hat{\boldsymbol{\beta}} = (X^TX)^{-1}X^T\mathbf{y}$.
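One way to illustrate the matrix formula without external libraries is to build $X^TX$ and $X^T\mathbf{y}$ and solve the normal equations $X^TX\,\hat{\boldsymbol{\beta}} = X^T\mathbf{y}$ by Gaussian elimination instead of forming the inverse explicitly. This is a teaching sketch under that assumption; the helper names and the noise-free data are hypothetical, and production code would normally rely on a numerical library.

```python
def solve(a, b):
    """Solve a square linear system by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    sol = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[r][c] * sol[c] for c in range(r + 1, n))
        sol[r] = (m[r][n] - tail) / m[r][r]
    return sol

def ols(rows, y):
    """rows: one tuple of predictor values per observation; returns [b0, ..., bk]."""
    X = [[1.0] + list(r) for r in rows]  # prepend the intercept column
    p = len(X[0])
    xtx = [[sum(xi[a] * xi[b] for xi in X) for b in range(p)] for a in range(p)]
    xty = [sum(xi[a] * yi for xi, yi in zip(X, y)) for a in range(p)]
    return solve(xtx, xty)

# Made-up, noise-free data generated from y = 1 + 2*x1 + 3*x2
rows = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)]
y = [1, 3, 4, 6, 8]
coef = ols(rows, y)
```

Because the data lie exactly on a plane, `coef` recovers the generating coefficients 1, 2, and 3 up to floating-point error.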

Residuals and model assessment

The residual $e_i = y_i - \hat{y}_i$ is the difference between observed and fitted value. By construction, $\sum e_i = 0$ and $\sum x_i e_i = 0$.
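Both identities can be checked numerically. The sketch below fits a line to made-up data and sums the residuals; the variable names are illustrative, and the sums are zero only up to floating-point rounding.

```python
# Made-up data, for illustration only
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
      / sum((xi - x_bar) ** 2 for xi in x))
b0 = y_bar - b1 * x_bar

# Residual = observed value minus fitted value
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

sum_e = sum(residuals)                                  # approximately 0
sum_xe = sum(xi * ei for xi, ei in zip(x, residuals))   # approximately 0
```

The two sums correspond to the two normal equations of OLS, one for the intercept and one for the slope.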

$R^2 = 1 - \text{SSE}/\text{SST}$ is the proportion of total variance in $y$ explained by the model. For simple linear regression, $R^2 = r^2$. Adjusted $R^2$ penalises for the number of predictors and is used in multiple regression.
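The identity $R^2 = r^2$ for simple regression can be verified on a small example. This sketch computes $R^2$ from SSE and SST and, separately, the Pearson $r$; the function names and data are illustrative assumptions.

```python
import math

def r_squared(x, y):
    """R^2 = 1 - SSE/SST for the least-squares line through (x, y)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
          / sum((xi - x_bar) ** 2 for xi in x))
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - sse / sst

def pearson_r(x, y):
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    s_xx = sum((a - x_bar) ** 2 for a in x)
    s_yy = sum((b - y_bar) ** 2 for b in y)
    return s_xy / math.sqrt(s_xx * s_yy)

# Made-up data, for illustration only
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
R2 = r_squared(x, y)
r = pearson_r(x, y)
```

For these data SSE is 2.4 and SST is 6, so `R2` is 0.6, matching `r ** 2` up to rounding.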

A residual plot (residuals vs. $\hat{y}$ or vs. $x$) should show a random scatter around $0$ with roughly constant spread. Systematic patterns indicate violations of linearity or homoscedasticity.

A Q-Q plot of residuals checks for normality. The standard error of the regression $s = \sqrt{\text{SSE}/(n-2)}$ estimates the typical size of a residual.
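The numbers behind both diagnostics can be sketched with the standard library. The Q-Q coordinates below pair sorted residuals with standard normal quantiles at plotting positions $(i - 0.5)/n$, which is one common convention and an assumption here. The residuals are those of an illustrative fit to made-up data, and the function names are hypothetical.

```python
import math
from statistics import NormalDist

def regression_std_error(residuals):
    """s = sqrt(SSE / (n - 2)) for a simple linear regression."""
    n = len(residuals)
    return math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))

def qq_points(residuals):
    """Pairs (theoretical normal quantile, ordered residual) for a Q-Q plot."""
    n = len(residuals)
    ordered = sorted(residuals)
    # Plotting positions (i + 0.5) / n for i = 0 .. n-1
    theoretical = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, ordered))

# Residuals from an illustrative fit (made-up data)
residuals = [-0.8, 0.6, 1.0, -0.6, -0.2]
s = regression_std_error(residuals)
points = qq_points(residuals)
```

If the residuals are roughly normal, the pairs in `points` fall close to a straight line when plotted; strong curvature or isolated extreme points suggest a departure from normality.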

Influential observations: leverage measures how unusual an $x$-value is; Cook's distance combines leverage and residual size to measure overall influence on the fitted model.
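The text does not give formulas, so the sketch below assumes the usual textbook forms for simple regression: leverage $h_i = 1/n + (x_i-\bar{x})^2/\text{SS}_x$ and Cook's distance $D_i = \frac{e_i^2}{p\,s^2}\cdot\frac{h_i}{(1-h_i)^2}$ with $p = 2$ estimated parameters. The function name and the data (with one deliberately extreme $x$-value) are illustrative.

```python
def influence_measures(x, y):
    """Return (leverages, cooks_distances) for a simple linear regression."""
    n = len(x)
    p = 2  # parameters estimated: intercept and slope
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_x = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / ss_x
    b0 = y_bar - b1 * x_bar
    resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    s2 = sum(e ** 2 for e in resid) / (n - p)  # residual variance estimate
    lev = [1 / n + (xi - x_bar) ** 2 / ss_x for xi in x]
    cooks = [e ** 2 / (p * s2) * h / (1 - h) ** 2 for e, h in zip(resid, lev)]
    return lev, cooks

# Made-up data: the last point has an unusually large x-value
x = [1, 2, 3, 4, 10]
y = [2, 4, 5, 4, 5]
leverages, cooks = influence_measures(x, y)
most_influential = cooks.index(max(cooks))
```

In this example the last observation has the highest leverage and also the largest Cook's distance, even though its residual is small: the point pulls the line towards itself.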

Conditions for regression (LINE)

Linearity: the mean of $y$ is a linear function of $x$. Check with a scatter plot before fitting.

Independence: observations are independent. Violated by time-series data with autocorrelated errors.

Normality: residuals are approximately normally distributed. Required for exact $t$ and $F$ inference; less critical for large samples by the CLT.

Equal variance (homoscedasticity): the variance of the residuals is constant across all values of $x$. Violated when residuals fan out; check with a residual plot.

Transformations (log, square root) of $x$ or $y$ can often fix non-linearity or non-constant variance and restore the LINE conditions.
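As an illustration of how a log transformation can linearise a relationship, the sketch below generates made-up, noise-free data from $y = 2e^{0.5x}$ and fits a line to $(x, \log y)$. Since $\log y = \log 2 + 0.5x$, the fitted slope and intercept should recover the generating parameters. All names and values are hypothetical.

```python
import math

def fit_line(x, y):
    """Return (b0, b1) for the least-squares line y_hat = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
          / sum((xi - x_bar) ** 2 for xi in x))
    return y_bar - b1 * x_bar, b1

# Made-up exponential data: y = 2 * exp(0.5 * x), no noise
x = [0, 1, 2, 3, 4, 5]
y = [2.0 * math.exp(0.5 * xi) for xi in x]

# Fit on the log scale, where the relationship is linear
log_y = [math.log(yi) for yi in y]
b0, b1 = fit_line(x, log_y)
a_hat = math.exp(b0)  # back-transform the intercept to the original scale
```

With real, noisy data the recovery is only approximate, and the residual plot should be rechecked on the transformed scale.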

Extrapolation and limitations

Extrapolation is predicting $y$ for $x$ values outside the range of the training data. The linear relationship may break down, making extrapolated predictions unreliable.
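A small, deliberately artificial example shows how badly a line can extrapolate. The data follow $y = x^2$ on $x = 1, \dots, 5$ (a made-up curved relationship); the fitted line is adequate inside that range but far off at $x = 20$. The helper names `predict` and `is_extrapolation` are hypothetical.

```python
def fit_line(x, y):
    """Return (b0, b1) for the least-squares line y_hat = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
          / sum((xi - x_bar) ** 2 for xi in x))
    return y_bar - b1 * x_bar, b1

# Made-up curved data: y = x^2 observed only for x in 1..5
x_train = [1, 2, 3, 4, 5]
y_train = [xi ** 2 for xi in x_train]
b0, b1 = fit_line(x_train, y_train)

def predict(x_new):
    return b0 + b1 * x_new

def is_extrapolation(x_new):
    """True when x_new lies outside the observed range of x."""
    return not (min(x_train) <= x_new <= max(x_train))

err_inside = abs(predict(3) - 3 ** 2)     # within the observed range
err_outside = abs(predict(20) - 20 ** 2)  # far outside the observed range
```

The fitted line is $\hat{y} = -7 + 6x$, so the error is 2 at $x = 3$ but 287 at $x = 20$. A range check such as `is_extrapolation` flags the risky prediction, though it cannot say how large the error will be.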

Correlation and regression describe association, not causation. Randomised controlled experiments provide the most direct basis for causal claims; with observational data, causal conclusions require additional assumptions.

Lurking variables (confounders) can create spurious correlations. Always consider whether a third variable might explain an apparent association.
