Regression and correlation are tools for understanding relationships between variables. Correlation measures the strength and direction of a linear relationship; regression builds a predictive model.
Simple linear regression fits a line to data, minimizing the sum of squared residuals. It is the workhorse of statistical modeling.
These tools are the starting point for more complex models (multiple regression, logistic regression) that power modern data science.
Correlation
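The Pearson correlation coefficient $r$ measures the strength and direction of the linear association between two quantitative variables: $r = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}}$. It is dimensionless, always in $[-1, 1]$.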
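$r = 1$: perfect positive linear relationship; $r = -1$: perfect negative linear relationship; $r = 0$: no linear relationship (a nonlinear one may still exist). Guidelines: values of $|r|$ near $1$ are typically considered strong, intermediate values moderate, and values near $0$ weak; the exact cutoffs are conventions that vary by field.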
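The test of $H_0: \rho = 0$ uses the test statistic $t = \dfrac{r\sqrt{n-2}}{\sqrt{1-r^2}}$ with $n - 2$ degrees of freedom. A statistically significant correlation does not imply a practically important one: with a large enough sample, even a very small $r$ can be significant.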
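The correlation coefficient and its test statistic can be computed directly from the definitions. The following is a minimal sketch using only the Python standard library; the function names `pearson_r` and `correlation_t` and the five data points are hypothetical, chosen for illustration.

```python
import math

def pearson_r(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    return s_xy / math.sqrt(s_xx * s_yy)

def correlation_t(r, n):
    """t statistic for H0: rho = 0; compare against t with n - 2 df."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = pearson_r(x, y)
t = correlation_t(r, len(x))
print(f"r = {r:.4f}, t = {t:.4f}, df = {len(x) - 2}")
# prints: r = 0.7746, t = 2.1213, df = 3
```

With only five points, a fairly large $r$ still gives a modest $t$ statistic, which shows how strongly the test depends on the sample size.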
Correlation does not imply causation. A positive $r$ between ice cream sales and drownings does not mean ice cream causes drownings; both are driven by hot weather (a confounding variable).
Spearman's rank correlation is the Pearson correlation of the ranks of $x$ and $y$. It is robust to outliers and appropriate for ordinal data or monotone (but not necessarily linear) relationships.
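To make the "Pearson correlation of the ranks" idea concrete, here is a small sketch in plain Python. The helper names (`average_ranks`, `pearson_r`, `spearman_r`) are hypothetical, ties are handled by giving tied values the mean of their ranks (a common convention, assumed here), and the data are invented to follow $y = x^3$, a monotone but nonlinear pattern.

```python
import math

def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson_r(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_yy = sum((yi - y_bar) ** 2 for yi in y)
    return s_xy / math.sqrt(s_xx * s_yy)

def spearman_r(x, y):
    """Spearman correlation: Pearson correlation applied to the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))

# Hypothetical monotone, nonlinear data: y = x**3.
x = [1, 2, 3, 4, 5]
y = [xi ** 3 for xi in x]

pearson_value = pearson_r(x, y)
spearman_value = spearman_r(x, y)
print(f"Pearson  = {pearson_value:.4f}")
print(f"Spearman = {spearman_value:.4f}")
```

Because the relationship is perfectly monotone, the rank correlation is 1, while the Pearson coefficient falls below 1 because the points do not lie on a straight line.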
💡Explain it simply
Correlation asks: when one variable goes up, does the other tend to go up too ($r > 0$), go down ($r < 0$), or do its own thing ($r \approx 0$)? A value of $r$ close to $1$ means the two variables move together almost perfectly in a straight-line pattern.
Simple linear regression
Simple linear regression fits a line $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$ to data by minimising the sum of squared residuals $\sum (y_i - \hat{y}_i)^2$, the ordinary least-squares (OLS) criterion.
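OLS estimators: slope $\hat{\beta}_1 = \dfrac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} = r \, \dfrac{s_y}{s_x}$; intercept $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$. The line always passes through $(\bar{x}, \bar{y})$.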
Interpretation: $\hat{\beta}_1$ is the estimated change in predicted $y$ for a one-unit increase in $x$, holding all else equal (in simple regression, 'all else' is trivially satisfied). $\hat{\beta}_0$ is the predicted $y$ when $x = 0$; it may lack practical meaning if $x = 0$ is outside the observed range.
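Inference on $\beta_1$: under the standard assumptions, $\hat{\beta}_1 \sim N\!\left(\beta_1, \dfrac{\sigma^2}{\sum (x_i - \bar{x})^2}\right)$. The test statistic $t = \hat{\beta}_1 / \mathrm{SE}(\hat{\beta}_1)$ follows $t_{n-2}$ under $H_0: \beta_1 = 0$. The CI for $\beta_1$ is $\hat{\beta}_1 \pm t^{*}_{n-2} \, \mathrm{SE}(\hat{\beta}_1)$, where $t^{*}_{n-2}$ is the critical value for the chosen confidence level.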
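The slope's standard error, $t$ statistic, and confidence interval can be sketched in a few lines. This is an illustrative sketch, not a definitive implementation: the function name `slope_inference` and the data are hypothetical, the standard error is taken as $s / \sqrt{\sum (x_i - \bar{x})^2}$ with $s^2 = \mathrm{SSE}/(n-2)$, and the critical value 3.182 (95% confidence, 3 degrees of freedom) is assumed to be read from a $t$ table, since the standard library has no $t$ quantile function.

```python
import math

def slope_inference(x, y, t_crit):
    """Return (slope, standard error, t statistic, confidence interval).

    t_crit is the t critical value for n - 2 degrees of freedom,
    supplied by the caller (for example, from a t table).
    """
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar

    # Residual standard error with n - 2 degrees of freedom.
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(sse / (n - 2))

    se_b1 = s / math.sqrt(s_xx)
    t_stat = b1 / se_b1  # tests H0: beta_1 = 0
    ci = (b1 - t_crit * se_b1, b1 + t_crit * se_b1)
    return b1, se_b1, t_stat, ci

# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

# Assumed table value: two-sided 95% critical value for df = 3.
b1, se_b1, t_stat, ci = slope_inference(x, y, t_crit=3.182)
print(f"slope = {b1:.3f}, SE = {se_b1:.3f}, t = {t_stat:.3f}")
print(f"95% CI: ({ci[0]:.3f}, {ci[1]:.3f})")
# prints: slope = 0.600, SE = 0.283, t = 2.121
# prints: 95% CI: (-0.300, 1.500)
```

The interval contains 0, so with these five points the slope is not significantly different from zero at the 5% level, even though the point estimate is positive.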
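Multiple linear regression extends the model to several predictors: $y = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p + \varepsilon$. Each $\hat{\beta}_j$ is the estimated effect of $x_j$ holding the other predictors constant. The OLS estimates are given in matrix form by $\hat{\boldsymbol{\beta}} = (X^\top X)^{-1} X^\top \mathbf{y}$.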
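The matrix formula can be illustrated without any external library by solving the normal equations $(X^\top X)\,\boldsymbol{\beta} = X^\top \mathbf{y}$ with Gaussian elimination. This is a teaching sketch under stated assumptions: the function names `solve_linear_system` and `ols` are hypothetical, the data are noise-free and generated from $y = 1 + 2x_1 + 3x_2$ so the true coefficients are known, and the system is solved directly rather than by forming the inverse. Numerical libraries generally prefer more stable decompositions such as QR.

```python
def solve_linear_system(a, b):
    """Solve a * beta = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are not modified.
    m = [row[:] + [b_i] for row, b_i in zip(a, b)]
    for col in range(n):
        # Swap in the row with the largest pivot for numerical stability.
        pivot = max(range(col, n), key=lambda row_idx: abs(m[row_idx][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for row_idx in range(col + 1, n):
            factor = m[row_idx][col] / m[col][col]
            for c in range(col, n + 1):
                m[row_idx][c] -= factor * m[col][c]
    # Back substitution.
    beta = [0.0] * n
    for row_idx in range(n - 1, -1, -1):
        tail = sum(m[row_idx][c] * beta[c] for c in range(row_idx + 1, n))
        beta[row_idx] = (m[row_idx][n] - tail) / m[row_idx][row_idx]
    return beta

def ols(predictors, y):
    """OLS coefficients [intercept, b1, ..., bp] via the normal equations."""
    design = [[1.0] + list(row) for row in predictors]  # prepend intercept column
    p = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(p)]
           for i in range(p)]
    xty = [sum(row[i] * yi for row, yi in zip(design, y)) for i in range(p)]
    return solve_linear_system(xtx, xty)

# Hypothetical noise-free data generated from y = 1 + 2*x1 + 3*x2.
predictors = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 6)]
y = [1 + 2 * x1 + 3 * x2 for x1, x2 in predictors]

beta_hat = ols(predictors, y)
print([round(b, 6) for b in beta_hat])
```

Because the data contain no noise, the estimates recover the generating coefficients 1, 2, and 3 up to floating-point rounding.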
💡Explain it simply
Regression fits the 'best straight line' through a scatter plot. 'Best' means the line minimises the total squared vertical distance from the points to the line. The slope tells you the predicted change in $y$ per unit increase in $x$.
Computing and interpreting regression coefficients
- Data: , . , .
- .
- .
- . .
- Model: . Interpretation: each unit increase in is associated with a -unit increase in predicted .
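The same sequence of steps (means, sums of squares and cross-products, slope, intercept, fitted model) can be sketched in Python. The data below are hypothetical and are not the numbers from the worked example above; the function name `fit_line` is likewise an illustrative choice.

```python
def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

intercept, slope = fit_line(x, y)
print(f"y_hat = {intercept:.1f} + {slope:.1f} x")
# prints: y_hat = 2.2 + 0.6 x

# The fitted line passes through the point of means (3, 4).
x_bar, y_bar = sum(x) / len(x), sum(y) / len(y)
print(abs((intercept + slope * x_bar) - y_bar) < 1e-12)
# prints: True
```

For this invented data set, each unit increase in $x$ is associated with a 0.6-unit increase in predicted $y$.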
Residuals and model assessment
The residual $e_i = y_i - \hat{y}_i$ is the difference between the observed and fitted value. By construction (for a model with an intercept), $\sum e_i = 0$ and $\sum x_i e_i = 0$.
$R^2 = 1 - \dfrac{\sum (y_i - \hat{y}_i)^2}{\sum (y_i - \bar{y})^2}$ is the proportion of total variance in $y$ explained by the model. For simple linear regression, $R^2 = r^2$. Adjusted $R^2$ penalises for the number of predictors and is used in multiple regression.
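These identities are easy to check numerically. The sketch below fits a least-squares line to hypothetical data, then verifies that the residuals sum to zero, that they are orthogonal to $x$, and that $R^2$ equals the squared Pearson correlation. All variable names are illustrative.

```python
import math

# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)  # total sum of squares
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

slope = s_xy / s_xx
intercept = y_bar - slope * x_bar

residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
sse = sum(e ** 2 for e in residuals)  # residual sum of squares

r = s_xy / math.sqrt(s_xx * s_yy)
r_squared = 1 - sse / s_yy

sum_e = sum(residuals)
sum_xe = sum(xi * e for xi, e in zip(x, residuals))

print(f"sum of residuals       = {sum_e:.1e}")
print(f"sum of x_i * residual  = {sum_xe:.1e}")
print(f"R^2 = {r_squared:.4f}, r^2 = {r ** 2:.4f}")
```

The first two quantities are zero up to floating-point rounding, and the last line shows the two values agreeing (0.6 for this data set).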
A residual plot (residuals vs. $\hat{y}$ or vs. $x$) should show a random scatter around $0$ with roughly constant spread. Systematic patterns indicate violations of linearity or homoscedasticity.
A Q-Q plot of residuals checks for normality. The standard error of the regression, $s = \sqrt{\sum (y_i - \hat{y}_i)^2 / (n - 2)}$ in simple linear regression, estimates the typical size of a residual.
Influential observations: leverage measures how unusual an $x$-value is; Cook's distance combines leverage and residual size to measure overall influence on the fitted model.
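For simple linear regression both diagnostics have closed forms, sketched below under stated assumptions: leverage is taken as $h_i = 1/n + (x_i - \bar{x})^2 / \sum (x_j - \bar{x})^2$, and Cook's distance as $D_i = \dfrac{e_i^2}{p\,s^2} \cdot \dfrac{h_i}{(1 - h_i)^2}$ with $p = 2$ estimated coefficients and $s^2 = \mathrm{SSE}/(n - p)$. The function name `influence_measures` and the data are hypothetical.

```python
def influence_measures(x, y):
    """Return (leverages, cooks_distances) for a simple linear regression."""
    n = len(x)
    p = 2  # estimated coefficients: intercept and slope
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

    # Squared standard error of the regression.
    s2 = sum(e ** 2 for e in residuals) / (n - p)

    # Leverage grows with the squared distance of x_i from the mean of x.
    leverages = [1 / n + (xi - x_bar) ** 2 / s_xx for xi in x]

    # Cook's distance combines residual size and leverage.
    cooks = [(e ** 2 / (p * s2)) * h / (1 - h) ** 2
             for e, h in zip(residuals, leverages)]
    return leverages, cooks

# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

leverages, cooks = influence_measures(x, y)
for xi, h, d in zip(x, leverages, cooks):
    print(f"x = {xi}: leverage = {h:.2f}, Cook's D = {d:.3f}")
```

In this data set the first point has both high leverage (an extreme $x$) and a sizeable residual, so it has the largest Cook's distance. The leverages always sum to the number of estimated coefficients, here 2.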
💡Explain it simply
Residuals are the errors your model makes: how far each prediction is from the observed value. A good model has residuals that look like pure noise: no pattern, roughly the same size everywhere. If the residuals fan out (get bigger for larger $x$), or show a curve, the model needs revision.
Conditions for regression (LINE)
Linearity: the mean of $y$ is a linear function of $x$. Check with a scatter plot before fitting.
Independence: observations are independent. Violated by time-series data with autocorrelated errors.
Normality: residuals are approximately normally distributed. Required for exact $t$ and $F$ inference; less critical for large samples by the CLT.
Equal variance (homoscedasticity): the variance of the residuals is constant across all values of $x$. Violated when residuals fan out; check with a residual plot.
Transformations (log, square root) of $x$ or $y$ can often fix non-linearity or non-constant variance and restore the LINE conditions.
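As an illustration of a transformation restoring linearity, the sketch below takes hypothetical data that follow an exponential curve $y = a\,e^{bx}$ and fits a straight line to $\log y$ instead, since $\log y = \log a + b x$ is linear in $x$. The values $a = 2$ and $b = 0.5$, the noise-free data, and the helper name `fit_line` are assumptions made for the example.

```python
import math

def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope

# Hypothetical noise-free data generated from y = 2 * exp(0.5 * x).
x = [0, 1, 2, 3, 4]
y = [2 * math.exp(0.5 * xi) for xi in x]

# Fit on the log scale, where the relationship is a straight line.
log_y = [math.log(yi) for yi in y]
log_intercept, b = fit_line(x, log_y)
a = math.exp(log_intercept)  # back-transform the intercept

print(f"a = {a:.4f}, b = {b:.4f}")
```

The fit recovers the generating values of $a$ and $b$ up to rounding. With real, noisy data the residuals should be checked on the transformed scale, because that is where the LINE conditions are assumed to hold.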
💡Explain it simply
The LINE acronym is the checklist for valid regression: the relationship must be Linear, observations Independent, residuals Normally distributed, and the spread of residuals Equal across $x$-values. Violating any condition may invalidate $p$-values and confidence intervals.
Extrapolation and limitations
Extrapolation is predicting $y$ for $x$-values outside the range of the training data. The linear relationship may break down, making extrapolated predictions unreliable.
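A small numerical sketch shows how badly a line can fail outside the observed range. The data here are hypothetical and deliberately curved ($y = x^2$ observed only for $x$ from 0 to 3); the helper names `fit_line` and `predict` are illustrative, and `predict` simply flags any request that falls outside the range seen during fitting.

```python
def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope

def predict(x_new, intercept, slope, x_min, x_max):
    """Return (prediction, is_extrapolation) for a single x value."""
    is_extrapolation = not (x_min <= x_new <= x_max)
    return intercept + slope * x_new, is_extrapolation

# Hypothetical data: the true relationship is y = x**2, seen only on [0, 3].
x = [0, 1, 2, 3]
y = [xi ** 2 for xi in x]
intercept, slope = fit_line(x, y)

results = {}
for x_new in (2, 10):
    y_hat, flagged = predict(x_new, intercept, slope, min(x), max(x))
    results[x_new] = (y_hat, flagged)
    print(f"x = {x_new}: predicted {y_hat:.1f}, true {x_new ** 2}, "
          f"extrapolation = {flagged}")
# prints: x = 2: predicted 5.0, true 4, extrapolation = False
# prints: x = 10: predicted 29.0, true 100, extrapolation = True
```

Inside the observed range the line is a tolerable approximation; at $x = 10$ it misses the true value by a wide margin, and nothing in the fitted model itself would have warned of that.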
Correlation and regression describe association, not causation. Only randomised controlled experiments allow causal claims.
Lurking variables (confounders) can create spurious correlations. Always consider whether a third variable might explain an apparent association.
💡Explain it simply
Using a regression line to predict outside the range you measured is like extending a map you drew by hand: you know the territory you measured, but beyond the edge of your map, things might look completely different.
Common Mistakes to Avoid
- Claiming causation from a regression. Regression shows association, not causation, unless the data come from a randomised experiment.
- Extrapolating far beyond the range of the data. The model is reliable only near the observed range.
- Interpreting $R^2$ as the correlation. $R^2 = r^2$, so $|r| = \sqrt{R^2}$, not $R^2$.
- Ignoring residual plots. A high $R^2$ does not guarantee LINE conditions are met; always check the residuals.
- Comparing $R^2$ across models with different numbers of predictors. $R^2$ never decreases when a predictor is added, so use adjusted $R^2$ or information criteria (AIC, BIC) instead.
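The last point can be made concrete with the usual adjusted $R^2$ formula, $1 - (1 - R^2)\dfrac{n - 1}{n - p - 1}$, where $p$ is the number of predictors excluding the intercept. The sketch below uses hypothetical $R^2$ values for two nested models fitted to the same 20 observations; the function name `adjusted_r_squared` is illustrative.

```python
def adjusted_r_squared(r_squared, n, p):
    """Adjusted R^2 for n observations and p predictors (excluding intercept)."""
    return 1 - (1 - r_squared) * (n - 1) / (n - p - 1)

# Hypothetical fits on the same n = 20 observations:
# adding a second predictor nudges R^2 from 0.60 up to 0.61.
adj_one = adjusted_r_squared(0.60, n=20, p=1)
adj_two = adjusted_r_squared(0.61, n=20, p=2)

print(f"1 predictor:  R^2 = 0.60, adjusted = {adj_one:.4f}")
print(f"2 predictors: R^2 = 0.61, adjusted = {adj_two:.4f}")
# prints: 1 predictor:  R^2 = 0.60, adjusted = 0.5778
# prints: 2 predictors: R^2 = 0.61, adjusted = 0.5641
```

Raw $R^2$ favours the larger model, while adjusted $R^2$ goes down, indicating that the extra predictor does not earn its place in this invented comparison.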