Descriptive statistics is the art of summarizing raw data into meaningful numbers and visuals. Before you can make predictions or test theories, you need to know what your data looks like.
The two big questions are always the same: where is the center of the data, and how spread out is it? Answering these two questions well tells you most of what you need to know.
We will cover measures of central tendency (mean, median, mode), measures of spread (range, variance, standard deviation), and how to visualize distributions.
Measures of central tendency
The mean (arithmetic average) is $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$. It uses every data point, which makes it sensitive to extreme values (outliers). One unusually large value can pull the mean far from the typical observation.
The median is the middle value when data is sorted in order. For an even number of observations, it is the average of the two middle values. The median is resistant to outliers: adding one extreme value does not change it much. This makes it the preferred measure of center for skewed data such as income, house prices, or reaction times.
The mode is the most frequently occurring value. A dataset can be unimodal (one mode), bimodal (two modes), or multimodal. The mode is the only measure of center applicable to categorical data — you can find the most common hair colour, but a 'mean hair colour' is meaningless.
Relationship to skewness: for a unimodal right-skewed (positively skewed) distribution, the mean is typically greater than the median, which is typically greater than the mode. For a left-skewed distribution the order usually reverses. For a perfectly symmetric unimodal distribution, all three coincide.
The weighted mean: $\bar{x}_w = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$. Used when different observations carry different importance, for example when computing a course grade where assignments, midterm, and final carry different percentage weights.
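The course-grade case can be sketched in a few lines of Python. The function name `weighted_mean`, the scores, and the weights are all hypothetical, chosen only to illustrate the formula.

```python
def weighted_mean(values, weights):
    """Return sum(w_i * x_i) / sum(w_i)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for x, w in zip(values, weights)) / total_weight


# Hypothetical course: assignments 30%, midterm 30%, final 40%.
scores = [80, 70, 90]
weights = [30, 30, 40]

course_grade = weighted_mean(scores, weights)  # 8100 / 100 = 81.0
plain_mean = sum(scores) / len(scores)         # 240 / 3 = 80.0
print(course_grade, plain_mean)
```

The weighted grade is higher than the plain mean here because the strongest score carries the largest weight.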
💡Explain it simply
You and four friends have different amounts of candy: . The mean () is how much each would get if you pooled and shared equally. The median () is what the middle person has when you line up from least to most. The mode () is what the most people have.
The mean is like a balance point: put weights at each value and the mean is where the seesaw balances. One very heavy weight (an outlier) yanks the balance point toward it. The median ignores extreme values and just asks 'who is in the exact middle?'
Mean, median, mode
- Data: .
- Mean: .
- Sorted: . Median: middle value .
- Mode: (appears twice, all others once).
- The mean () and median () differ — the value pulls the mean down slightly.
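The same three measures can be computed with Python's standard `statistics` module. The data below are hypothetical illustrative values, not necessarily those of the worked example.

```python
import statistics

# Hypothetical data, chosen only for illustration.
data = [3, 5, 5, 7, 10]

mean = statistics.mean(data)      # (3 + 5 + 5 + 7 + 10) / 5 = 6
median = statistics.median(data)  # middle value of the sorted list = 5
mode = statistics.mode(data)      # most frequent value = 5

# multimode returns every value tied for most frequent,
# which is useful for bimodal or multimodal data.
modes = statistics.multimode([1, 1, 2, 2, 3])  # [1, 2]

print(mean, median, mode, modes)
```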
Measures of spread
The range, $\max - \min$, is the simplest measure of spread. It uses only two data points and is badly distorted by a single outlier. If the maximum salary in a sample is a CEO at $5,000,000, the range tells you almost nothing about the typical spread.
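Variance measures the average squared deviation from the mean. Population variance: $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$. Sample variance uses $n-1$ in the denominator: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$. The $n-1$ correction (Bessel's correction) makes $s^2$ an unbiased estimator of $\sigma^2$.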
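Standard deviation, $s = \sqrt{s^2}$ (or $\sigma = \sqrt{\sigma^2}$ for a population), is the square root of the variance, which restores the original units. If the data are measured in kg, the standard deviation is also in kg and describes roughly how far a typical observation lies from the mean. Unlike variance, you can directly compare the standard deviation to the data values.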
The interquartile range $\text{IQR} = Q_3 - Q_1$ captures the spread of the middle $50\%$ of the data. It is unaffected by the most extreme values and is the natural measure of spread to pair with the median.
The coefficient of variation $\text{CV} = s / \bar{x}$ expresses the standard deviation as a fraction of the mean. It allows comparing spread between datasets measured on different scales, e.g., heights of adults vs. heights of buildings.
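As a rough sketch, all of these measures of spread can be computed with the standard `statistics` module. The data are hypothetical. Quartile conventions differ between textbooks and software, so the IQR below reflects the module's default "exclusive" method and may not match a hand calculation that uses a different rule.

```python
import statistics

# Hypothetical measurements in kg.
data = [2, 4, 4, 5, 7, 8]

data_range = max(data) - min(data)     # 8 - 2 = 6
pop_var = statistics.pvariance(data)   # divides by N:     24 / 6 = 4
samp_var = statistics.variance(data)   # divides by n - 1: 24 / 5 = 4.8
samp_sd = statistics.stdev(data)       # sqrt(4.8), back in kg

# Quartiles with the module's default "exclusive" interpolation.
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1                          # 7.25 - 3.5 = 3.75

# Coefficient of variation: unitless ratio of spread to mean.
cv = samp_sd / statistics.mean(data)

print(data_range, pop_var, samp_var, round(samp_sd, 3), iqr, round(cv, 3))
```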
💡Explain it simply
Standard deviation tells you how scattered the data is around the mean. A small standard deviation means everyone is close to the average (a tight cluster). A large one means values are all over the place. Think of the mean as the centre of a target and the standard deviation as roughly the typical distance from the bullseye.
Why divide by $n-1$ for sample variance? When you estimate the mean from the same data, the deviations from $\bar{x}$ are systematically a tiny bit smaller than deviations from the true $\mu$. Dividing by $n-1$ instead of $n$ corrects for that bias.
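One way to see the bias without any randomness is to enumerate every possible sample from a tiny population and average the two estimators exactly. The population, the sample size, and the helper name `sum_sq_dev` below are hypothetical choices for illustration, and samples are drawn with replacement so the draws are independent.

```python
from fractions import Fraction
from itertools import product

population = [1, 2, 3, 4]  # hypothetical tiny population
N = len(population)
mu = Fraction(sum(population), N)
sigma2 = sum((x - mu) ** 2 for x in population) / N  # true variance = 5/4


def sum_sq_dev(sample):
    """Sum of squared deviations from the sample's own mean."""
    m = Fraction(sum(sample), len(sample))
    return sum((x - m) ** 2 for x in sample)


n = 2
# Every equally likely sample of size n, drawn with replacement.
samples = list(product(population, repeat=n))

avg_div_n = sum(sum_sq_dev(s) / n for s in samples) / len(samples)
avg_div_n_minus_1 = sum(sum_sq_dev(s) / (n - 1) for s in samples) / len(samples)

print(sigma2, avg_div_n, avg_div_n_minus_1)  # 5/4 5/8 5/4
```

Averaged over all samples, dividing by $n$ gives only half the true variance in this small case, while dividing by $n-1$ recovers it exactly.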
Sample variance and standard deviation
- Data: , , .
- Squared deviations: , , , , .
- Sum of squared deviations .
- Sample variance: .
- Sample standard deviation: .
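The steps of this calculation can be traced one at a time in Python. The data are hypothetical illustrative values, not necessarily those of the worked example.

```python
import math

# Hypothetical values, chosen so the arithmetic is easy to follow.
data = [2, 4, 4, 6, 9]

n = len(data)                                # 5
x_bar = sum(data) / n                        # 25 / 5 = 5.0
sq_devs = [(x - x_bar) ** 2 for x in data]   # [9.0, 1.0, 1.0, 1.0, 16.0]
ss = sum(sq_devs)                            # 28.0
s2 = ss / (n - 1)                            # 28 / 4 = 7.0
s = math.sqrt(s2)                            # about 2.646

print(x_bar, sq_devs, ss, s2, round(s, 3))
```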
Data visualization
Histograms group continuous data into bins (intervals) and plot the frequency or density of each bin. They reveal the shape of a distribution: symmetric, right-skewed, left-skewed, bimodal, or uniform. The shape guides which summary statistics are appropriate.
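The binning step behind a histogram can be sketched without a plotting library. The helper `bin_counts`, the bin edges, and the reaction-time data are hypothetical. A real analysis would more likely use a plotting package.

```python
from collections import Counter


def bin_counts(values, start, width):
    """Count values in bins [start + k*width, start + (k+1)*width)."""
    counts = Counter((v - start) // width for v in values)
    # Include empty bins between the lowest and highest occupied bin.
    ks = range(min(counts), max(counts) + 1)
    return {start + k * width: counts[k] for k in ks}


# Hypothetical reaction times in milliseconds.
times = [210, 225, 230, 238, 241, 247, 255, 262, 290, 340, 415]
bin_width = 50
hist = bin_counts(times, start=200, width=bin_width)

# Text histogram: one '#' per observation in the bin.
for left_edge, count in hist.items():
    print(f"{left_edge:>4}-{left_edge + bin_width:<4} {'#' * count}")
```

The bars shrink toward the right with a long tail of slow times, which is the right-skewed shape described above.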
Box plots (box-and-whisker plots) display five numbers: minimum, $Q_1$, median, $Q_3$, maximum, and mark outliers as individual points. They make it easy to compare distributions across multiple groups, and they immediately reveal skewness and outliers.
Dot plots and stem-and-leaf plots show every individual value and are useful for small datasets (fewer than ~30 observations). They preserve more information than histograms.
Scatter plots show the joint distribution of two quantitative variables. The pattern — linear, curved, no pattern, tight, loose — guides the choice of model and whether correlation is appropriate.
Skewness and the mean-median relationship: in a right-skewed distribution the mean exceeds the median (the long tail of high values pulls the mean right). In a left-skewed distribution the mean is below the median. For symmetric distributions mean $\approx$ median.
💡Explain it simply
A histogram is like sorting coloured marbles into trays by size: you instantly see which sizes are most common and how the sizes spread out. A box plot is like a quick summary card showing: here is the lowest score, here is the bottom quarter, here is the middle, here is the top quarter, and here is the top. Outliers get their own dots.
Percentiles and the five-number summary
The $p$-th percentile is the value below which approximately $p\%$ of observations fall. The median is the 50th percentile; $Q_1$ is the 25th, $Q_3$ is the 75th.
The five-number summary (minimum, $Q_1$, median, $Q_3$, maximum) concisely describes the center and spread of a distribution and is the foundation of the box plot.
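$\text{IQR} = Q_3 - Q_1$ measures the spread of the central half of the data. The standard outlier rule: a value is a (mild) outlier if it lies below $Q_1 - 1.5 \times \text{IQR}$ or above $Q_3 + 1.5 \times \text{IQR}$, and an extreme outlier if it lies below $Q_1 - 3 \times \text{IQR}$ or above $Q_3 + 3 \times \text{IQR}$.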
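A minimal sketch of the five-number summary and the IQR outlier rule follows. The function names and data are hypothetical, and the quartiles use the `statistics` module's "inclusive" method, which is only one of several conventions in use.

```python
import statistics


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using 'inclusive' quartiles."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)


def iqr_outliers(values, k=1.5):
    """Values below Q1 - k*IQR or above Q3 + k*IQR (k=3 for extreme)."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical data with one suspiciously large value.
data = [1, 3, 4, 5, 5, 6, 7, 8, 30]

summary = five_number_summary(data)   # (1, 4.0, 5.0, 7.0, 30)
mild = iqr_outliers(data)             # fences at -0.5 and 11.5
extreme = iqr_outliers(data, k=3)     # fences at -5.0 and 16.0

print(summary, mild, extreme)
```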
Percentiles are used in standardised testing (your score falls in a given percentile of test-takers), growth charts (a child's height sits at a given percentile), and risk management (a high percentile of the loss distribution).
💡Explain it simply
If you score in the $p$-th percentile on a test, about $p\%$ of test-takers scored below you. You didn't necessarily get $p\%$ of the questions right: percentile rank is about where you stand relative to other people, not your raw score.
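The difference between a raw score and a percentile rank can be shown with a small sketch. The function `percentile_rank` and the scores are hypothetical, and the "strictly below" definition used here is one common convention among several.

```python
def percentile_rank(scores, score):
    """Percentage of scores strictly below `score`."""
    below = sum(1 for s in scores if s < score)
    return 100 * below / len(scores)


# Hypothetical raw scores (out of 100) for ten test-takers.
scores = [35, 40, 42, 48, 50, 51, 55, 58, 60, 72]

# A raw score of 60 means 60% of questions right,
# but 8 of the 10 test-takers scored below it.
rank = percentile_rank(scores, 60)  # 80.0
print(rank)
```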
Choosing the right summary
Symmetric distributions with no outliers: use mean and standard deviation. They use all the data and have convenient mathematical properties.
Skewed distributions or data with outliers: use median and IQR. These are resistant to the extreme values that distort the mean and standard deviation.
Always visualise your data before computing summaries. Summary statistics can be deceptive: Anscombe's quartet shows four datasets with nearly identical means, variances, and correlations but completely different shapes.
For categorical (nominal) data, the mode is the only sensible measure of center; for ordinal data the median is also meaningful. Report counts and proportions rather than means.
💡Explain it simply
Mean and standard deviation work beautifully for nice, symmetric data. But if your data has extreme outliers (like salaries including a billionaire), they are misleading. In that case, use median and IQR — they just describe the typical middle without being dragged off by extremes.
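The salary case can be checked numerically. The figures below are hypothetical, and the quartiles use the `statistics` module's "inclusive" method, so other quartile conventions would give slightly different IQR values.

```python
import statistics

# Hypothetical annual salaries; the last value is an extreme outlier.
salaries = [40_000, 45_000, 50_000, 55_000, 60_000, 1_000_000_000]

mean = statistics.mean(salaries)      # dragged up to roughly 167 million
sd = statistics.stdev(salaries)       # also dominated by the outlier
median = statistics.median(salaries)  # (50_000 + 55_000) / 2 = 52_500
q1, _, q3 = statistics.quantiles(salaries, n=4, method="inclusive")
iqr = q3 - q1                         # 58_750 - 46_250 = 12_500

print(f"mean={mean:,.0f}  sd={sd:,.0f}")
print(f"median={median:,.0f}  IQR={iqr:,.0f}")
```

The median and IQR stay in the range of the five ordinary salaries, while the mean and standard deviation describe no one in the sample.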
Common Mistakes to Avoid
- Confusing population variance ($\sigma^2$, divide by $N$) with sample variance ($s^2$, divide by $n-1$). In virtually all real applications you have a sample, so use $n-1$.
- Reporting the mean for heavily skewed data. For income, house prices, or any right-skewed data, the median better represents the typical value.
- Forgetting to sort data before computing the median or quartiles.
- Interpreting standard deviation as a percentage. It has the same units as the raw data. A standard deviation reported in kg means typical values deviate by about that many kg from the mean, not by that percentage.
- Using the range as the primary measure of spread. One outlier distorts it completely. Prefer IQR or standard deviation.