Descriptive Statistics

Measures of center, spread, shape, and data visualization.

Descriptive statistics is the art of summarizing raw data into meaningful numbers and visuals. Before you can make predictions or test theories, you need to know what your data looks like.

The two big questions are always the same: where is the center of the data, and how spread out is it? Answering these two questions well tells you most of what you need to know.

We will cover measures of central tendency (mean, median, mode), measures of spread (range, variance, standard deviation), and how to visualize distributions.

Measures of central tendency

The mean (arithmetic average) is \(\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i\). It uses every data point, which makes it sensitive to extreme values (outliers). One unusually large value can pull the mean far from the typical observation.

The median is the middle value when data is sorted in order. For an even number of observations, it is the average of the two middle values. The median is resistant to outliers: adding one extreme value does not change it much. This makes it the preferred measure of center for skewed data such as income, house prices, or reaction times.
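To make the contrast concrete, here is a minimal sketch using Python's standard `statistics` module. The salary figures (in thousands) are invented for illustration only.

```python
from statistics import mean, median

# Hypothetical salaries in thousands; the values are made up for illustration.
salaries = [42, 45, 48, 50, 55]
with_outlier = salaries + [5000]  # add one extreme value

mean_before, median_before = mean(salaries), median(salaries)
mean_after, median_after = mean(with_outlier), median(with_outlier)

print("without outlier:", mean_before, median_before)
print("with outlier:   ", round(mean_after, 1), median_after)
```

In this made-up sample the single extreme value moves the mean from 48 to roughly 873, while the median only moves from 48 to 49.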

The mode is the most frequently occurring value. A dataset can be unimodal (one mode), bimodal (two modes), or multimodal. The mode is the only measure of center applicable to categorical data — you can find the most common hair colour, but a 'mean hair colour' is meaningless.
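As a small illustration, the standard library's `statistics.mode` and `statistics.multimode` (Python 3.8+) work on categorical values as well as numbers. The sample lists below are invented.

```python
from statistics import mode, multimode

# Categorical data: a mean is undefined, but the mode is not.
hair = ["brown", "black", "brown", "blond", "black", "brown"]
most_common = mode(hair)

# multimode returns every value tied for the highest count,
# in the order each value first appears.
shoe_sizes = [38, 39, 39, 41, 42, 42]
modes = multimode(shoe_sizes)

print(most_common)  # brown
print(modes)        # [39, 42] -> this sample is bimodal
```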

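Relationship to skewness: for a right-skewed (positively skewed) unimodal distribution, the mean is typically greater than the median, which is typically greater than the mode. For a left-skewed distribution the order typically reverses. This is a rule of thumb rather than a guarantee; it can fail for discrete or multimodal data. For a perfectly symmetric, unimodal distribution, all three coincide.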
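A quick check of this ordering on two tiny invented samples, one the mirror image of the other. This is only an illustration of the usual pattern, not a proof that it always holds.

```python
from statistics import mean, median, mode

# Made-up sample with a long right tail.
right_skewed = [1, 2, 2, 3, 4, 12]
# Mirror image (13 - x), so the long tail is on the left.
left_skewed = [13 - x for x in right_skewed]

right = (mean(right_skewed), median(right_skewed), mode(right_skewed))
left = (mean(left_skewed), median(left_skewed), mode(left_skewed))

print(right)  # (4, 2.5, 2):   mean > median > mode
print(left)   # (9, 10.5, 11): mean < median < mode
```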

The weighted mean is \(\bar{x}_w = \frac{\sum w_i x_i}{\sum w_i}\). It is used when observations differ in importance, for example when computing a course grade where assignments, midterm, and final carry different percentage weights.
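The course-grade case can be sketched as follows. The helper name `weighted_mean`, the scores, and the 30/30/40 weighting are all assumptions chosen for illustration.

```python
def weighted_mean(values, weights):
    """Return sum(w_i * x_i) / sum(w_i)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for w, x in zip(weights, values)) / total_weight

# Hypothetical scores for assignments, midterm, and final,
# weighted 30%, 30%, and 40%.
scores = [90, 70, 80]
weights = [30, 30, 40]
grade = weighted_mean(scores, weights)

print(grade)  # 80.0
```

Because the formula divides by the sum of the weights, they do not need to add up to 100 or to 1.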

Measures of spread

The range \(= \max - \min\) is the simplest measure of spread. It uses only two data points and is badly distorted by a single outlier. If the maximum salary in a sample is a CEO at $5,000,000, the range tells you almost nothing about the typical spread.

Variance measures the average squared deviation from the mean. Population variance: \(\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i-\mu)^2\). Sample variance uses \(n-1\) in the denominator: \(s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2\). The \(n-1\) correction (Bessel's correction) makes \(s^2\) an unbiased estimator of \(\sigma^2\).
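The two formulas differ only in the denominator. The sketch below writes both out by hand and compares them with `statistics.pvariance` and `statistics.variance` from the standard library; the data values are invented.

```python
from math import isclose
from statistics import pvariance, variance

def population_variance(xs):
    """Average squared deviation, dividing by N."""
    mu = sum(xs) / len(xs)
    return sum((x - mu) ** 2 for x in xs) / len(xs)

def sample_variance(xs):
    """Squared deviations divided by n - 1 (Bessel's correction)."""
    n = len(xs)
    xbar = sum(xs) / n
    return sum((x - xbar) ** 2 for x in xs) / (n - 1)

data = [2, 4, 4, 4, 5, 5, 7, 9]  # mean is 5, squared deviations sum to 32

pop_var = population_variance(data)  # 32 / 8
samp_var = sample_variance(data)     # 32 / 7

print(pop_var, round(samp_var, 4))
print(isclose(pop_var, pvariance(data)), isclose(samp_var, variance(data)))
```

The sample variance is always the larger of the two, and the gap shrinks as \(n\) grows.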

Standard deviation \(s = \sqrt{s^2}\) restores the original units. A standard deviation of \(10\) kg means the typical observation is about \(10\) kg away from the mean. Unlike variance, you can directly compare the standard deviation to the data values.
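A short check that the standard deviation is just the square root of the variance, using `statistics.stdev` (sample) and `statistics.pstdev` (population). The measurements are invented and assumed to be in kg.

```python
from math import isclose, sqrt
from statistics import pstdev, stdev, variance

# Hypothetical measurements in kg.
data_kg = [2, 4, 4, 4, 5, 5, 7, 9]

s = stdev(data_kg)       # sample standard deviation, in kg
sigma = pstdev(data_kg)  # population standard deviation, in kg

print(round(s, 3), sigma)
print(isclose(s, sqrt(variance(data_kg))))  # variance is in kg^2, s is in kg
```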

The interquartile range \(\text{IQR} = Q_3 - Q_1\) captures the spread of the middle \(50\%\) of the data. It is unaffected by the most extreme values and is the natural measure of spread to pair with the median.
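The sketch below computes the IQR with `statistics.quantiles` and shows that replacing the largest value with an extreme one leaves it unchanged. Note that several conventions exist for computing quartiles; this uses the function's default (`method="exclusive"`), so other tools may give slightly different values. The helper name `iqr` is an assumption.

```python
from statistics import quantiles

def iqr(values):
    """Q3 - Q1, using the default quartile convention of statistics.quantiles."""
    q1, _median, q3 = quantiles(values, n=4)
    return q3 - q1

data = list(range(1, 12))          # 1, 2, ..., 11
with_outlier = data[:-1] + [1000]  # replace the maximum with an extreme value

print(iqr(data), max(data) - min(data))                          # 6.0 10
print(iqr(with_outlier), max(with_outlier) - min(with_outlier))  # 6.0 999
```

The range jumps from 10 to 999, while the IQR stays at 6.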

The coefficient of variation \(\text{CV} = s/\bar{x}\) expresses the standard deviation as a fraction of the mean. It allows comparing spread between datasets measured on different scales, e.g., heights of adults vs. heights of buildings. It is only meaningful for positive, ratio-scale data whose mean is well away from zero.
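As an illustration, the two invented samples below have the same numerical standard deviation (10) but in different units, so only the unit-free CV makes them comparable. The helper name `coefficient_of_variation` is an assumption.

```python
from statistics import mean, stdev

def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean (unit-free)."""
    m = mean(values)
    if m == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return stdev(values) / m

# Made-up data: adult heights in cm, building heights in m.
adult_heights_cm = [160, 170, 180]
building_heights_m = [10, 20, 30]

cv_adults = coefficient_of_variation(adult_heights_cm)
cv_buildings = coefficient_of_variation(building_heights_m)

print(round(cv_adults, 3), round(cv_buildings, 3))  # 0.059 0.5
```

Relative to their means, the building heights in this sample vary far more than the adult heights.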

Data visualization

Histograms group continuous data into bins (intervals) and plot the frequency or density of each bin. They reveal the shape of a distribution: symmetric, right-skewed, left-skewed, bimodal, or uniform. The shape guides which summary statistics are appropriate.
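The binning step behind a histogram can be sketched in plain Python. The function name `histogram_counts`, the equal-width bins, and the data are assumptions for illustration; plotting libraries choose bins with their own rules.

```python
def histogram_counts(values, low, high, n_bins):
    """Count values in n_bins equal-width bins covering [low, high].

    Each bin is half-open, [left, right), except the last one,
    which also includes `high`. Values outside the range are skipped.
    """
    width = (high - low) / n_bins
    counts = [0] * n_bins
    for v in values:
        if not low <= v <= high:
            continue
        index = min(int((v - low) / width), n_bins - 1)
        counts[index] += 1
    return counts

data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]
counts = histogram_counts(data, low=0, high=10, n_bins=5)

# Text rendering: one '#' per observation in each bin.
for i, c in enumerate(counts):
    print(f"[{2 * i:>2}, {2 * i + 2:>2}) {'#' * c}")
```

The counts come out as `[1, 5, 3, 0, 1]`: most values cluster between 2 and 6, with one isolated value in the top bin, which is the kind of right-skewed shape a histogram makes visible.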

Box plots (box-and-whisker plots) display the five-number summary: minimum, \(Q_1\), median, \(Q_3\), and maximum. When outliers are marked as individual points, the whiskers usually stop at the most extreme values that are not outliers. Box plots make it easy to compare distributions across multiple groups, and they immediately reveal skewness and outliers.

Dot plots and stem-and-leaf plots show every individual value and are useful for small datasets (fewer than ~30 observations). They preserve more information than histograms.

Scatter plots show the joint distribution of two quantitative variables. The pattern — linear, curved, no pattern, tight, loose — guides the choice of model and whether correlation is appropriate.
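To see why the pattern matters before reaching for correlation, the sketch below computes Pearson's \(r\) by hand for a perfectly linear and a perfectly curved relationship. The function name `pearson_r` and the data are illustrative assumptions.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)  # fails if either variable is constant

xs = [-2, -1, 0, 1, 2]
linear = [2 * x + 1 for x in xs]  # straight line
curved = [x ** 2 for x in xs]     # perfect U-shape

r_linear = pearson_r(xs, linear)
r_curved = pearson_r(xs, curved)

print(r_linear, r_curved)
```

Here `r_curved` is zero even though `curved` is completely determined by `xs`. A scatter plot would show the U-shape immediately, whereas the correlation alone suggests no relationship.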

Skewness and the mean-median relationship: in a right-skewed distribution the mean exceeds the median (the long tail of high values pulls the mean right). In a left-skewed distribution the mean is below the median. For symmetric distributions, mean \(\approx\) median.

Percentiles and the five-number summary

The \(p\)-th percentile is the value below which approximately \(p\%\) of observations fall. The median is the 50th percentile; \(Q_1\) is the 25th, \(Q_3\) is the 75th.
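One common way to compute a percentile is linear interpolation between the two nearest sorted values. The sketch below assumes that convention (other definitions exist and can give slightly different answers on small samples); the function name `percentile` and the data are illustrative.

```python
def percentile(values, p):
    """p-th percentile (0 <= p <= 100) by linear interpolation
    between the closest ranks of the sorted data."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 <= p <= 100:
        raise ValueError("p must be between 0 and 100")
    xs = sorted(values)
    rank = (p / 100) * (len(xs) - 1)  # fractional index into xs
    lower = int(rank)
    upper = min(lower + 1, len(xs) - 1)
    fraction = rank - lower
    return xs[lower] + fraction * (xs[upper] - xs[lower])

data = [50, 15, 40, 20, 35]

q1 = percentile(data, 25)
med = percentile(data, 50)
q3 = percentile(data, 75)

print(q1, med, q3)  # 20.0 35.0 40.0
```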

The five-number summary \(\{\min, Q_1, \text{median}, Q_3, \max\}\) concisely describes the center and spread of a distribution and is the foundation of the box plot.
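A minimal sketch of the five-number summary using `statistics.quantiles`. It assumes the `"inclusive"` quartile method, which treats the data's minimum and maximum as the 0th and 100th percentiles; the function name `five_number_summary` and the data are illustrative.

```python
from statistics import quantiles

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a non-empty numeric sample."""
    q1, med, q3 = quantiles(values, n=4, method="inclusive")
    return (min(values), q1, med, q3, max(values))

data = [7, 1, 3, 9, 5, 2, 8, 4, 6]
summary = five_number_summary(data)

print(summary)  # (1, 3.0, 5.0, 7.0, 9)
```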

\(\text{IQR} = Q_3 - Q_1\) measures the spread of the central half of the data. The standard outlier rule: a value is a (mild) outlier if it lies below \(Q_1 - 1.5\,\text{IQR}\) or above \(Q_3 + 1.5\,\text{IQR}\), and an extreme outlier if it lies below \(Q_1 - 3\,\text{IQR}\) or above \(Q_3 + 3\,\text{IQR}\).
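The rule translates directly into code. In the sketch below, the function name `classify_outlier`, the labels it returns, and the data are assumptions; the quartiles come from `statistics.quantiles` with its default method, so other quartile conventions may move the fences slightly.

```python
from statistics import quantiles

def classify_outlier(value, q1, q3):
    """Label a value 'extreme', 'mild', or 'none' using the 1.5 and 3 IQR fences."""
    iqr = q3 - q1
    if value < q1 - 3 * iqr or value > q3 + 3 * iqr:
        return "extreme"
    if value < q1 - 1.5 * iqr or value > q3 + 1.5 * iqr:
        return "mild"
    return "none"

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 20, 30]
q1, _median, q3 = quantiles(data, n=4)  # Q1 = 3.0, Q3 = 9.0, so IQR = 6.0

labels = {x: classify_outlier(x, q1, q3) for x in data}
flagged = {x: label for x, label in labels.items() if label != "none"}

print(flagged)  # {20: 'mild', 30: 'extreme'}
```

With these quartiles the mild fences sit at -6 and 18 and the extreme fences at -15 and 27, so 20 is a mild outlier and 30 an extreme one.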

Percentiles are used in standardised testing (your score falls in the 85th percentile), growth charts (a child's height is in the 60th percentile), and risk management (the 99th percentile of losses).

Choosing the right summary

Symmetric distributions with no outliers: use mean and standard deviation. They use all the data and have convenient mathematical properties.

Skewed distributions or data with outliers: use median and IQR. These are resistant to the extreme values that distort the mean and standard deviation.

Always visualise your data before computing summaries. Summary statistics can be deceptive: Anscombe's quartet shows four datasets with nearly identical means, variances, and correlations but completely different shapes when plotted.

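For nominal (unordered) categorical data, the mode is the only sensible measure of center. For ordinal data, the median is also meaningful because the categories can be ranked. In both cases, report counts and proportions rather than means.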
