Orthogonality and Least Squares

Orthogonality — the generalisation of perpendicularity to any number of dimensions — is one of the most powerful structures in linear algebra. Orthogonal vectors carry independent information; orthonormal bases make coordinate computations trivial.

The central application is the least-squares problem: when a system $A\mathbf{x}=\mathbf{b}$ has no solution (typically because there are more equations than unknowns), find the $\hat{\mathbf{x}}$ that minimises $\|A\hat{\mathbf{x}}-\mathbf{b}\|$. The answer is characterised by a projection: $A\hat{\mathbf{x}}$ is the projection of $\mathbf{b}$ onto the column space of $A$.

Gram-Schmidt orthogonalisation and the QR decomposition are the computational engines. They appear in numerical analysis, statistics (regression), signal processing, and machine learning.

Inner product, length, and orthogonality

In $\mathbb{R}^n$ , the inner product (dot product) is $\mathbf{u}\cdot\mathbf{v} = \mathbf{u}^T\mathbf{v} = \sum_i u_iv_i$ . The length is $\|\mathbf{v}\|=\sqrt{\mathbf{v}\cdot\mathbf{v}}$ .

Two vectors are orthogonal if $\mathbf{u}\cdot\mathbf{v}=0$. A unit vector has $\|\mathbf{v}\|=1$. An orthonormal set is an orthogonal set in which every vector has unit length.

The orthogonal complement of a subspace $W$ is $W^\perp = \{\mathbf{v}: \mathbf{v}\cdot\mathbf{w}=0\text{ for all }\mathbf{w}\in W\}$. Key relationships: $(\text{Row}(A))^\perp=\text{Null}(A)$ and $(\text{Col}(A))^\perp=\text{Null}(A^T)$.
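A small worked check of the first relationship may help. The matrix below is chosen purely for illustration and is not part of the original text.

$$A=\begin{pmatrix}1&1&0\\0&1&1\end{pmatrix},\qquad A\mathbf{x}=\mathbf{0}\iff x_1=-x_2,\; x_3=-x_2,\qquad \text{Null}(A)=\text{span}\{\langle 1,-1,1\rangle\}.$$

$$\langle 1,-1,1\rangle\cdot\langle 1,1,0\rangle = 1-1+0 = 0,\qquad \langle 1,-1,1\rangle\cdot\langle 0,1,1\rangle = 0-1+1 = 0.$$

The null-space vector is orthogonal to both rows, and therefore to every linear combination of them, which is exactly the statement $\text{Null}(A)\subseteq(\text{Row}(A))^\perp$. The dimensions also agree: $\text{Row}(A)$ has dimension $2$ and its complement in $\mathbb{R}^3$ has dimension $1$.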

For any subspace $W$ of $\mathbb{R}^n$, every $\mathbf{y}\in\mathbb{R}^n$ can be uniquely split as $\mathbf{y} = \hat{\mathbf{y}} + \mathbf{z}$ where $\hat{\mathbf{y}}\in W$ and $\mathbf{z}\in W^\perp$. This is the orthogonal decomposition theorem.

The Pythagorean theorem in $\mathbb{R}^n$ : if $\mathbf{u}\perp\mathbf{v}$ , then $\|\mathbf{u}+\mathbf{v}\|^2=\|\mathbf{u}\|^2+\|\mathbf{v}\|^2$ . Orthogonality and right angles are the same idea in any dimension.
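The definitions above translate almost directly into code. The following is a minimal sketch in plain Python; the helper names `dot` and `norm` and the sample vectors are illustrative choices, not part of the original text.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(a * b for a, b in zip(u, v))

def norm(v):
    """Euclidean length, defined as sqrt(v . v)."""
    return math.sqrt(dot(v, v))

# Two sample vectors in R^4 (illustrative values).
u = [1.0, 2.0, -1.0, 0.0]
v = [2.0, -1.0, 0.0, 5.0]

# u . v = 2 - 2 + 0 + 0 = 0, so u and v are orthogonal.
is_orthogonal = math.isclose(dot(u, v), 0.0, abs_tol=1e-12)

# Pythagorean theorem: ||u + v||^2 == ||u||^2 + ||v||^2 when u is orthogonal to v.
w = [a + b for a, b in zip(u, v)]
pythagoras_holds = math.isclose(norm(w) ** 2, norm(u) ** 2 + norm(v) ** 2)

print(is_orthogonal, pythagoras_holds)
```

The comparisons use `math.isclose` rather than `==` because square roots introduce floating-point rounding.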

💡Explain it simply

Orthogonal vectors are like perpendicular compass directions — north and east carry completely independent information. Orthonormal vectors add the requirement that each direction has length exactly $1$ . Working in an orthonormal basis is like having a perfectly square grid — every coordinate reads off cleanly with a single dot product.

Orthogonal sets and projections

An orthogonal set of nonzero vectors is automatically linearly independent — each vector points in a direction not reachable by combining the others.
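The reason can be written out in one line. Suppose $\{\mathbf{u}_1,\ldots,\mathbf{u}_p\}$ is an orthogonal set of nonzero vectors and $c_1\mathbf{u}_1+\cdots+c_p\mathbf{u}_p=\mathbf{0}$. Taking the dot product of both sides with $\mathbf{u}_j$ kills every term except the $j$-th:

$$0 = \mathbf{u}_j\cdot\left(c_1\mathbf{u}_1+\cdots+c_p\mathbf{u}_p\right) = c_j\,(\mathbf{u}_j\cdot\mathbf{u}_j) = c_j\|\mathbf{u}_j\|^2 .$$

Since $\mathbf{u}_j\neq\mathbf{0}$, we have $\|\mathbf{u}_j\|^2>0$, so $c_j=0$ for every $j$. Only the trivial combination gives the zero vector, which is the definition of linear independence.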

The orthogonal projection of $\mathbf{y}$ onto a subspace $W$ is the unique closest point in $W$ to $\mathbf{y}$ . The error $\mathbf{y}-\hat{\mathbf{y}}$ is perpendicular to $W$ : $\mathbf{y}-\hat{\mathbf{y}}\in W^\perp$ .

If $\{\mathbf{u}_1,\ldots,\mathbf{u}_p\}$ is an orthonormal basis for $W$ , the projection formula simplifies to $\hat{\mathbf{y}} = (\mathbf{y}\cdot\mathbf{u}_1)\mathbf{u}_1 + \cdots + (\mathbf{y}\cdot\mathbf{u}_p)\mathbf{u}_p$ . Each coordinate is just a dot product — no system of equations to solve.

The projection matrix onto $W$ (when $A$ has columns forming a basis for $W$ ) is $P = A(A^TA)^{-1}A^T$ . Note: $P^2=P$ (idempotent) and $P^T=P$ (symmetric).
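Both properties follow from the formula by direct computation, assuming (as above) that the columns of $A$ are linearly independent so that $A^TA$ is invertible:

$$P^2 = A(A^TA)^{-1}\underbrace{A^TA\,(A^TA)^{-1}}_{I}A^T = A(A^TA)^{-1}A^T = P,$$

$$P^T = \left(A(A^TA)^{-1}A^T\right)^T = A\left((A^TA)^{-1}\right)^TA^T = A\left((A^TA)^T\right)^{-1}A^T = A(A^TA)^{-1}A^T = P.$$

The second line uses the fact that $A^TA$ is symmetric. Geometrically, $P^2=P$ says that projecting a vector that already lies in $W$ leaves it unchanged.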

💡Explain it simply

Projecting $\mathbf{y}$ onto $W$ is like finding your shadow on a surface when the sun is directly overhead. The shadow is the point on the surface closest to you. The vector from shadow to you (the error) is perpendicular to the surface.

Projection onto a plane in $\mathbb{R}^3$

Project $\mathbf{y}=\langle 1,2,3\rangle$ onto $W=\text{span}\{\mathbf{u}_1,\mathbf{u}_2\}$ with $\mathbf{u}_1=\langle 1,0,0\rangle$ , $\mathbf{u}_2=\langle 0,1,0\rangle$ (the $xy$ -plane).
$\hat{\mathbf{y}} = (\mathbf{y}\cdot\mathbf{u}_1)\mathbf{u}_1 + (\mathbf{y}\cdot\mathbf{u}_2)\mathbf{u}_2 = 1\cdot\langle 1,0,0\rangle + 2\cdot\langle 0,1,0\rangle = \langle 1,2,0\rangle$ .
Error: $\mathbf{y}-\hat{\mathbf{y}} = \langle 0,0,3\rangle$ . Check orthogonality to $W$ : $\langle 0,0,3\rangle\cdot\langle 1,0,0\rangle=0$ ✓, $\langle 0,0,3\rangle\cdot\langle 0,1,0\rangle=0$ ✓.
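The same computation can be sketched in a few lines of Python. The function name `project_onto_orthonormal` is a hypothetical helper introduced here, and it assumes the basis passed in really is orthonormal (it does not check).

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def project_onto_orthonormal(y, basis):
    """Project y onto span(basis).

    Assumes `basis` is an orthonormal list of vectors, so each
    coefficient is a single dot product.
    """
    y_hat = [0.0] * len(y)
    for u in basis:
        coeff = dot(y, u)
        y_hat = [yh + coeff * ui for yh, ui in zip(y_hat, u)]
    return y_hat

# The worked example: project y onto the xy-plane.
y = [1.0, 2.0, 3.0]
basis = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

y_hat = project_onto_orthonormal(y, basis)
error = [a - b for a, b in zip(y, y_hat)]

print(y_hat)  # [1.0, 2.0, 0.0]
print(error)  # [0.0, 0.0, 3.0]
print([dot(error, u) for u in basis])  # [0.0, 0.0]
```

If the basis were only orthogonal, each coefficient would need to be divided by $\mathbf{u}_k\cdot\mathbf{u}_k$.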

The Gram-Schmidt process

Gram-Schmidt converts any linearly independent set $\{\mathbf{x}_1,\ldots,\mathbf{x}_p\}$ into an orthonormal basis $\{\mathbf{u}_1,\ldots,\mathbf{u}_p\}$ for the same subspace.

Procedure: for each new vector $\mathbf{x}_k$ , subtract its projections onto all previously constructed $\mathbf{v}_j$ , then normalise. Formally: $\mathbf{v}_k = \mathbf{x}_k - \sum_{j=1}^{k-1}\frac{\mathbf{x}_k\cdot\mathbf{v}_j}{\mathbf{v}_j\cdot\mathbf{v}_j}\mathbf{v}_j$ , then $\mathbf{u}_k = \mathbf{v}_k/\|\mathbf{v}_k\|$ .

The QR decomposition: if $A=[\mathbf{x}_1\;\cdots\;\mathbf{x}_p]$ has linearly independent columns, Gram-Schmidt produces $A=QR$ where $Q=[\mathbf{u}_1\;\cdots\;\mathbf{u}_p]$ has orthonormal columns and $R=Q^TA$ is upper triangular with positive diagonal entries. QR factorisation is also the basic step of the QR algorithm, a standard numerical method for computing eigenvalues.

Why subtract projections? Each new vector $\mathbf{v}_k$ must be orthogonal to all previous $\mathbf{v}_j$ . The projection $\text{proj}_{\mathbf{v}_j}\mathbf{x}_k$ is the component of $\mathbf{x}_k$ that lies in the direction of $\mathbf{v}_j$ . Subtracting it removes all overlap.
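The procedure can be sketched directly from the formula for $\mathbf{v}_k$. This is a minimal illustration of classical Gram-Schmidt in plain Python; the function name `gram_schmidt`, the tolerance `tol`, and the sample vectors are assumptions made for the example. In floating-point practice, the modified Gram-Schmidt variant or a Householder-based QR routine is usually preferred for numerical stability.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def gram_schmidt(xs, tol=1e-12):
    """Return an orthonormal basis for span(xs).

    Classical Gram-Schmidt: subtract from x_k its projection onto
    every previously built v_j, then normalise.
    """
    vs = []  # orthogonal (unnormalised) vectors v_1, ..., v_k
    us = []  # orthonormal vectors u_1, ..., u_k
    for x in xs:
        v = list(x)
        for vj in vs:
            coeff = dot(x, vj) / dot(vj, vj)
            v = [vi - coeff * vji for vi, vji in zip(v, vj)]
        length = math.sqrt(dot(v, v))
        if length < tol:
            raise ValueError("input vectors are linearly dependent")
        vs.append(v)
        us.append([vi / length for vi in v])
    return us

# Three independent vectors in R^3 (illustrative values).
xs = [[1.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
basis = gram_schmidt(xs)

# Every pair should have dot product 0, and every vector length 1.
for i, ui in enumerate(basis):
    print([round(dot(ui, uj), 12) for uj in basis])
```

The inner loop runs over all previous `vs`, not only the most recent one, which is the point made above about removing all overlap.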

💡Explain it simply

Gram-Schmidt builds a clean coordinate system one axis at a time. The first axis is just the first vector, normalised. The second axis is the second vector with its shadow onto the first axis subtracted — that makes it perpendicular. The third strips out shadows on both previous axes. Each step removes overlap until you have perfectly perpendicular, unit-length axes.

Gram-Schmidt in $\mathbb{R}^3$

Orthogonalise $\mathbf{x}_1=\langle 1,1,0\rangle$ , $\mathbf{x}_2=\langle 1,0,1\rangle$ .
Step 1: $\mathbf{v}_1=\mathbf{x}_1=\langle 1,1,0\rangle$ . Normalise: $\|\mathbf{v}_1\|=\sqrt{2}$ , $\mathbf{u}_1=\langle 1/\sqrt{2},1/\sqrt{2},0\rangle$ .
Step 2: $\text{proj}_{\mathbf{v}_1}\mathbf{x}_2 = \frac{\mathbf{x}_2\cdot\mathbf{v}_1}{\mathbf{v}_1\cdot\mathbf{v}_1}\mathbf{v}_1 = \frac{1}{2}\langle 1,1,0\rangle = \langle 1/2,1/2,0\rangle$ .
$\mathbf{v}_2 = \mathbf{x}_2 - \langle 1/2,1/2,0\rangle = \langle 1/2,-1/2,1\rangle$ .
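Normalise: $\|\mathbf{v}_2\|=\sqrt{1/4+1/4+1}=\sqrt{3/2}$, $\mathbf{u}_2=\mathbf{v}_2/\sqrt{3/2}=\langle 1/\sqrt{6},-1/\sqrt{6},2/\sqrt{6}\rangle$.
Check: $\mathbf{v}_1\cdot\mathbf{v}_2 = \tfrac12-\tfrac12+0 = 0$, so $\mathbf{u}_1\cdot\mathbf{u}_2 = 0$ ✓.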
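Continuing this example, the QR factors of $A=[\mathbf{x}_1\;\mathbf{x}_2]$ can be written out explicitly. The entries of $R$ are the dot products $r_{jk}=\mathbf{u}_j\cdot\mathbf{x}_k$, since $R=Q^TA$ whenever $Q$ has orthonormal columns:

$$Q=\begin{pmatrix}1/\sqrt{2}&1/\sqrt{6}\\1/\sqrt{2}&-1/\sqrt{6}\\0&2/\sqrt{6}\end{pmatrix},\qquad R=\begin{pmatrix}\mathbf{u}_1\cdot\mathbf{x}_1&\mathbf{u}_1\cdot\mathbf{x}_2\\0&\mathbf{u}_2\cdot\mathbf{x}_2\end{pmatrix}=\begin{pmatrix}\sqrt{2}&1/\sqrt{2}\\0&\sqrt{3/2}\end{pmatrix}.$$

Multiplying back gives the second column $\tfrac{1}{\sqrt{2}}\mathbf{u}_1+\sqrt{3/2}\,\mathbf{u}_2=\langle \tfrac12,\tfrac12,0\rangle+\langle \tfrac12,-\tfrac12,1\rangle=\langle 1,0,1\rangle=\mathbf{x}_2$, as expected. The diagonal entries of $R$ are the lengths $\|\mathbf{v}_1\|$ and $\|\mathbf{v}_2\|$ found in Steps 1 and 2.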

Least-squares problems

When $A\mathbf{x}=\mathbf{b}$ is inconsistent (no exact solution), the least-squares solution $\hat{\mathbf{x}}$ minimises the squared residual norm $\|A\mathbf{x}-\mathbf{b}\|^2$ over all $\mathbf{x}$ (equivalently, it minimises $\|A\mathbf{x}-\mathbf{b}\|$). It is not an exact solution; it is the best approximation in this sense.

Geometric insight: the closest point in $\text{Col}(A)$ to $\mathbf{b}$ is $A\hat{\mathbf{x}}=\hat{\mathbf{b}}=\text{proj}_{\text{Col}(A)}\mathbf{b}$ . The residual $\mathbf{b}-\hat{\mathbf{b}}$ must be orthogonal to $\text{Col}(A)$ , giving $A^T(\mathbf{b}-A\hat{\mathbf{x}})=\mathbf{0}$ .

Normal equations: $A^TA\hat{\mathbf{x}} = A^T\mathbf{b}$ . If $A$ has linearly independent columns ( $\text{rank}(A)=n$ ), then $A^TA$ is invertible and $\hat{\mathbf{x}}=(A^TA)^{-1}A^T\mathbf{b}$ .
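A small worked example shows the normal equations in action. The matrix and right-hand side below are chosen for illustration only.

$$A=\begin{pmatrix}1&1\\1&2\\1&3\end{pmatrix},\quad \mathbf{b}=\begin{pmatrix}1\\2\\2\end{pmatrix},\qquad A^TA=\begin{pmatrix}3&6\\6&14\end{pmatrix},\quad A^T\mathbf{b}=\begin{pmatrix}5\\11\end{pmatrix}.$$

$$\hat{\mathbf{x}}=(A^TA)^{-1}A^T\mathbf{b}=\frac{1}{6}\begin{pmatrix}14&-6\\-6&3\end{pmatrix}\begin{pmatrix}5\\11\end{pmatrix}=\begin{pmatrix}2/3\\1/2\end{pmatrix}.$$

$$\mathbf{b}-A\hat{\mathbf{x}}=\begin{pmatrix}1\\2\\2\end{pmatrix}-\begin{pmatrix}7/6\\5/3\\13/6\end{pmatrix}=\begin{pmatrix}-1/6\\1/3\\-1/6\end{pmatrix},\qquad A^T(\mathbf{b}-A\hat{\mathbf{x}})=\begin{pmatrix}0\\0\end{pmatrix}.$$

The residual is nonzero, so $\hat{\mathbf{x}}$ does not solve $A\mathbf{x}=\mathbf{b}$, but it is orthogonal to both columns of $A$, which is exactly the condition that produced the normal equations.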

Linear regression: fitting $y=\beta_0+\beta_1 x$ to $m$ data points $(x_i,y_i)$ is a least-squares problem with $A=\begin{pmatrix}1&x_1\\\vdots&\vdots\\1&x_m\end{pmatrix}$ and $\mathbf{b}=\begin{pmatrix}y_1\\\vdots\\y_m\end{pmatrix}$ . The least-squares solution gives the best-fit line.
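For a straight-line fit, $A^TA$ is only $2\times 2$, so the normal equations can be solved in closed form. The sketch below does this in plain Python; the function name `fit_line` and the data points are made up for illustration.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = beta0 + beta1 * x via the normal equations.

    With A = [1 | x], the entries of A^T A are m, sum(x), sum(x^2),
    and the entries of A^T b are sum(y), sum(x*y).
    """
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    m = len(xs)
    sx = sum(xs)
    sxx = sum(x * x for x in xs)
    sy = sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))

    det = m * sxx - sx * sx  # determinant of A^T A
    if det == 0:
        # All x values equal: the columns of A are linearly dependent.
        raise ValueError("A^T A is singular; x values must not all be equal")

    beta0 = (sxx * sy - sx * sxy) / det
    beta1 = (m * sxy - sx * sy) / det
    return beta0, beta1

# Illustrative data points (x_i, y_i).
xs = [0, 1, 2, 3]
ys = [1, 3, 4, 6]
beta0, beta1 = fit_line(xs, ys)

# Residuals b - A x_hat; they should be orthogonal to both columns of A.
residuals = [y - (beta0 + beta1 * x) for x, y in zip(xs, ys)]
print(beta0, beta1)
print(sum(residuals), sum(x * r for x, r in zip(xs, residuals)))
```

For these points the fitted line is approximately $y = 1.1 + 1.6x$. The last line prints the two components of $A^T(\mathbf{b}-A\hat{\mathbf{x}})$, which should be zero up to rounding.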

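The pseudo-inverse: $A^+ = (A^TA)^{-1}A^T$ (when columns are independent) satisfies $A^+A=I$ and $\hat{\mathbf{x}}=A^+\mathbf{b}$. It is a left inverse only: $AA^+$ is the projection matrix onto $\text{Col}(A)$, which equals $I$ only when $A$ is square. The pseudo-inverse generalises the matrix inverse to non-square systems.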
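The identity $A^+A=I$ can be checked exactly on a small case. The sketch below uses Python's `fractions.Fraction` to avoid rounding; the helper names (`transpose`, `matmul`, `inverse_2x2`) and the sample matrix are assumptions for this example, and the code is restricted to matrices with two columns so that a $2\times 2$ inverse suffices.

```python
from fractions import Fraction

def transpose(M):
    return [list(row) for row in zip(*M)]

def matmul(X, Y):
    """Matrix product of X (r x k) and Y (k x c), as nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)]
            for row in X]

def inverse_2x2(M):
    """Inverse of a 2 x 2 matrix via the adjugate formula."""
    (a, b), (c, d) = M
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]

# A 3 x 2 matrix with independent columns and a right-hand side (illustrative).
A = [[Fraction(1), Fraction(1)],
     [Fraction(1), Fraction(2)],
     [Fraction(1), Fraction(3)]]
b = [[Fraction(1)], [Fraction(2)], [Fraction(2)]]

At = transpose(A)
A_plus = matmul(inverse_2x2(matmul(At, A)), At)  # (A^T A)^{-1} A^T

left = matmul(A_plus, A)    # should be the 2 x 2 identity
x_hat = matmul(A_plus, b)   # least-squares solution

print(left)
print(x_hat)
```

Forming $(A^TA)^{-1}$ explicitly is fine for a hand-sized example like this one; numerical libraries typically compute least-squares solutions through QR or the singular value decomposition instead.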

💡Explain it simply

Least squares asks: if I can't hit the target exactly, where should I aim to get as close as possible? The answer is the point in the column space of $A$ closest to $\mathbf{b}$. The normal equations are the algebraic condition that the error vector $\mathbf{b}-A\hat{\mathbf{x}}$ is perpendicular to the column space.

⚠️ Common Mistakes to Avoid

Confusing orthogonal with orthonormal. Orthogonal means perpendicular ( $\mathbf{u}\cdot\mathbf{v}=0$ ). Orthonormal adds the unit-length requirement. The simplified projection formula $\hat{\mathbf{y}}=\sum(\mathbf{y}\cdot\mathbf{u}_k)\mathbf{u}_k$ only works for orthonormal bases.
Forgetting to subtract all previous projections in Gram-Schmidt. Each new vector must be made orthogonal to every previously constructed vector, not just the previous one.
Writing the normal equations as $A\hat{\mathbf{x}}=A^T\mathbf{b}$ or similar. The correct form is $A^TA\hat{\mathbf{x}}=A^T\mathbf{b}$ .
Trying to apply $(A^TA)^{-1}$ when $A$ does not have full column rank. $A^TA$ is invertible iff the columns of $A$ are linearly independent.
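Thinking the least-squares solution satisfies $A\hat{\mathbf{x}}=\mathbf{b}$. In general it does not: it satisfies the normal equations, and $A\hat{\mathbf{x}}=\mathbf{b}$ holds only when $\mathbf{b}$ already lies in $\text{Col}(A)$.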
