← Back to modules

Orthogonality and Least Squares

Orthogonal projections, Gram-Schmidt process, and least-squares fitting.

Orthogonality — the generalisation of perpendicularity to any number of dimensions — is one of the most powerful structures in linear algebra. Orthogonal vectors carry independent information; orthonormal bases make coordinate computations trivial.

The central application is the least-squares problem: when a system Ax=bA\mathbf{x}=\mathbf{b} has no solution (more equations than unknowns), find the x^\hat{\mathbf{x}} that minimises Ax^b\|A\hat{\mathbf{x}}-\mathbf{b}\|. The answer is the projection of b\mathbf{b} onto the column space of AA.

Gram-Schmidt orthogonalisation and the QR decomposition are the computational engines. They appear in numerical analysis, statistics (regression), signal processing, and machine learning.

Inner product, length, and orthogonality

In Rn\mathbb{R}^n, the inner product (dot product) is uv=uTv=iuivi\mathbf{u}\cdot\mathbf{v} = \mathbf{u}^T\mathbf{v} = \sum_i u_iv_i. The length is v=vv\|\mathbf{v}\|=\sqrt{\mathbf{v}\cdot\mathbf{v}}.

Two vectors are orthogonal if uv=0\mathbf{u}\cdot\mathbf{v}=0. A unit vector has v=1\|\mathbf{v}\|=1. An orthonormal set is orthogonal and every vector has unit length.

The orthogonal complement of a subspace WW is W={v:vw=0 for all wW}W^\perp = \{\mathbf{v}: \mathbf{v}\cdot\mathbf{w}=0\text{ for all }\mathbf{w}\in W\}. Key relationship: (Row(A))=Null(A)(\text{Row}(A))^\perp=\text{Null}(A) and (Col(A))=Null(AT)(\text{Col}(A))^\perp=\text{Null}(A^T).

Every yRn\mathbf{y}\in\mathbb{R}^n can be uniquely split as y=y^+z\mathbf{y} = \hat{\mathbf{y}} + \mathbf{z} where y^W\hat{\mathbf{y}}\in W and zW\mathbf{z}\in W^\perp. This is the orthogonal decomposition theorem.

The Pythagorean theorem in Rn\mathbb{R}^n: if uv\mathbf{u}\perp\mathbf{v}, then u+v2=u2+v2\|\mathbf{u}+\mathbf{v}\|^2=\|\mathbf{u}\|^2+\|\mathbf{v}\|^2. Orthogonality and right angles are the same idea in any dimension.

Helpful?

Orthogonal sets and projections

An orthogonal set of nonzero vectors is automatically linearly independent — each vector points in a direction not reachable by combining the others.

The orthogonal projection of y\mathbf{y} onto a subspace WW is the unique closest point in WW to y\mathbf{y}. The error yy^\mathbf{y}-\hat{\mathbf{y}} is perpendicular to WW: yy^W\mathbf{y}-\hat{\mathbf{y}}\in W^\perp.

If {u1,,up}\{\mathbf{u}_1,\ldots,\mathbf{u}_p\} is an orthonormal basis for WW, the projection formula simplifies to y^=(yu1)u1++(yup)up\hat{\mathbf{y}} = (\mathbf{y}\cdot\mathbf{u}_1)\mathbf{u}_1 + \cdots + (\mathbf{y}\cdot\mathbf{u}_p)\mathbf{u}_p. Each coordinate is just a dot product — no system of equations to solve.

The projection matrix onto WW (when AA has columns forming a basis for WW) is P=A(ATA)1ATP = A(A^TA)^{-1}A^T. Note: P2=PP^2=P (idempotent) and PT=PP^T=P (symmetric).

Helpful?

The Gram-Schmidt process

Gram-Schmidt converts any linearly independent set {x1,,xp}\{\mathbf{x}_1,\ldots,\mathbf{x}_p\} into an orthonormal basis {u1,,up}\{\mathbf{u}_1,\ldots,\mathbf{u}_p\} for the same subspace.

Procedure: for each new vector xk\mathbf{x}_k, subtract its projections onto all previously constructed vj\mathbf{v}_j, then normalise. Formally: vk=xkj=1k1xkvjvjvjvj\mathbf{v}_k = \mathbf{x}_k - \sum_{j=1}^{k-1}\frac{\mathbf{x}_k\cdot\mathbf{v}_j}{\mathbf{v}_j\cdot\mathbf{v}_j}\mathbf{v}_j, then uk=vk/vk\mathbf{u}_k = \mathbf{v}_k/\|\mathbf{v}_k\|.

The QR decomposition: if A=[x1    xp]A=[\mathbf{x}_1\;\cdots\;\mathbf{x}_p], Gram-Schmidt produces A=QRA=QR where Q=[u1    up]Q=[\mathbf{u}_1\;\cdots\;\mathbf{u}_p] has orthonormal columns and RR is upper triangular. QR is the backbone of modern numerical eigenvalue algorithms.

Why subtract projections? Each new vector vk\mathbf{v}_k must be orthogonal to all previous vj\mathbf{v}_j. The projection projvjxk\text{proj}_{\mathbf{v}_j}\mathbf{x}_k is the component of xk\mathbf{x}_k that lies in the direction of vj\mathbf{v}_j. Subtracting it removes all overlap.

Helpful?

Least-squares problems

When Ax=bA\mathbf{x}=\mathbf{b} is inconsistent (no exact solution), the least-squares solution x^\hat{\mathbf{x}} minimises the residual Axb2\|A\mathbf{x}-\mathbf{b}\|^2. It is not an exact solution — it is the best approximation.

Geometric insight: the closest point in Col(A)\text{Col}(A) to b\mathbf{b} is Ax^=b^=projCol(A)bA\hat{\mathbf{x}}=\hat{\mathbf{b}}=\text{proj}_{\text{Col}(A)}\mathbf{b}. The residual bb^\mathbf{b}-\hat{\mathbf{b}} must be orthogonal to Col(A)\text{Col}(A), giving AT(bAx^)=0A^T(\mathbf{b}-A\hat{\mathbf{x}})=\mathbf{0}.

Normal equations: ATAx^=ATbA^TA\hat{\mathbf{x}} = A^T\mathbf{b}. If AA has linearly independent columns (rank(A)=n\text{rank}(A)=n), then ATAA^TA is invertible and x^=(ATA)1ATb\hat{\mathbf{x}}=(A^TA)^{-1}A^T\mathbf{b}.

Linear regression: fitting y=β0+β1xy=\beta_0+\beta_1 x to mm data points (xi,yi)(x_i,y_i) is a least-squares problem with A=(1x11xm)A=\begin{pmatrix}1&x_1\\\vdots&\vdots\\1&x_m\end{pmatrix} and b=(y1ym)\mathbf{b}=\begin{pmatrix}y_1\\\vdots\\y_m\end{pmatrix}. The least-squares solution gives the best-fit line.

The pseudo-inverse: A+=(ATA)1ATA^+ = (A^TA)^{-1}A^T (when columns are independent) satisfies A+A=IA^+A=I and x^=A+b\hat{\mathbf{x}}=A^+\mathbf{b}. The pseudo-inverse generalises the matrix inverse to non-square systems.

Helpful?