A step-by-step interactive walkthrough of how PCA finds the directions of maximum variance
Click on the canvas to add data points, or use the preset buttons. PCA will find the directions along which your data varies the most.
We subtract the mean from each point so the data's centroid moves to the origin (0, 0). This ensures PCA finds directions of variance from the center of the data.
Each point is shifted so the new mean becomes (0, 0).
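Centering is a one-liner in code. A minimal NumPy sketch, using a made-up handful of points:

```python
import numpy as np

# A hypothetical set of clicked points (x, y); not from the demo.
points = np.array([[1.0, 2.0],
                   [3.0, 3.0],
                   [5.0, 7.0]])

# Subtract the per-coordinate mean so the centroid moves to the origin.
centered = points - points.mean(axis=0)

print(centered.mean(axis=0))  # → [0. 0.]
```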
The symbol Σ (Greek letter "sigma") denotes the covariance matrix:

$$\Sigma = \begin{pmatrix} \text{Var}(x) & \text{Cov}(x,y) \\ \text{Cov}(x,y) & \text{Var}(y) \end{pmatrix}$$
The matrix is symmetric: Cov(x,y) = Cov(y,x). It encodes the "shape" of the data cloud.
The covariance matrix is built from outer products of centered vectors. Let $\mathbf{p}_i$ denote the i-th point (a 2D vector):
If a centered point has coordinates $(x, y)$, its outer product with itself is a 2×2 matrix:

$$\begin{pmatrix} x \\ y \end{pmatrix} \begin{pmatrix} x & y \end{pmatrix} = \begin{pmatrix} x^2 & xy \\ yx & y^2 \end{pmatrix}$$
Summing these matrices over all points (and dividing by the number of points $n$) gives the covariance matrix: the diagonal entries are averages of squares (variances), and the off-diagonal entries are averages of products (covariances). Notice that $xy = yx$, so the covariance matrix is always symmetric.
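This construction can be checked directly. A short NumPy sketch (the point values are illustrative; `bias=True` makes `np.cov` use the same divide-by-n convention):

```python
import numpy as np

points = np.array([[1.0, 2.0],
                   [3.0, 3.0],
                   [5.0, 7.0]])
centered = points - points.mean(axis=0)
n = len(centered)

# Average of the outer products p_i p_i^T over all centered points.
sigma = sum(np.outer(p, p) for p in centered) / n

# Diagonal holds the variances, off-diagonal the (symmetric) covariance.
print(np.allclose(sigma, np.cov(centered, rowvar=False, bias=True)))  # → True
print(np.allclose(sigma, sigma.T))                                    # → True
```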
We can think of the covariance matrix as a transformation that shapes data. Starting from a circular cloud of points (Var(x)=Var(y)=1, Cov=0), the matrix stretches, squashes, and rotates it into an ellipse.
Your data has its own Var(x), Var(y), and Cov(x,y): try those values to see the transformation that produces your data's shape!
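The "circle into ellipse" picture can be reproduced numerically. A sketch with an assumed target covariance (the numbers are arbitrary): sample a circular standard-normal cloud, then multiply by any matrix $A$ with $AA^\top = \Sigma$ (the Cholesky factor is one convenient choice), and the result has covariance approximately $\Sigma$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Circular cloud: Var(x) = Var(y) = 1, Cov = 0 (in expectation).
cloud = rng.standard_normal((20_000, 2))

# Arbitrary target covariance, for illustration only.
target = np.array([[2.0, 1.2],
                   [1.2, 1.0]])

# Any A with A @ A.T == target reshapes the circle into the target ellipse.
A = np.linalg.cholesky(target)
ellipse = cloud @ A.T

print(np.cov(ellipse, rowvar=False))  # ≈ target
```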
We want to find the directions along which our data varies the most. These are called principal components. Mathematically, they turn out to be the eigenvectors of the covariance matrix. But before we can find them, we need to understand a few concepts.
Geometrically, a determinant measures how much a transformation "scales area":
• Det = 0: The matrix squashes 2D space onto a line — it loses a dimension
• Det ≠ 0: The matrix is invertible — no dimension is completely lost
For a 2×2 matrix $\begin{pmatrix} a & b \\ c & d \end{pmatrix}$, the determinant is $ad - bc$. For our covariance matrix Σ:

$$\det(\Sigma) = \text{Var}(x)\,\text{Var}(y) - \text{Cov}(x,y)^2$$
When det(Σ) = 0: This means Var(x)·Var(y) = Cov(x,y)². The covariance "maxes out" — the variables co-vary as much as they possibly can. This is perfect correlation: all points lie exactly on a line. The data is truly 1-dimensional, just embedded in 2D.
When det(Σ) > 0: The data genuinely spreads in multiple directions. The larger the determinant, the more "2D" the data is.
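Both cases are easy to check numerically. A sketch with two hypothetical datasets:

```python
import numpy as np

def cov_det(points):
    sigma = np.cov(points, rowvar=False, bias=True)
    return np.linalg.det(sigma)

# Perfect correlation: every point lies on the line y = 2x.
line = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]])

# Genuine 2D spread: corners of a square.
square = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

print(cov_det(line))    # ≈ 0 (the data is really 1D)
print(cov_det(square))  # > 0 (the data spreads in two directions)
```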
An eigenvector of a matrix is a special direction: when you apply the matrix transformation, the vector only gets stretched or shrunk, not rotated. The stretching factor is called the eigenvalue (λ, "lambda"):

$$\Sigma \mathbf{v} = \lambda \mathbf{v}$$
This says: "Applying Σ to vector v just scales it by λ."
For PCA: The eigenvectors of the covariance matrix point along the principal axes of the data ellipse. The eigenvalue λ tells us the variance along that direction. The eigenvector with the largest λ is PC1 (the direction of maximum variance).
Why not just drop x or y? If your data spreads diagonally, dropping x loses half the information and dropping y loses the other half. Instead, we rotate to align with the principal axes first, then drop the direction with the least variance. The eigenvectors tell us exactly how to rotate.
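The defining property is easy to verify with NumPy. A sketch using an assumed covariance matrix (`np.linalg.eigh` is the eigensolver for symmetric matrices; it returns eigenvalues in ascending order):

```python
import numpy as np

# Hypothetical covariance matrix of diagonally spread data.
sigma = np.array([[2.0, 1.2],
                  [1.2, 1.0]])

eigvals, eigvecs = np.linalg.eigh(sigma)  # ascending eigenvalues

# PC1 is the eigenvector paired with the largest eigenvalue.
lam1 = eigvals[-1]
pc1 = eigvecs[:, -1]

# Applying sigma to pc1 only scales it by lam1; no rotation happens.
print(np.allclose(sigma @ pc1, lam1 * pc1))  # → True
```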
To find λ, we rearrange Σv = λv into (Σ − λI)v = 0. For a non-zero solution v to exist, the matrix (Σ − λI) must be singular (determinant = 0):

$$\det(\Sigma - \lambda I) = 0$$
Subtracting λ from the diagonal gives:

$$\Sigma - \lambda I = \begin{pmatrix} \text{Var}(x) - \lambda & \text{Cov}(x,y) \\ \text{Cov}(x,y) & \text{Var}(y) - \lambda \end{pmatrix}$$
Applying the determinant formula $ad - bc$:

$$(\text{Var}(x) - \lambda)(\text{Var}(y) - \lambda) - \text{Cov}(x,y)^2 = 0$$
Expanding gives a quadratic equation in λ:

$$\lambda^2 - \big(\text{Var}(x) + \text{Var}(y)\big)\,\lambda + \text{Var}(x)\,\text{Var}(y) - \text{Cov}(x,y)^2 = 0$$
This quadratic has two solutions — those are our two eigenvalues λ₁ and λ₂.
Trace = Var(x) + Var(y) = λ₁ + λ₂ — the total variance is split between the two principal components.
Det = Var(x)·Var(y) − Cov(x,y)² = λ₁ × λ₂ — the product of eigenvalues.
This is why det ≈ 0 means one eigenvalue is tiny: if λ₁ × λ₂ ≈ 0 but λ₁ + λ₂ is substantial, then one λ must be near zero. That direction has almost no variance — perfect for discarding!
Our equation $\lambda^2 - \text{trace} \cdot \lambda + \text{det} = 0$ is solved by the quadratic formula:

$$\lambda_{1,2} = \frac{\text{trace} \pm \sqrt{\text{trace}^2 - 4\,\text{det}}}{2}$$
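The closed-form eigenvalues can be checked against the trace and determinant identities. A sketch with an assumed covariance matrix:

```python
import numpy as np

# Assumed covariance matrix, for illustration only.
sigma = np.array([[2.0, 1.2],
                  [1.2, 1.0]])

trace = sigma[0, 0] + sigma[1, 1]                   # Var(x) + Var(y)
det = sigma[0, 0] * sigma[1, 1] - sigma[0, 1] ** 2  # Var(x)Var(y) - Cov^2

# Quadratic formula for lambda^2 - trace*lambda + det = 0.
disc = np.sqrt(trace**2 - 4 * det)
lam1, lam2 = (trace + disc) / 2, (trace - disc) / 2

print(lam1, lam2)                      # ≈ 2.8 and 0.2
print(np.isclose(lam1 + lam2, trace))  # → True: total variance is split
print(np.isclose(lam1 * lam2, det))    # → True
```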
Now that we have λ₁ and λ₂, we find each eigenvector by solving (Σ − λI)v = 0:
From the first row: $(\text{Var}(x) - \lambda)v_x + \text{Cov}(x,y) \cdot v_y = 0$
Solving and normalizing gives the principal component directions:

$$\mathbf{v} \propto \big(\text{Cov}(x,y),\ \lambda - \text{Var}(x)\big), \qquad \hat{\mathbf{v}} = \frac{\mathbf{v}}{\|\mathbf{v}\|}$$
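One solution of the first-row equation is $v = (\text{Cov}(x,y),\ \lambda - \text{Var}(x))$, which a quick NumPy sketch (with an assumed covariance matrix) confirms is an eigenvector:

```python
import numpy as np

# Assumed covariance matrix, for illustration only.
sigma = np.array([[2.0, 1.2],
                  [1.2, 1.0]])

lam1 = np.linalg.eigvalsh(sigma)[-1]  # largest eigenvalue

# From the first row: (Var(x) - lam)*v_x + Cov*v_y = 0,
# which v = (Cov, lam - Var(x)) satisfies.
v = np.array([sigma[0, 1], lam1 - sigma[0, 0]])
v = v / np.linalg.norm(v)  # normalize to unit length

print(np.allclose(sigma @ v, lam1 * v))  # → True: v is an eigenvector
```

One caveat: if Cov(x,y) = 0 this particular formula degenerates to the zero vector, but in that case the x and y axes themselves are already the principal components.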
By projecting each point onto PC1, we reduce 2D data to 1D while keeping the direction of maximum variance. The distance from each original point to its projection represents the information "lost" (variance along PC2).
To project point $\mathbf{x}$ onto PC1 (eigenvector $\mathbf{v}_1$):

$$\text{score} = \mathbf{x} \cdot \mathbf{v}_1, \qquad \text{projection} = \text{score} \cdot \mathbf{v}_1$$
The projection happens in two steps with different purposes:
Example: If score = 2.5 and v₁ = (0.8, 0.6), then:
• Score alone: just "2.5" (1D representation)
• Projected point: 2.5 × (0.8, 0.6) = (2.0, 1.5) (location in 2D)
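Running the example numbers through both steps, with a hypothetical centered point chosen so that its score comes out to 2.5:

```python
import numpy as np

v1 = np.array([0.8, 0.6])  # unit-length PC1 from the example
x = np.array([1.4, 2.3])   # hypothetical centered point with score 2.5

score = x @ v1             # step 1: 1D coordinate along PC1
projected = score * v1     # step 2: back to a 2D location on the PC1 line

print(score)                          # ≈ 2.5
print(projected)                      # ≈ [2.  1.5]
print(np.linalg.norm(x - projected))  # distance to the line: variance "lost"
```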
This is the core tradeoff of PCA: we sacrifice some variance (information) for a lower-dimensional representation.
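Putting every step together (center, covariance, eigenvectors, project), here is a minimal sketch of the whole pipeline; the function name `pca_1d` and the random test data are made up:

```python
import numpy as np

def pca_1d(points):
    """Reduce 2D points to 1D scores along PC1."""
    centered = points - points.mean(axis=0)            # 1. center
    sigma = np.cov(centered, rowvar=False, bias=True)  # 2. covariance matrix
    eigvals, eigvecs = np.linalg.eigh(sigma)           # 3. eigendecomposition
    pc1 = eigvecs[:, -1]                               # 4. max-variance direction
    scores = centered @ pc1                            # 5. project to 1D
    return scores, pc1, eigvals

rng = np.random.default_rng(0)
points = rng.standard_normal((200, 2)) @ np.array([[2.0, 1.0],
                                                   [0.0, 0.5]])
scores, pc1, eigvals = pca_1d(points)

# The variance of the 1D scores equals the largest eigenvalue.
print(np.isclose(scores.var(), eigvals[-1]))  # → True
```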