What I Learned About SVD from XCS224 Homework

2026-03-09 · Technology · #Machine Learning

In my recent XCS224 homework, I built a much clearer understanding of Singular Value Decomposition (SVD), both algebraically and geometrically. The short version is that SVD rewrites a matrix into structured pieces that expose the most important directions in the data.

For a matrix $A \in \mathbb{R}^{m \times n}$, the SVD is:

$$A = U \Sigma V^T$$

where:

  • $U \in \mathbb{R}^{m \times m}$ has orthonormal columns (left singular vectors),
  • $V \in \mathbb{R}^{n \times n}$ has orthonormal columns (right singular vectors),
  • $\Sigma \in \mathbb{R}^{m \times n}$ is diagonal (rectangular), with nonnegative singular values $\sigma_1 \ge \sigma_2 \ge \cdots \ge 0$.

If $\text{rank}(A) = r$, then there are exactly $r$ nonzero singular values. One detail I had to correct in my own wording is that these are singular values, not “singularities.”
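The definitions above are easy to check numerically. Here is a minimal sketch using NumPy's `np.linalg.svd` (the example matrix is arbitrary, chosen just for illustration):

```python
import numpy as np

# A small arbitrary 2x3 matrix, just for illustration
A = np.array([[3.0, 1.0, 1.0],
              [-1.0, 3.0, 1.0]])

# full_matrices=True returns U (m x m), the singular values s, and V^T (n x n)
U, s, Vt = np.linalg.svd(A, full_matrices=True)

# Rebuild the rectangular Sigma and verify the factorization A = U Sigma V^T
Sigma = np.zeros(A.shape)
np.fill_diagonal(Sigma, s)
assert np.allclose(A, U @ Sigma @ Vt)

# Singular values come back nonnegative and sorted in non-increasing order
assert np.all(s >= 0) and np.all(np.diff(s) <= 0)

# U and V are orthogonal: their columns are orthonormal
assert np.allclose(U.T @ U, np.eye(U.shape[0]))
assert np.allclose(Vt @ Vt.T, np.eye(Vt.shape[0]))
```

Note that NumPy returns $V^T$ directly (as `Vt`), not $V$, which is a common source of shape confusion.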

The geometric interpretation is what made everything click for me. SVD says that applying $A$ is equivalent to three steps: rotate or reflect by $V^T$, scale along orthogonal axes by $\Sigma$, then rotate or reflect by $U$. This means SVD is not just factorization for computation; it is a way to reveal the intrinsic axes of variation in the data.

In terms of row vectors of $A$, the matrix $V$ provides an orthonormal basis for directions in the input feature space. The singular values in $\Sigma$ tell us how strongly each direction contributes. Large singular values correspond to directions where the data has strong signal; small singular values are often less informative and can be treated as noise-dominated components.

The most practical part of the homework was truncated SVD. Instead of keeping all $r$ nonzero singular values, we keep only the top $k$, where $k < r$. Denote the corresponding pieces by $U_k, \Sigma_k, V_k$. Then:

$$A_k = U_k \Sigma_k V_k^T$$

is the best rank-$k$ approximation of $A$ in both Frobenius norm and spectral norm. This is the Eckart–Young theorem: among all rank-$k$ matrices, $A_k$ minimizes the reconstruction error.
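This optimality claim can be sketched numerically. The snippet below (my own illustration with random data, not from the homework) builds $A_k$ from the top $k$ singular triples, checks that its Frobenius error equals the norm of the discarded singular values, and compares it against an arbitrary competing rank-$k$ matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 10))

# Thin SVD: U is (20, 10), s has 10 values, Vt is (10, 10)
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 3
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Eckart-Young: the Frobenius error of A_k is exactly
# the root-sum-of-squares of the discarded singular values
err = np.linalg.norm(A - A_k, "fro")
assert np.isclose(err, np.sqrt(np.sum(s[k:] ** 2)))

# Any other rank-k matrix (here, a random one) does at least as badly
B = rng.standard_normal((20, k)) @ rng.standard_normal((k, 10))
assert np.linalg.norm(A - B, "fro") >= err
```

A random competitor is of course a weak opponent; the theorem guarantees the same inequality against every rank-$k$ matrix, which is what makes truncated SVD so useful.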

From a representation-learning perspective, truncated SVD gives a low-dimensional subspace that captures the most important structure. If $a_i$ is a row vector of $A$, we can project it onto the top-$k$ right singular vectors to get a compact coordinate vector:

$$z_i = a_i V_k$$

This is the core idea I now understand better: the original high-dimensional row vectors are mapped into a lower-dimensional latent space spanned by the columns of $V_k$, while preserving as much meaningful variation as possible.
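The projection step is a one-liner in practice. A sketch, again with arbitrary random data: it also checks the identity $A V_k = U_k \Sigma_k$, which means the latent coordinates can be read off the SVD directly without a separate matrix multiply.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((100, 50))   # 100 rows, 50-dimensional features

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 5
V_k = Vt[:k, :].T                    # (50, k): top-k right singular vectors as columns

# Project every row a_i at once: z_i = a_i V_k, stacked as Z = A V_k
Z = A @ V_k                          # (100, k) compact latent coordinates

# Equivalent form: A V_k = U_k Sigma_k, so Z is just the scaled left vectors
assert np.allclose(Z, U[:, :k] * s[:k])

# Mapping Z back with V_k^T recovers the best rank-k approximation A_k = Z V_k^T
A_k = Z @ V_k.T
assert np.isclose(np.linalg.norm(A - A_k, "fro"),
                  np.sqrt(np.sum(s[k:] ** 2)))
```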

In NLP settings (which is why this matters in XCS224), this connects directly to building dense representations from sparse co-occurrence-style matrices. SVD can compress a large sparse signal into a smaller dense embedding space, and the top singular directions often encode useful semantic structure. Even before neural embedding methods, this gave a principled linear-algebraic way to discover latent dimensions.
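To make that concrete, here is a toy sketch of the idea. The vocabulary, context labels, and counts below are entirely made up for illustration; the point is only that words with similar co-occurrence rows land near each other in the SVD embedding space:

```python
import numpy as np

# Hypothetical word-context co-occurrence counts (invented numbers, not a real corpus)
words = ["king", "queen", "apple", "orange"]
# contexts: "crown", "royal", "fruit", "juice"
X = np.array([[5.0, 4.0, 0.0, 1.0],   # king
              [4.0, 5.0, 1.0, 0.0],   # queen
              [0.0, 1.0, 5.0, 4.0],   # apple
              [1.0, 0.0, 4.0, 5.0]])  # orange

# Truncated SVD: each word gets a dense k-dimensional vector U_k Sigma_k
U, s, Vt = np.linalg.svd(X)
k = 2
embeddings = U[:, :k] * s[:k]        # rows are word embeddings

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Words sharing contexts ("king"/"queen") should be more similar
# than words that do not ("king"/"apple")
sim_king_queen = cosine(embeddings[0], embeddings[1])
sim_king_apple = cosine(embeddings[0], embeddings[2])
```

This is essentially the latent semantic analysis recipe, shrunk to a four-word toy so the geometry is easy to inspect.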

My main takeaways are:

  1. SVD is a structured decomposition $A = U \Sigma V^T$, not just a mechanical matrix trick.
  2. $U$ and $V$ are orthonormal bases, and $\Sigma$ ranks directions by importance through singular values.
  3. Truncated SVD keeps only the strongest $k$ components, giving dimensionality reduction and denoising.
  4. Projecting rows of $A$ onto $V_k$ gives compact latent representations.
  5. The approximation $A_k$ is theoretically optimal among all rank-$k$ approximations.

Overall, this homework turned SVD from a formula I memorized into a tool I can reason about: identify signal, compress structure, and build useful low-dimensional representations from high-dimensional data.