What I Learned About SVD from XCS224 Homework
In my recent XCS224 homework, I built a much clearer understanding of Singular Value Decomposition (SVD), both algebraically and geometrically. The short version is that SVD rewrites a matrix into structured pieces that expose the most important directions in the data.
For a matrix $A \in \mathbb{R}^{m \times n}$, the SVD is:

$$A = U \Sigma V^\top$$

where:
- $U \in \mathbb{R}^{m \times m}$ has orthonormal columns (left singular vectors),
- $V \in \mathbb{R}^{n \times n}$ has orthonormal columns (right singular vectors),
- $\Sigma \in \mathbb{R}^{m \times n}$ is diagonal (rectangular), with nonnegative singular values $\sigma_1 \ge \sigma_2 \ge \cdots \ge 0$.
If $\operatorname{rank}(A) = r$, then there are exactly $r$ nonzero singular values. One detail I had to correct in my own wording is that these are singular values, not “singularities.”
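These properties are easy to check numerically. A minimal NumPy sketch with a small toy matrix (the numbers are made up for illustration):

```python
import numpy as np

# Toy data matrix: 5 rows, 3 columns (hypothetical values).
A = np.array([[2.0, 0.0, 1.0],
              [0.0, 1.0, 0.0],
              [1.0, 1.0, 1.0],
              [3.0, 0.0, 2.0],
              [0.0, 2.0, 0.0]])

# Thin SVD: U is 5x3, s holds the singular values, Vt is V transposed (3x3).
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Columns of U and rows of Vt are orthonormal.
assert np.allclose(U.T @ U, np.eye(3))
assert np.allclose(Vt @ Vt.T, np.eye(3))

# Singular values are nonnegative and sorted in decreasing order.
assert np.all(s >= 0) and np.all(np.diff(s) <= 0)

# The factorization reconstructs A exactly.
assert np.allclose(U @ np.diag(s) @ Vt, A)
```

Note that `np.linalg.svd` returns $V^\top$ (here `Vt`), not $V$ itself, which is an easy thing to trip over.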
The geometric interpretation is what made everything click for me. SVD says that applying $A$ is equivalent to three steps: rotate or reflect by $V^\top$, scale along orthogonal axes by $\Sigma$, then rotate or reflect by $U$. This means SVD is not just a factorization for computation; it is a way to reveal the intrinsic axes of variation in the data.
In terms of row vectors of $A$, the matrix $V$ provides an orthonormal basis for directions in the input feature space. The singular values in $\Sigma$ tell us how strongly each direction contributes. Large singular values correspond to directions where the data has strong signal; small singular values are often less informative and can be treated as noise-dominated components.
The most practical part from the homework was truncated SVD. Instead of keeping all nonzero singular values, we keep only the top $k$, where $k < r$. Denote these by $\sigma_1, \dots, \sigma_k$. Then:

$$A_k = U_k \Sigma_k V_k^\top$$

is the best rank-$k$ approximation of $A$ in both Frobenius norm and spectral norm. Among all rank-$k$ matrices, this one minimizes reconstruction error.
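The optimality claim can be verified numerically. A quick sketch (random data, NumPy): the Frobenius error of the rank-$k$ truncation equals the root sum of squares of the discarded singular values, and the spectral-norm error equals the first discarded singular value.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 6))

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Truncate to rank k by keeping the top-k singular triples.
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
assert np.linalg.matrix_rank(A_k) == k

# Frobenius error = sqrt of the sum of discarded squared singular values.
err_fro = np.linalg.norm(A - A_k, 'fro')
assert np.isclose(err_fro, np.sqrt(np.sum(s[k:] ** 2)))

# Spectral-norm error = the largest discarded singular value.
assert np.isclose(np.linalg.norm(A - A_k, 2), s[k])
```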
From a representation-learning perspective, truncated SVD gives a low-dimensional subspace that captures the most important structure. If $a \in \mathbb{R}^n$ is a row vector of $A$, we can project it onto the top-$k$ right singular vectors to get a compact coordinate vector:

$$z = a\, V_k \in \mathbb{R}^k$$
This is the core idea I now understand better: the original high-dimensional row vectors are mapped into a lower-dimensional latent space spanned by the columns of $V_k$, while preserving as much meaningful variation as possible.
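As a sketch of this projection in NumPy: each row of $A$ maps to a $k$-dimensional coordinate vector, and the stacked coordinates equal $U_k \Sigma_k$, since $A V_k = U_k \Sigma_k$.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((100, 20))   # 100 row vectors in R^20

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 4
V_k = Vt[:k, :].T                    # 20 x k: top-k right singular vectors

# Each row a of A maps to a compact k-dimensional coordinate z = a V_k.
Z = A @ V_k                          # 100 x k latent coordinates

# Equivalently, Z = U_k Sigma_k, because A V_k = U Sigma V^T V_k.
assert np.allclose(Z, U[:, :k] * s[:k])
```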
This is why it matters in XCS224: in NLP settings, SVD is directly connected to building dense representations from sparse co-occurrence-style matrices. SVD can compress a large sparse signal into a smaller dense embedding space, and the top singular directions often encode useful semantic structure. Even before neural embedding methods, this gave a principled linear-algebraic way to discover latent dimensions.
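A toy sketch of that pipeline, with a small hypothetical word-word co-occurrence matrix (the words and counts are invented for illustration, not taken from the homework):

```python
import numpy as np

# Hypothetical word-word co-occurrence counts (rows and columns = words).
words = ["king", "queen", "man", "woman", "apple"]
X = np.array([[0, 4, 3, 1, 0],
              [4, 0, 1, 3, 0],
              [3, 1, 0, 2, 1],
              [1, 3, 2, 0, 1],
              [0, 0, 1, 1, 0]], dtype=float)

U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Dense k-dimensional embeddings: project each sparse row onto the
# top-k right singular directions.
k = 2
embeddings = X @ Vt[:k, :].T         # shape (5, 2): one dense vector per word
assert embeddings.shape == (len(words), k)
```

In practice one would apply this to a much larger matrix, often after a reweighting step, but the compression idea is the same.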
My main takeaways are:
- SVD is a structured decomposition $A = U \Sigma V^\top$, not just a mechanical matrix trick.
- $U$ and $V$ are orthonormal bases, and $\Sigma$ ranks directions by importance through singular values.
- Truncated SVD keeps only the strongest components, giving dimensionality reduction and denoising.
- Projecting rows of $A$ onto $V_k$ gives compact latent representations.
- The truncated approximation $A_k$ is optimal among all rank-$k$ approximations (the Eckart–Young theorem).
Overall, this homework turned SVD from a formula I memorized into a tool I can reason about: identify signal, compress structure, and build useful low-dimensional representations from high-dimensional data.