What I Learned About SVD from XCS224 Homework

2026-03-09 · Technology · #Machine Learning

In my recent XCS224 homework, I built a much clearer understanding of Singular Value Decomposition (SVD), both algebraically and geometrically. The short version is that SVD rewrites a matrix into structured pieces that expose the most important directions in the data.

For a matrix $A \in \mathbb{R}^{m \times n}$, the SVD is:

$$A = U \Sigma V^T$$

where:

  • $U \in \mathbb{R}^{m \times m}$ has orthonormal columns (left singular vectors),
  • $V \in \mathbb{R}^{n \times n}$ has orthonormal columns (right singular vectors),
  • $\Sigma \in \mathbb{R}^{m \times n}$ is diagonal (rectangular), with nonnegative singular values $\sigma_1 \ge \sigma_2 \ge \cdots \ge 0$.

If $\text{rank}(A) = r$, then there are exactly $r$ nonzero singular values. One detail I had to correct in my own wording is that these are singular values, not “singularities.”
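The definitions above are easy to check numerically. Here is a minimal sketch using NumPy's `np.linalg.svd` (the example matrix is arbitrary, chosen just for illustration):

```python
import numpy as np

# A small arbitrary 2x3 matrix, just for illustration
A = np.array([[3.0, 1.0, 1.0],
              [-1.0, 3.0, 1.0]])

# full_matrices=True returns U (m x m), the singular values s, and V^T (n x n)
U, s, Vt = np.linalg.svd(A, full_matrices=True)

# Rebuild the rectangular Sigma and verify the factorization A = U Sigma V^T
Sigma = np.zeros(A.shape)
np.fill_diagonal(Sigma, s)
assert np.allclose(A, U @ Sigma @ Vt)

# Singular values come back nonnegative and sorted in non-increasing order
assert np.all(s >= 0) and np.all(np.diff(s) <= 0)

# U and V are orthogonal: their columns are orthonormal
assert np.allclose(U.T @ U, np.eye(U.shape[0]))
assert np.allclose(Vt @ Vt.T, np.eye(Vt.shape[0]))
```

Note that NumPy returns $V^T$ directly (as `Vt`), not $V$, which is a common source of shape confusion.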

The geometric interpretation is what made everything click for me. SVD says that applying $A$ is equivalent to three steps: rotate or reflect by $V^T$, scale along orthogonal axes by $\Sigma$, then rotate or reflect by $U$. This means SVD is not just factorization for computation; it is a way to reveal the intrinsic axes of variation in the data.

In terms of row vectors of $A$, the matrix $V$ provides an orthonormal basis for directions in the input feature space. The singular values in $\Sigma$ tell us how strongly each direction contributes. Large singular values correspond to directions where the data has strong signal; small singular values are often less informative and can be treated as noise-dominated components.

The most practical part of the homework was truncated SVD. Instead of keeping all $r$ nonzero singular values, we keep only the top $k$, where $k < r$. Denote the corresponding pieces by $U_k, \Sigma_k, V_k$. Then:

$$A_k = U_k \Sigma_k V_k^T$$

is the best rank-$k$ approximation of $A$ in both Frobenius norm and spectral norm. This is the Eckart–Young theorem: among all rank-$k$ matrices, $A_k$ minimizes the reconstruction error.
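This optimality claim can be sketched numerically. The snippet below (my own illustration with random data, not from the homework) builds $A_k$ from the top $k$ singular triples, checks that its Frobenius error equals the norm of the discarded singular values, and compares it against an arbitrary competing rank-$k$ matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 10))

# Thin SVD: U is (20, 10), s has 10 values, Vt is (10, 10)
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 3
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Eckart-Young: the Frobenius error of A_k is exactly
# the root-sum-of-squares of the discarded singular values
err = np.linalg.norm(A - A_k, "fro")
assert np.isclose(err, np.sqrt(np.sum(s[k:] ** 2)))

# Any other rank-k matrix (here, a random one) does at least as badly
B = rng.standard_normal((20, k)) @ rng.standard_normal((k, 10))
assert np.linalg.norm(A - B, "fro") >= err
```

A random competitor is of course a weak opponent; the theorem guarantees the same inequality against every rank-$k$ matrix, which is what makes truncated SVD so useful.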

From a representation-learning perspective, truncated SVD gives a low-dimensional subspace that captures the most important structure. If $a_i$ is a row vector of $A$, we can project it onto the top-$k$ right singular vectors to get a compact coordinate vector:

$$z_i = a_i V_k$$

This is the core idea I now understand better: the original high-dimensional row vectors are mapped into a lower-dimensional latent space spanned by the columns of $V_k$, while preserving as much meaningful variation as possible.
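The projection step is a one-liner in practice. A sketch, again with arbitrary random data: it also checks the identity $A V_k = U_k \Sigma_k$, which means the latent coordinates can be read off the SVD directly without a separate matrix multiply.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((100, 50))   # 100 rows, 50-dimensional features

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 5
V_k = Vt[:k, :].T                    # (50, k): top-k right singular vectors as columns

# Project every row a_i at once: z_i = a_i V_k, stacked as Z = A V_k
Z = A @ V_k                          # (100, k) compact latent coordinates

# Equivalent form: A V_k = U_k Sigma_k, so Z is just the scaled left vectors
assert np.allclose(Z, U[:, :k] * s[:k])

# Mapping Z back with V_k^T recovers the best rank-k approximation A_k = Z V_k^T
A_k = Z @ V_k.T
assert np.isclose(np.linalg.norm(A - A_k, "fro"),
                  np.sqrt(np.sum(s[k:] ** 2)))
```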

In NLP settings (which is why this matters in XCS224), this connects directly to building dense representations from sparse co-occurrence-style matrices. SVD can compress a large sparse signal into a smaller dense embedding space, and the top singular directions often encode useful semantic structure. Even before neural embedding methods, this gave a principled linear-algebraic way to discover latent dimensions.
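To make that concrete, here is a toy sketch of the idea. The vocabulary, context labels, and counts below are entirely made up for illustration; the point is only that words with similar co-occurrence rows land near each other in the SVD embedding space:

```python
import numpy as np

# Hypothetical word-context co-occurrence counts (invented numbers, not a real corpus)
words = ["king", "queen", "apple", "orange"]
# contexts: "crown", "royal", "fruit", "juice"
X = np.array([[5.0, 4.0, 0.0, 1.0],   # king
              [4.0, 5.0, 1.0, 0.0],   # queen
              [0.0, 1.0, 5.0, 4.0],   # apple
              [1.0, 0.0, 4.0, 5.0]])  # orange

# Truncated SVD: each word gets a dense k-dimensional vector U_k Sigma_k
U, s, Vt = np.linalg.svd(X)
k = 2
embeddings = U[:, :k] * s[:k]        # rows are word embeddings

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Words sharing contexts ("king"/"queen") should be more similar
# than words that do not ("king"/"apple")
sim_king_queen = cosine(embeddings[0], embeddings[1])
sim_king_apple = cosine(embeddings[0], embeddings[2])
```

This is essentially the latent semantic analysis recipe, shrunk to a four-word toy so the geometry is easy to inspect.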

My main takeaways are:

  1. SVD is a structured decomposition $A = U \Sigma V^T$, not just a mechanical matrix trick.
  2. $U$ and $V$ are orthonormal bases, and $\Sigma$ ranks directions by importance through singular values.
  3. Truncated SVD keeps only the strongest $k$ components, giving dimensionality reduction and denoising.
  4. Projecting rows of $A$ onto $V_k$ gives compact latent representations.
  5. The approximation $A_k$ is theoretically optimal among all rank-$k$ approximations.

Overall, this homework turned SVD from a formula I memorized into a tool I can reason about: identify signal, compress structure, and build useful low-dimensional representations from high-dimensional data.