The Singular Value Decomposition#

Introduction#

The Singular Value Decomposition or SVD is a matrix factorization that produces similar outputs to eigenvectors and eigenvalues, but can be calculated for general \(m\times n\) matrices where eigenvalues and eigenvectors are only defined for square matrices. There are numerous important applications that rely on the singular value decomposition as well as a number of deep theoretical insights that it provides. First, let’s consider its structure.

The Singular Value Decomposition

The singular value decomposition or SVD of an \(m\times n\) real matrix \(A\) is a factorization

\[ A = U\Sigma V^T \]

where \(U\) is an orthogonal \(m\times m\) matrix, \(\Sigma\) is an \(m\times n\) diagonal matrix, and \(V\) is an \(n\times n\) orthogonal matrix. It is in general not unique.

Proving that such a decomposition exists is relatively straightforward.

Proof: Suppose that \(A\) is an \(m\times n\) matrix of rank \(r\). Recall that \(A^TA\) is an \(n\times n\) symmetric matrix, so (by the Spectral Theorem) it is diagonalizable and its eigenvectors are orthogonal. Moreover, it is possible to show that all of the eigenvalues of \(A^TA\) are greater than or equal to 0 (exercise!), so for \(1\leq i\leq n\), we can set \(\sigma_i = \sqrt{\lambda_i}\). Without loss of generality, assume \(\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_n\), and order the associated unit eigenvectors \(\mathbf{v}_1, \mathbf{v}_2,\dots,\mathbf{v}_n\) as the columns of an orthogonal matrix \(V\) analogously.

Now for \(1\leq i\leq r\), define \(\mathbf{u}_i = \frac{1}{\sigma_i}A\mathbf{v}_i\). The \(\mathbf{u}_i's\) are orthonormal unit vectors (exercise!), and if \(r \leq m\), we can use Gram-Schmidt to extend the set \(\mathbf{u}_1,\dots,\mathbf{u}_r\) to an orthonormal basis of \(\mathbb{R}^m\) and let \(U\) be the matrix with columns \(\mathbf{u}_i\), \(1\leq i\leq m\).

Finally, define \(\Sigma \in M_{m\times n}(\mathbb{R})\) as the \(m\times n\) matrix with \(\Sigma_{ii} = \sigma_i\) for \(1\leq i\leq \text{min}(m, n)\).

To verify that this produces the desired decomposition, observe that by defining \(\mathbf{u}_i\) as above, we can write \(AV = U\Sigma\), and because \(V\) is orthogonal, \(A = U\Sigma V^T\). \(\blacksquare\)

Computation of the SVD#

Computation of the full SVD is expensive. We will not do any computations by hand here; instead, we will describe the process as it is usually implemented in software. Typically, the SVD is calculated in two steps. First, the matrix under consideration is reduced to a bidiagonal matrix \(B\). A bidiagonal matrix is a band matrix whose only nonzero entries are on the diagonal and either immediately above or immediately below the diagonal. Second, an eigenvalue approximation algorithm such as the \(QR\)-algorithm is applied to \(B^TB\) to obtain the eigenvalues and eigenvectors of \(B^TB\) which can be used to construct \(V\) and \(\Sigma\), and from these the \(\mathbf{u}_i\)’s can be calculated using the formula above.

Assuming \(m\geq n\), the reduction of a matrix to bidiagonal form is an \(\mathcal{O}(mn^2)\) operation (this is just Gaussian elimination, which is \(\mathcal{O}(n^3)\) when the matrix is \(n\times n\) and \(\mathcal{O}(mn^2)\) when the matrix is \(m\times n\)), but once the matrix is in bidiagonal form, the second step is much less expensive: only \(\mathcal{O}(n)\), so that the computational complexity of the decomposition overall is \(\mathcal{O}(mn^2)\).

In addition, in many (most?) practical applications, the full SVD is not required, and truncated or ‘thin’ singular value decompositions can be more efficiently calculated and stored. These are beyond the scope of this discussion, but are quite similar to the ‘thin’ QR decomposition previously discussed.

An Application of the SVD: Condition Numbers#

The Condition Number of a Matrix

The condition number of a real matrix \(A\) is written \(\kappa(A)\) and defined as

\[ \kappa(A) = \frac{\sigma_{max}}{\sigma_{min}}, \]

where \(\sigma_{max}\) denotes the largest singular value of \(A\) and \(\sigma_{min}\) denotes the smallest nonzero singular value of \(A\).

If we think of singular values as playing a role analogous to eigenvalues for nonsquare matrices, we can view the condition number of a matrix as the ratio between the largest and the smallest ‘stretches’ that a matrix applies when it multiplies a vector. Intuitively, a condition number near 1 indicates that these stretches are about equal, and that is a good thing, as will be demonstrated below. Such a matrix is called well-conditioned. A larger condition number (ill-conditioned) indicates a matrix that is closer to being singular than a matrix with a smaller condition number, and \(\kappa(A) = \infty\) if and only if \(A\) is singular.

Example:

import numpy as np
import scipy.linalg as la

A = np.array([[2, 0], [0, 1]])
B = np.array([[1, 1], [1, 1.000001]])

# calculate the singular values of A and B
A_vals = la.svdvals(A)
B_vals = la.svdvals(B)

print(f'Singular values of A: {A_vals}')
print(f'Condition Number of A: {A_vals.max() / A_vals.min()}')
print(f'Singular values of B: {B_vals}')
print(f'Condition Number of B: {B_vals.max() / B_vals.min()}')
Singular values of A: [2. 1.]
Condition Number of A: 2.0
Singular values of B: [2.00000050e+00 4.99999875e-07]
Condition Number of B: 4000002.0003309543

The example above is trivial but illustrates what the condition number is measuring: \(A\) is a diagonal matrix with orthogonal rows/columns, and \(\kappa(A) = 2\). \(B\), on the other hand, has rows/columns that are almost parallel. They are not exactly parallel, so \(B\) is not singular, but it is very close to singular, and the large condition number of \(4000002.0003309543\) reflects this.

Example, continued: Suppose now we work with a vector that has some measurement error. That is, suppose the vector we should be working with is

\[\begin{split} \mathbf{x} = \begin{bmatrix} 1 \\ -1 \end{bmatrix}, \end{split}\]

but because of some measurement error, we are actually working with

\[\begin{split} \mathbf{x} = \begin{bmatrix} 1.001 \\ -0.999 \end{bmatrix}. \end{split}\]

The error in measurement here is quite small; to quantify this, consider the ratio of the norm of the measured \(\mathbf{x}\) to the norm of the actual \(\mathbf{x}\). It is very close to 1:

true_x = np.array([[1], [-1]])
measured_x = np.array([[1.001], [-0.999]])

la.norm(measured_x) / la.norm(true_x)
np.float64(1.0000004999998748)

But observe now how the error propagates via multiplication with a well-conditioned matrix compared to an ill-conditioned matrix:

la.norm(A @ measured_x) / la.norm(A @ true_x)
np.float64(1.0006003198080637)
la.norm(B @ measured_x) / la.norm(B @ true_x)
np.float64(2827.7208135379897)

For the well-conditioned matrix \(A\), the error in the product is not much larger than the error in \(\mathbf{x}\) itself, but for the ill-conditioned matrix \(B\), the error in the product is dramatically larger.

The condition number of the data matrix \(X\) is often considered when using linear regression: if \(X\) is ill-conditioned we expect a ‘brittle’, unstable model where the model’s predictions change dramatically with only slight changes in input, and measurement error can dramatically impact model performance. Equivalently, a high condition number is indicative of multicollinearity in the data; that is, data where there are correlations or linear dependencies in the data.

A Real-World Example#

Let’s go back to the Student Performance Dataset we looked at previously in the chapter on linear regression. Recall the dataset consists of 10,000 student records, with each record containing information about various predictors and a performance index.

Predictor variables:

  • Hours Studied: The total number of hours spent studying by each student.

  • Previous Scores: The scores obtained by students in previous tests.

  • Extracurricular Activities: Whether the student participates in extracurricular activities (Yes or No).

  • Sleep Hours: The average number of hours of sleep the student had per day.

  • Sample Question Papers Practiced: The number of sample question papers the student practiced.

Target Variable:

  • Performance Index: A measure of the overall performance of each student. The performance index represents the student’s academic performance and has been rounded to the nearest integer. The index ranges from 10 to 100, with higher values indicating better performance.

Let’s look at how well-conditioned the data matrix is. First, we will preprocess the data as we did previously.

import pandas as pd

# read the data
df = pd.read_csv('./datasets/Student_Performance.csv')

# one hot the categorical variables
df = pd.get_dummies(df)

# add the intercept column of 1's
df['Intercept'] = np.ones(shape=(10000,))

# drop the target and the redundant Extracurricular column
df = df.drop(['Extracurricular Activities_No', 'Performance Index'], axis=1)  

# convert to NumPy array
X = df.values.astype(np.float64)
# compute the singular values of X
X_vals = la.svdvals(X)

# compute the condition number of X
print(X_vals.max() / X_vals.min())
451.7820193598499

This is a relatively large condition number, but nowhere near our very ill-conditioned \(2\times 2\) example. We should expect a linear regression model here to exhibit some instability.

Now suppose that we had left the Extracurricular Activities_No column in the dataset. Recall that this column was created as a consequence of one-hot encoding the categorical Extracurricular Activities column in the original dataset, and is perfectly correlated with the column Extracurricular Activities_Yes. Let’s look at the condition number of \(X\) if we leave that column in.

# read the data in again
df = pd.read_csv('./datasets/Student_Performance.csv')

# one hot the categorical variables again
df = pd.get_dummies(df)

# add the intercept column of 1's again
df['Intercept'] = np.ones(shape=(10000,))

# drop the target, but leave in the redundant Extracurricular column
df = df.drop(['Performance Index'], axis=1)  

# convert to NumPy array again
X = df.values.astype(np.float64)

# compute the singular values of X
X_vals = la.svdvals(X)

# compute the condition number of X
print(X_vals.max() / X_vals.min())
2.2021183222635708e+16

Leaving that column in results in a very ill-conditioned matrix.