Problem Set 9: PCA and the QR Algorithm

Problem Set 9: PCA and the QR Algorithm#

Part I: Principal Component Analysis (PCA)#

Problem 1: Explained Variance and Dimensionality Reduction#

A researcher performs PCA on a dataset with 8 features. The eigenvalues of the covariance matrix are:

\[ \lambda_1 = 42.5, \quad \lambda_2 = 18.3, \quad \lambda_3 = 9.7, \quad \lambda_4 = 5.2, \lambda_5 = 2.8, \quad \lambda_6 = 1.5, \quad \lambda_7 = 0.6, \quad \lambda_8 = 0.4 \]

(a) Compute the total variance in the original data.

(b) Calculate the percentage of variance explained by each principal component.

(c) How many principal components are needed to retain at least 90% of the variance? At least 95%?

(d) If you reduce the data to 3 dimensions, what is the reconstruction error as a percentage of total variance?

Problem 2: PCA Implementation and Reconstruction (Coding)#

Generate a synthetic dataset with correlation structure as follows:

import numpy as np
np.random.seed(42)
n_samples = 800
# True latent factors
Z = np.random.randn(n_samples, 2)
# Mixing matrix to create 6 observed features
A = np.array([
    [1.5, 0.5],
    [1.2, -0.8],
    [0.9, 1.1],
    [1.8, 0.3],
    [0.7, -1.2],
    [1.4, 0.9]
])
X = Z @ A.T + 0.5 * np.random.randn(n_samples, 6)

(a) Implement PCA from scratch (center data, compute covariance matrix, find eigenvalues/eigenvectors).

(b) Create a scree plot showing explained variance vs. component number.

(c) Project the data onto the first 2 principal components and visualize in 2D.

(d) Reconstruct the original data using only the first \(k\) components for \(k = 1, 2, 3, 4\). For each \(k\), compute the mean squared reconstruction error:

\[ \text{MSE}_k = \frac{1}{n \cdot d}\|\mathbf{X} - \mathbf{X}_{\text{reconstructed}}\|_F^2 \]

Problem 3: Data Whitening and Mahalanobis Distance (Coding)#

The whitening (or sphering) transformation decorrelates data and scales it to unit variance. Given centered data \(\mathbf{X}\) with eigendecomposition \(\mathbf{C} = \mathbf{V}\mathbf{\Lambda}\mathbf{V}^T\), the whitening transformation is:

\[ \mathbf{Z} = \mathbf{X}\mathbf{V}\mathbf{\Lambda}^{-1/2} \]

(a) Explain why the covariance matrix of \(\mathbf{Z}\) is the identity matrix \(\mathbf{I}\).

(b) Write a Python function whiten(X) that returns the whitened data matrix.

(c) Generate test data with correlations:

mean = [0, 0]
cov = [[4, 2.4], [2.4, 2]]
X = np.random.multivariate_normal(mean, cov, 500)

(d) Apply your whitening function and verify that np.cov(Z.T) is approximately the identity.

(e) Plot both the original and whitened data. What geometric transformation has occurred?

Problem 4: Kernel PCA Concept#

Standard PCA finds linear combinations of features. Sometimes nonlinear structure exists in data.

(a) Explain conceptually why applying PCA to \([\mathbf{X}, \mathbf{X}^2]\) (original features plus their squares) could capture nonlinear patterns.

(b) Given 2D data that lies on a circle, would standard PCA find a useful 1D representation? Why or why not?

(c) If we map data to a higher-dimensional space via \(\phi(\mathbf{x})\) and then apply PCA in that space, we get kernel PCA. The key insight is that we never explicitly compute \(\phi(\mathbf{x})\) but only compute inner products \(K(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i)^T\phi(\mathbf{x}_j)\). For the polynomial kernel \(K(\mathbf{x}, \mathbf{y}) = (1 + \mathbf{x}^T\mathbf{y})^2\), what is the dimensionality of the implicit feature space for 2D input data?

Part II: QR Algorithm for Eigenvalue Computation#

Problem 5: QR Algorithm Convergence Analysis#

Consider the QR algorithm applied to a symmetric matrix with distinct eigenvalues \(\lambda_1 > \lambda_2 > \lambda_3 > 0\).

(a) The QR algorithm generates a sequence \(\mathbf{A}_0, \mathbf{A}_1, \mathbf{A}_2, \ldots\) where:

\[ \mathbf{A}_k = \mathbf{Q}_k\mathbf{R}_k \quad \text{and} \quad \mathbf{A}_{k+1} = \mathbf{R}_k\mathbf{Q}_k \]

Prove that \(\mathbf{A}_{k+1}\) and \(\mathbf{A}_k\) are similar matrices (have the same eigenvalues).

(b) The rate at which off-diagonal elements converge to zero is approximately geometric with rate \(|\lambda_3/\lambda_2|\) for the \((2,3)\) entry and \(|\lambda_2/\lambda_1|\) for the \((1,2)\) entry. If \(\lambda_1 = 10\), \(\lambda_2 = 5\), \(\lambda_3 = 0.5\), which off-diagonal element converges faster?

(c) What pathological case prevents convergence of the basic QR algorithm? (Hint: consider eigenvalues with equal magnitude.)

Problem 6: QR Algorithm Implementation and Monitoring (Coding)#

Implement the basic QR algorithm and track its convergence behavior.

A = np.array([
    [4, 1, 1],
    [1, 3, 0],
    [1, 0, 2]
], dtype=float)

(a) Write a function that performs \(N\) iterations of the QR algorithm, storing \(\mathbf{A}_k\) at each iteration.

(b) For each iteration, compute the off-diagonal norm:

\[ \text{off}(\mathbf{A}_k) = \sqrt{\sum_{i \neq j} A_k[i,j]^2} \]

Plot this quantity versus iteration number on a log scale.

(c) Extract the diagonal of \(\mathbf{A}_{100}\) and compare with eigenvalues from np.linalg.eigvals(A).

(d) Compute the theoretical convergence rate ratios \(|\lambda_i/\lambda_{i-1}|\) and compare with your observed convergence plot.

Problem 7: Shifted QR Algorithm (Coding)#

The shifted QR algorithm dramatically improves convergence by incorporating a shift parameter \(\mu_k\) at each iteration. Here’s how it works:

Standard QR Iteration:

Factor: \(\mathbf{A}_k = \mathbf{Q}_k\mathbf{R}_k\)
Recombine: \(\mathbf{A}_{k+1} = \mathbf{R}_k\mathbf{Q}_k\)

Shifted QR Iteration:

Choose shift: \(\mu_k\) (various strategies exist)
Factor the shifted matrix: \(\mathbf{A}_k - \mu_k\mathbf{I} = \mathbf{Q}_k\mathbf{R}_k\)
Recombine and add shift back: \(\mathbf{A}_{k+1} = \mathbf{R}_k\mathbf{Q}_k + \mu_k\mathbf{I}\)

Notice that \(\mathbf{A}_{k+1} = \mathbf{Q}_k^T\mathbf{A}_k\mathbf{Q}_k\), so similarity is preserved.

Common shift strategies:

Rayleigh quotient shift: \(\mu_k = \mathbf{A}_k[n,n]\) (bottom-right entry)
Wilkinson shift: More sophisticated, based on the bottom-right \(2 \times 2\) block

(a) Prove that the shifted iteration preserves similarity: show that \(\mathbf{A}_{k+1}\) and \(\mathbf{A}_k\) have the same eigenvalues.

(b) Implement both unshifted and Rayleigh-quotient-shifted QR algorithms.

(c) Test on the matrix:

B = np.array([
    [5, 2, 1, 0],
    [2, 6, 2, 1],
    [1, 2, 7, 2],
    [0, 1, 2, 8]
], dtype=float)

(d) For both algorithms, plot off-diagonal norm vs. iteration (run 50 iterations). How much faster does the shifted version converge?

Problem 8: Hessenberg Reduction Complexity#

Before applying the QR algorithm in practice, matrices are reduced to Hessenberg form: upper triangular plus one subdiagonal.

\[\begin{split}\mathbf{H} = \begin{bmatrix} * & * & * & * \\ * & * & * & * \\ 0 & * & * & * \\ 0 & 0 & * & * \end{bmatrix}\end{split}\]

(a) Using Householder reflections, reducing an \(n \times n\) matrix to Hessenberg form requires \(O(n^3)\) operations. How many flops does one QR iteration require for:

A general \(n \times n\) matrix?
An \(n \times n\) Hessenberg matrix?

(Hint: QR factorization of Hessenberg matrices can be done in \(O(n^2)\) instead of \(O(n^3)\).)

(b) If you need to compute all eigenvalues and the QR algorithm converges in \(k\) iterations, what is the total complexity of:

Direct QR on the original matrix: \(O(kn^3)\)
Hessenberg reduction + QR on Hessenberg form: \(O(n^3 + kn^2)\)

(c) For \(n = 1000\) and \(k = 100\), estimate the speedup factor.

Problem 9: The Implicit Q Theorem#

The Implicit Q Theorem is foundational to understanding modern QR implementations.

Theorem: Let \(\mathbf{A}\) be an \(n \times n\) matrix, and suppose \(\mathbf{Q} = [\mathbf{q}_1, \ldots, \mathbf{q}_n]\) and \(\mathbf{Z} = [\mathbf{z}_1, \ldots, \mathbf{z}_n]\) are orthogonal matrices such that:

Both \(\mathbf{Q}^T\mathbf{A}\mathbf{Q}\) and \(\mathbf{Z}^T\mathbf{A}\mathbf{Z}\) are in Hessenberg form
\(\mathbf{q}_1 = \mathbf{z}_1\) (same first column)

Then \(\mathbf{Q} = \mathbf{Z}\) up to signs of columns.

(a) Explain in your own words why this theorem matters for the QR algorithm. (Hint: it allows “implicit” computation of the QR factorization.)

(b) In the shifted QR algorithm with shift \(\mu\), the first column of \(\mathbf{Q}\) in \((\mathbf{A} - \mu\mathbf{I}) = \mathbf{Q}\mathbf{R}\) is determined by \(\mathbf{A}\) and \(\mu\). How does the implicit Q theorem allow us to perform a shifted QR step without explicitly forming \(\mathbf{A} - \mu\mathbf{I}\)?

(c) Why is Hessenberg form “essentially unique” given the first column of the similarity transformation?

Problem 10: Full Eigendecomposition via QR (Coding)#

The QR algorithm can compute both eigenvalues AND eigenvectors by tracking the accumulated orthogonal transformations.

Algorithm:

Set Q_total = I (identity matrix)
Set A_current = A
For k = 1 to max_iterations:
    Compute A_current = Q_k * R_k
    Set A_current = R_k * Q_k
    Update Q_total = Q_total * Q_k

At convergence, the columns of \(\mathbf{Q}_{\text{total}}\) are approximate eigenvectors.

(a) Implement this algorithm with the Rayleigh quotient shift.

(b) Test on the matrix:

C = np.array([
    [6, 2, 1],
    [2, 3, 1],
    [1, 1, 1]
], dtype=float)

(c) After convergence:

Extract eigenvalues from the diagonal of the final \(\mathbf{A}_k\)
Extract eigenvectors as columns of \(\mathbf{Q}_{\text{total}}\)
Verify that \(\mathbf{A}\mathbf{v}_i \approx \lambda_i \mathbf{v}_i\) for each eigenpair

(d) Compare your results with scipy.linalg.eigh(C). Compute the error in eigenvalues and the error in eigenvector directions (use the angle between vectors).

(e) How many iterations were required for the off-diagonal norm to drop below \(10^{-10}\)?