Seminar: Signal Processing and Systems
On Principal Component Regression in High Dimension
Principal component regression (PCR) is a classical two-step approach to linear regression: one first reduces the dimension of the data by projecting it onto its leading principal components, and then performs ordinary least squares regression on the projected data. We study PCR in an asymptotic high-dimensional regression setting, where the number of data points is proportional to the dimension. Our main results are asymptotically exact limiting formulas for the estimation and prediction risks, which depend in a nuanced way on the eigenvalues of the population covariance, the alignment between the population principal components and the true signal, and the number of selected components.
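For concreteness, here is a minimal NumPy sketch of the two-step procedure described above; the function and variable names are ours, not the speakers'.

```python
import numpy as np

def pcr(X, y, k):
    """Principal component regression: project onto the k leading
    sample principal components, then run ordinary least squares."""
    # Step 1: dimension reduction via the leading principal components.
    Xc = X - X.mean(axis=0)                        # center the data
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    Vk = Vt[:k].T                                  # p x k leading PC directions
    Z = Xc @ Vk                                    # reduced design, n x k
    # Step 2: OLS in the reduced coordinates.
    gamma, *_ = np.linalg.lstsq(Z, y - y.mean(), rcond=None)
    return Vk @ gamma                              # coefficients in original coordinates

# Toy usage in the proportional regime (n comparable to p).
rng = np.random.default_rng(0)
n, p, k = 600, 300, 20
X = rng.standard_normal((n, p))
beta_true = rng.standard_normal(p) / np.sqrt(p)
y = X @ beta_true + rng.standard_normal(n)
beta_hat = pcr(X, y, k)
print(np.linalg.norm(beta_hat - beta_true))        # estimation error on one draw
```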
A key challenge in the high-dimensional regime is that the sample covariance matrix is an inconsistent estimate of the population covariance. The analysis of (random design) linear regression in high dimension typically builds on powerful results from random matrix theory, such as the Marchenko–Pastur law and deterministic equivalents for the resolvent of a sample covariance matrix. However, these standard tools alone are not sufficient for analyzing the prediction risk of PCR. To that end, we leverage and develop somewhat less standard techniques, which, to our knowledge, have not seen wide use in the statistics literature to date: multi-resolvent traces and their associated eigenvector overlap measures (sketched below).

Based on joint work with Alden Green (Stanford).
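As a notational sketch of the objects mentioned above, in standard conventions that need not match the speakers' exact definitions:

```latex
% Resolvent of the sample covariance \widehat{\Sigma}, and its normalized
% trace, which converges to a Stieltjes transform m(z) (the Marchenko--Pastur
% law when the population covariance is the identity):
\[
  G(z) = \bigl(\widehat{\Sigma} - z I\bigr)^{-1}, \qquad
  \frac{1}{p} \operatorname{Tr} G(z) \;\to\; m(z).
\]
% A multi-resolvent trace couples resolvents at distinct spectral arguments
% through deterministic matrices A, B:
\[
  \frac{1}{p} \operatorname{Tr}\bigl[ G(z_1)\, A\, G(z_2)\, B \bigr]
  = \frac{1}{p} \sum_{i,j}
    \frac{\bigl(\widehat{u}_i^{\top} A\, \widehat{u}_j\bigr)
          \bigl(\widehat{u}_j^{\top} B\, \widehat{u}_i\bigr)}
         {(\widehat{\lambda}_i - z_1)(\widehat{\lambda}_j - z_2)},
\]
% where (\widehat{\lambda}_i, \widehat{u}_i) are the eigenpairs of
% \widehat{\Sigma}. Taking A = B = u u^{\top} for a population direction u
% exposes the eigenvector overlaps |\langle \widehat{u}_i, u \rangle|^2,
% the alignment quantities relevant to PCR.
```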