New Preprint: Low Rank is enough for the MLP Neural Tangent Kernel
In the lazy regime, training deep networks reduces to kernel regression, and the NTK spectrum controls convergence and stability. Low-rank random-feature (RF-LR) architectures freeze random feature maps and train only narrow readouts of dimension $r \ll N$ per layer—reducing parameters from $O(LN^2)$ to $O(LrN)$ while preserving kernel behavior.
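The layer parameterization isn't spelled out in this summary, so below is a minimal JAX sketch of one plausible RF-LR layer under that description: a frozen random feature map of width $N$, a ReLU, and a trainable readout of dimension $r$. The names `init_rf_lr` and `forward` are illustrative, not from the paper; the only point is to make the $O(LrN)$ trainable-parameter count concrete.

```python
# Hypothetical RF-LR model in NTK-style parameterization: frozen random feature
# maps W_l (never trained) of width N, trainable narrow readouts V_l of dimension r.
import jax
import jax.numpy as jnp

def init_rf_lr(key, d, N, r, L):
    """Sample frozen feature maps W_l and trainable readouts V_l (standard normal entries)."""
    widths_in = [d] + [r] * (L - 1)
    keys = jax.random.split(key, 2 * L)
    frozen  = [jax.random.normal(keys[2 * l],     (N, widths_in[l])) for l in range(L)]  # W_l: N x fan_in
    readout = [jax.random.normal(keys[2 * l + 1], (r, N))            for l in range(L)]  # V_l: r x N
    return frozen, readout

def forward(readout, frozen, x):
    """Trainable readouts come first so we can differentiate with respect to them; x is a (d,) input."""
    h = x
    for W, V in zip(frozen, readout):
        u = jax.nn.relu(W @ h / jnp.sqrt(h.shape[0]))   # frozen expansion to width N
        h = V @ u / jnp.sqrt(u.shape[0])                # trainable projection down to dimension r
    return h

d, N, r, L = 16, 512, 8, 3
frozen, readout = init_rf_lr(jax.random.PRNGKey(0), d, N, r, L)
print(forward(readout, frozen, jnp.ones(d)).shape)      # (r,)
print(sum(V.size for V in readout))                     # L * r * N trainable entries, vs ~ L * N^2 dense
```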
What we prove. We take the sequential infinite-width limit $N \to \infty$, then analyze how the remaining randomness concentrates as $r \to \infty$. Main results:
- Explicit NTK recursion with a visible $1/r$ factor at each bottleneck layer, and a closed-form expansion at any depth.
- Sharp depth scaling for the deterministic proxy: correlations align as $O(k^{-2})$ with depth $k$ (the same rate as MLPs at the edge of chaos), the kernel magnitude saturates, the diagonal–off-diagonal gap is $\asymp 1/(rk)$, and $\kappa \ge \Omega(rL)$ in general; for equicorrelated or high-dimensional spherical data, $\kappa_\perp = 1$ or $1+o(1)$.
- RKHS preservation (three layers): the mean RF-LR kernel induces the same RKHS as the shallow ReLU kernel, so low rank does not shrink the function class.
- Proxy–empirical concentration: a rigorous bound for equicorrelated data, and $1/\sqrt{r}$ sub-Gaussian concentration of the empirical kernel around the proxy (see the numerical sketch after this list).
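To make these objects concrete, here is a rough numerical sketch reusing the illustrative `init_rf_lr`/`forward` from above (so not the paper's exact construction): it computes the empirical NTK with respect to the trainable readouts only, and averages it over independent draws of the frozen features as a crude Monte Carlo stand-in for the deterministic proxy. The spread across draws gives a finite-$N$ impression of the fluctuation that the paper controls at rate $1/\sqrt{r}$ in the sequential limit.

```python
# Empirical NTK of the hypothetical RF-LR sketch above, taken with respect to the
# trainable readouts only; the average over independent frozen-feature draws plays
# the role of the deterministic proxy, and the spread across draws is the
# fluctuation that the concentration result controls.
from jax.flatten_util import ravel_pytree

def empirical_ntk(readout, frozen, x1, x2):
    """Theta(x1, x2) = <df(x1)/dtheta, df(x2)/dtheta> with theta = the trainable readouts."""
    # Fixed scalar head: sum of the r output coordinates, normalized by sqrt(r).
    f = lambda p, x: jnp.sum(forward(p, frozen, x)) / jnp.sqrt(p[-1].shape[0])
    g1, _ = ravel_pytree(jax.grad(f)(readout, x1))
    g2, _ = ravel_pytree(jax.grad(f)(readout, x2))
    return g1 @ g2

d, N, L = 16, 512, 3
x1 = jax.random.normal(jax.random.PRNGKey(1), (d,)) / jnp.sqrt(d)
x2 = jax.random.normal(jax.random.PRNGKey(2), (d,)) / jnp.sqrt(d)

for r in (4, 16, 64):
    draws = []
    for trial in range(20):  # independent frozen feature maps and readout initializations
        frozen, readout = init_rf_lr(jax.random.PRNGKey(1000 * r + trial), d, N, r, L)
        draws.append(empirical_ntk(readout, frozen, x1, x2))
    draws = jnp.array(draws)
    print(r, float(draws.mean()), float(draws.std()))  # mean ~ proxy estimate, std ~ fluctuation scale
```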
Under a fixed budget $O(LrN)$, depth and rank trade off, and from a conditioning perspective they commute. Numerical experiments (in the appendix) confirm the depth scaling, the conditioning bounds, and the proxy–empirical concentration.
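As a back-of-the-envelope illustration of that trade-off (the budget and width below are arbitrary, chosen only for the arithmetic): at a fixed trainable-parameter budget $P \approx LrN$, doubling the depth halves the affordable rank.

```python
# Depth-rank trade-off at a fixed trainable-parameter budget P ~ L * r * N
# for the hypothetical RF-LR sketch; P and N here are arbitrary example values.
N, P = 512, 49_152                    # e.g. P = 96 * N
for L in (2, 3, 4, 6):
    r = P // (L * N)                  # rank affordable at depth L
    print(L, r, L * r * N)            # (depth, rank, parameter count) -- count stays at P
```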
Paper: Low Rank is enough for the MLP Neural Tangent Kernel. Joint work with Haizhao Yang and Shijun Zhang. Submitted.