New Preprint: Low Rank is enough for the NTK
I’m really proud to finally share this work! It’s been a long journey and represents a lot of work—probably more than I initially thought it would take, but I’m super happy with where we ended up. This is about Low-Rank Neural Networks and the Neural Tangent Kernel, which I worked on with Haizhao Yang and Shijun Zhang.
The whole thing started because we were curious about this class of neural architectures that somehow manage to combine the best of both worlds: they have the expressive power of deep networks but with way better computational efficiency. The cool part? They achieve kernel regime behavior with linear parameter scaling instead of the quadratic scaling you normally need. That’s a pretty big deal!
What are we even talking about here?
So, Multi-Component and Multi-Layer Neural Networks (MMNNs), or Random Feature and Low-Rank models (RF-LR) as some people call them, are networks that have been showing up a lot in scientific machine learning lately. They're kind of amazing: they give you the expressive power of deep networks, they can fit high-frequency functions without strong a priori assumptions about the target, and they're computationally efficient—like, linear-in-width efficient.
The trick is the low-rank bottlenecks at each layer. Imagine a network that alternates between wide hidden layers (width $N$) and narrow bottleneck layers (rank $r \ll N$). Here's the clever part: only the output weights get trained, while the random-feature weights stay frozen, acting as random ReLU features. This way you keep all the low-rank structural advantages without needing to mess around with Riemannian gradient descent.
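To make that concrete, here's a minimal sketch of what one such block could look like in PyTorch. This is my own illustration for this post, not the exact parameterization from the paper (the class name `RFLowRankBlock` and all the sizes are made up), but it shows the key move: the wide random ReLU layer is frozen, and only the narrow rank-$r$ output map ever sees gradients.

```python
import torch
import torch.nn as nn

class RFLowRankBlock(nn.Module):
    """One wide-then-narrow block: frozen random ReLU features (width N)
    followed by a trainable low-rank output map (rank r << N).
    A hypothetical sketch, not the paper's exact parameterization."""

    def __init__(self, in_dim: int, width_N: int, rank_r: int):
        super().__init__()
        # Frozen random features: these weights and biases are never trained.
        self.random_features = nn.Linear(in_dim, width_N)
        for p in self.random_features.parameters():
            p.requires_grad_(False)
        # Only this narrow output map is trainable: O(r * N) parameters.
        self.output_map = nn.Linear(width_N, rank_r, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.output_map(torch.relu(self.random_features(x)))

# Stacking blocks gives the alternating wide (N) / narrow (r) pattern.
net = nn.Sequential(
    RFLowRankBlock(in_dim=2, width_N=512, rank_r=8),
    RFLowRankBlock(in_dim=8, width_N=512, rank_r=8),
    nn.Linear(8, 1),
)
trainable = sum(p.numel() for p in net.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable}")
```

The trainable parameter count here scales like $O(rN)$ per block rather than the $O(N^2)$ you'd pay for a dense wide-to-wide layer.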
People have been seeing these networks do really well empirically—they achieve strong global convergence on all sorts of tasks, including neural-network-based PDE solving (with and without physics-informed losses). Training shows this wild super convergence with first-order optimization, and the loss follows distinctive stepwise dynamics: long plateaus followed by sharp drops. It's like the loss landscape is fundamentally different from what you see in fully-trainable networks.
The question that kept us up at night
But here’s the thing—despite all these empirical successes, we didn’t really understand why it worked. The convergence properties were just… weird. Training would show this staircase-like behavior and somehow find global minima even when normal neural network training would completely fail. Clearly something was different about the landscape structure of RF-LR compared to fully-trainable networks.
So we asked ourselves: How do low-rank architectures preserve expressivity while simplifying NTK analysis?
This connects to the whole curse-of-dimensionality problem in deep learning optimization. One thing we found: the curse of dimensionality that shows up as the NTK's exponential-in-dimension spectral decay? Low-rank structure doesn't actually fix that. But by working with Fisher and Kibble distributions and manipulating non-integer Puiseux series expansions of the correlation kernels, we managed to pin down the RKHS structure induced by the frozen random weights. That was a lot of work, let me tell you.
What we actually figured out
After way too many late nights and way too much coffee, we ended up with three main theoretical results:
Expressivity: Same RKHS, way fewer parameters
In the NTK regime, RF-LR gives you the same reproducing kernel Hilbert space (RKHS) as the shallow ReLU kernel. This blew my mind when we first proved it—depth doesn’t enlarge the RKHS, but you still preserve full expressivity while getting kernel behavior at a linear parameter budget $O(rN)$ instead of the quadratic $O(N^2)$ you’d need with dense layers.
What this means practically: you get the same theoretical guarantees for optimization convergence and generalization as fully-trainable networks, but with way fewer parameters. The rank-$r$ bottleneck acts like a spectral filter, and its NTK-RKHS has the same decay rate as an MLP’s but uses $O(rN)$ instead of $O(N^2)$ parameters. Pretty neat, right?
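For reference, the shallow ReLU NTK that pins down this shared RKHS has a well-known closed form on normalized inputs. Writing $\rho$ for the cosine between two inputs, and up to normalization conventions (this is the textbook expression, with $\kappa_0, \kappa_1$ being my notation here rather than the paper's), it reads

\[
\Theta^{\mathrm{shallow}}(\rho) = \rho\,\kappa_0(\rho) + \kappa_1(\rho), \qquad
\kappa_0(\rho) = \frac{\pi - \arccos\rho}{\pi}, \qquad
\kappa_1(\rho) = \frac{\sqrt{1-\rho^2} + \rho\,(\pi - \arccos\rho)}{\pi}.
\]

The expressivity result says the RF-LR NTK generates the same function space as this kernel, just reached with far fewer trainable parameters.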
Concentration: Things actually work predictably
The empirical two-layer NTK concentrates around its deterministic limit $K_\infty(\rho)$ with sub-Gaussian tails in the rank $r$. We got this by combining Fisher–Kibble decoupling with Hanson–Wright concentration inequalities. There’s also this first-order $1/r$ cancellation mechanism we identified that reduces fluctuations even more in structured settings.
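To give the statement a concrete shape: a sub-Gaussian tail in the rank means a bound of roughly the following template, where $\widehat{\Theta}^{(2)}$ is the empirical two-layer NTK and $c$ is just a placeholder constant (the actual theorem pins down the constants, the admissible range of $t$, and the dependence on width):

\[
\mathbb{P}\left( \bigl| \widehat{\Theta}^{(2)}(\rho) - K_\infty(\rho) \bigr| \ge t \right) \le 2\exp\left(-c\, r\, t^2\right).
\]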
Why does this matter? Because it means even at moderate widths, the kernel behavior is predictable and stable. That’s exactly what you need for reliable optimization guarantees. No more guessing games!
Spectrum: Clean spike–bulk structure
At initialization, the NTK Gram matrix has a really clean spike–bulk structure: a rank-one outlier $\lambda_{\text{spike}} \approx n K_\infty(0)$ and a deformed Marchenko–Pastur bulk. The bulk gets shifted away from zero thanks to the low-rank bottlenecks, which creates a fundamentally different spectral structure compared to fully-trainable networks.
The entrywise fluctuations follow a joint central limit theorem with variance split into angular (Fisher) and radial (Kibble) components. This decomposition gives us a principled way to understand how the kernel matrix behaves under random initialization. It’s one of those things that seems obvious in hindsight but took us forever to figure out.
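If you want to see the spike–bulk picture for yourself, here's a toy numerical sketch. It uses a plain random ReLU feature Gram matrix rather than the full RF-LR NTK from the paper, and all the sizes are arbitrary, so treat it as an illustration of the phenomenon rather than a reproduction of our results:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, N = 500, 50, 2000                        # samples, input dimension, width

X = rng.standard_normal((n, d)) / np.sqrt(d)   # inputs with norm roughly 1
W = rng.standard_normal((N, d))                # frozen random first-layer weights
Phi = np.maximum(X @ W.T, 0.0)                 # random ReLU features
G = Phi @ Phi.T / N                            # empirical feature Gram matrix

eigs = np.sort(np.linalg.eigvalsh(G))[::-1]
print("top eigenvalue (the spike):", round(eigs[0], 2))
print("next few eigenvalues (bulk edge):", np.round(eigs[1:5], 3))
```

The single large eigenvalue grows like $n$ times the typical off-diagonal kernel value, which is the same mechanism behind $\lambda_{\text{spike}} \approx n K_\infty(0)$; the remaining eigenvalues form the bulk.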
The technical stuff (that I’m actually proud of)
One thing I’m particularly happy about is that we managed to avoid computing correlation diagrams altogether. Instead, we exploited the independence structure of frozen random features and Fisher–Kibble decoupling. This simplification means future generalizations can probably proceed with fewer assumptions while still maintaining analytical control.
We also derived an explicit NTK recursion for arbitrary depth and a closed-form expansion with $2L$ terms, which clarifies how base and derivative kernels couple across layers. This gives us a tractable starting point for systematic finite-width corrections, organized around the $1/r$ expansion:
\[
\tilde{\Theta}^{(L)}(\rho) = K_\infty(\rho) + \sum_{\ell=2}^{L} O(r^{-\ell})
\]

Each $O(r^{-\ell})$ term aggregates contributions from layer pairs, coming from cross-layer interactions that are kind of reminiscent of ResNet-style additive path contributions. It's elegant when you see it all come together.
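For readers who want a reference point: our recursion plays the role that the classical fully-trained NTK recursion plays for dense MLPs. In that standard setting (Jacot-style, quoted here only for comparison, not as our result), with $\Sigma^{(\ell)}$ the layer-$\ell$ Gaussian process kernel and $\dot{\Sigma}^{(\ell)}$ its derivative kernel, the recursion reads

\[
\Theta^{(1)}(x,x') = \Sigma^{(1)}(x,x'), \qquad
\Theta^{(\ell+1)}(x,x') = \Sigma^{(\ell+1)}(x,x') + \Theta^{(\ell)}(x,x')\,\dot{\Sigma}^{(\ell+1)}(x,x').
\]

The RF-LR recursion in the paper is the analogue of this for the case where only the output weights are trained and low-rank bottlenecks sit between the wide layers.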
So what does this actually mean?
Let me break it down in plain terms:
- Linear scaling: You can get kernel regime behavior with $O(rN)$ parameters instead of $O(N^2)$. For a network with width $N=1000$ and rank $r=10$, that's roughly $10,000$ parameters instead of $1,000,000$—a 100x reduction. That's huge!
- Same expressivity: Even with way fewer parameters, you don't lose expressivity. The RKHS is identical to that of fully-trainable networks.
- Predictable behavior: The exponential concentration means the kernel behavior is stable and predictable even at moderate widths, so you can actually use this in real applications.
- Clean spectral structure: The spike–bulk decomposition gives us a principled framework for understanding optimization dynamics and generalization properties.
Where this could go
This work opens up a bunch of exciting directions:
Finite-width corrections: The $1/r$ expansion framework gives us a natural starting point for systematically figuring out the finite-width loss landscape. If we could precisely enumerate the expansion coefficients and maybe find closed-form resummations, we’d get explicit finite-width formulas.
Extension to general depth: We focused on the three-layer case because, honestly, that’s what we could handle. Extending to arbitrary depth $L$ while keeping explicit formulas is still an open challenge. The hard part is tracking how random correlations propagate through layers while maintaining exponential concentration rates.
Beyond the lazy regime: Right now we’re focusing on the lazy/NTK regime where things are analytically tractable. Understanding how feature learning emerges from deviations from Gaussianity would give us complementary insights beyond the kernel regime.
Experimental validation: We’d love to see a comprehensive experimental program that confirms lazy-kernel predictions, does systematic width sweeps to test the $O(1/n) + O(1/r)$ finite-width corrections, and provides detailed comparisons with MLPs at matched parameter counts.
Unified theory: I’m hopeful that future work will build toward a complete unified theory, kind of like tensor programs but specifically for low-rank random feature architectures. That would give us systematic computational rules for NTK analysis across different low-rank regimes, arbitrary layer depths, and heterogeneous width configurations.
Wrapping up
This work establishes an NTK theory for low-rank random feature architectures that’s both expressive and computationally favorable. RF-LR gives us a clean route to the kernel regime: same RKHS as dense networks, linear parameter budgets $O(rN)$, sub-Gaussian concentration in $r$, and a predictive spike–bulk spectrum.
These results give us a concrete, practically usable theory for RF-LR networks in the lazy/NTK regime. The combination of theoretical tractability and practical efficiency makes low-rank random features a really nice testbed for developing comprehensive theories of neural network behavior.
I’m particularly excited about the potential for this framework to enable new applications in scientific machine learning. Being able to fit high-frequency functions with linear parameter scaling could be transformative. The theoretical foundation we’ve built opens the door to principled architecture design, hyperparameter tuning, and a deeper understanding of generalization in neural networks.
This has been a lot of work, and I’m genuinely proud of what we’ve accomplished. There were definitely moments where I wasn’t sure we’d get here, but seeing it all come together makes it worth it.
Paper details: The full paper is available as a working preprint. You can find it in my publications or directly access the PDF: Low Rank is enough for the MLP Neural Tangent Kernel.
For the complete theoretical development, including all proofs, spectral characterizations, and technical details, check out the full manuscript. We’re currently preparing it for submission to ICML 2026.