Lost in Latent Space

If VC dimension predicts generalization, why don’t deep nets fail?

L (learner) and E (expert) discuss.


L: I keep tripping on this. PAC learning says if your hypothesis class has finite VC dimension, you can bound generalization error in terms of VC dim and sample size. And neural nets violate the spirit: in practice they can fit any training set we throw at them, including randomly labeled ones, yet they generalize anyway. Is the theory wrong?

E: Not wrong. Misapplied. What does VC dimension bound, exactly?

L: Generalization error in the realizable case. The gap between training error and test error, with high probability over the training sample.

E: For which hypothesis in the class?

L: The worst one consistent with the training data.

E: Hold that thought. What does SGD return?

L: A particular hypothesis. Not the worst one in the class.

E: Right. So the VC bound is worst-case over the class. SGD picks a specific point in that class. The bound and reality refer to different objects. Where does that leave us?
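An aside, since the quantifier is the whole point. Stated up to constants, the classical bound is a uniform-convergence statement: with probability at least 1 − δ over a sample S of size m, for every h in a class H of VC dimension d,

\[
L(h) \;\le\; \hat{L}_S(h) \;+\; O\!\left(\sqrt{\frac{d\,\log(m/d) + \log(1/\delta)}{m}}\right).
\]

The “for every h” is exactly the worst-case quantification E is pointing at: the bound must hold even for the least generalizable hypothesis in H, not just the one SGD returns.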

L: SGD has implicit bias — it tends toward low-norm solutions, smooth solutions, things like that.

E: Yes. And your effective hypothesis class is “things SGD with this learning rate, this initialization, this batch schedule, applied to this data, will actually return.” That’s much smaller than “all functions representable by this network.”
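The linear case makes this concrete, and it is one of the few settings where the implicit bias is fully understood: gradient descent on underdetermined least squares, initialized at zero, converges to the minimum-ℓ2-norm solution among all interpolators. A minimal sketch, with all sizes illustrative:

```python
import numpy as np

# Underdetermined least squares: more parameters (d) than examples (n),
# so infinitely many weight vectors fit the data exactly.
rng = np.random.default_rng(0)
n, d = 20, 200
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

# Plain gradient descent on squared loss, initialized at zero.
# Step size kept below 2 / lambda_max(X^T X) for stability.
w = np.zeros(d)
lr = 1.0 / np.linalg.norm(X, ord=2) ** 2
for _ in range(5_000):
    w -= lr * X.T @ (X @ w - y)

# The minimum-L2-norm interpolator, computed directly via the pseudoinverse.
w_min_norm = np.linalg.pinv(X) @ y

print("training residual:", np.linalg.norm(X @ w - y))              # ~0: interpolates
print("gap to min-norm solution:", np.linalg.norm(w - w_min_norm))  # ~0
```

Of all the weight vectors that fit the data exactly, gradient descent reliably returns one specific, well-behaved one. That is the “effective hypothesis class” shrinking before your eyes.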

L: OK but that feels like a hack. We’re saying the theory is fine, but only if we constrain “the class” to whatever SGD happens to do. Isn’t that circular?

E: It would be, if we couldn’t characterize what SGD does. The interesting research over the last decade has been characterizing it — flat minima, neural tangent kernel regime, edge of stability, lottery tickets. None of these are complete, but each is a real partial picture.

L: So VC dim isn’t useless, but it’s vastly too pessimistic for this regime?

E: For overparameterized neural nets specifically, yes. There are tighter tools — Rademacher complexity, PAC-Bayes, margin-based bounds — that get closer. The bigger point: think of VC dim as describing capacity in the abstract, and SGD as a biased sampler over that capacity. Generalization comes from the bias, not the capacity.

L: Then what about double descent? My understanding: as you grow the model, test error gets worse as you approach the interpolation threshold, peaks there, and then gets better past it. That’s not what classical theory predicts at all.

E: It’s exactly what classical theory predicts in the underparameterized regime. The classical U-curve is real and visible up to the interpolation threshold. After that, when the model can perfectly fit the training set, the question becomes: which interpolating function did SGD pick?

L: And bigger models give SGD more room to pick a smooth one?

E: That’s the picture. As model size grows past interpolation, the manifold of zero-training-error solutions grows, and within it, SGD’s implicit bias has more room to find a flat, smooth, low-norm one. So test error decreases.
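The whole curve is easy to reproduce in miniature with minimum-norm regression on random ReLU features, sweeping the feature count p across the number of training points n. A hedged sketch, with all sizes, the noise level, and the seed illustrative; exact numbers vary, but the peak near p = n and the second descent past it are robust:

```python
import numpy as np

rng = np.random.default_rng(0)

# Teacher: noisy linear target in 20 raw dimensions.
n_train, n_test, d = 40, 1000, 20
w_star = rng.standard_normal(d)
X_train = rng.standard_normal((n_train, d))
X_test = rng.standard_normal((n_test, d))
y_train = X_train @ w_star + 0.5 * rng.standard_normal(n_train)
y_test = X_test @ w_star

for p in [5, 10, 20, 35, 40, 45, 60, 100, 400, 2000]:
    errs = []
    for _ in range(20):  # average over random feature draws
        W = rng.standard_normal((d, p)) / np.sqrt(d)
        phi_train = np.maximum(X_train @ W, 0.0)  # random ReLU features
        phi_test = np.maximum(X_test @ W, 0.0)
        # pinv gives least squares for p < n and the minimum-norm
        # interpolator for p >= n in one call.
        beta = np.linalg.pinv(phi_train) @ y_train
        errs.append(np.mean((phi_test @ beta - y_test) ** 2))
    print(f"p = {p:5d}   test MSE = {np.mean(errs):8.3f}")
```

Test error falls, spikes as p approaches n_train = 40 (the interpolation threshold, where the feature matrix is barely invertible and the fitted norm blows up), then falls again as p grows: U-curve, peak, second descent.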

L: That sounds magical. Why would more parameters lead to smoother solutions?

E: It’s not magical, but be careful: the honest version is a volume argument about parameter space, not function space. Flat basins of interpolating solutions occupy vastly more parameter-space volume than sharp ones, so random initialization followed by gradient descent is far more likely to land in a flat basin, and flatness in parameter space tends to correspond to smooth, low-complexity functions. SGD is not sampling interpolators uniformly; its dynamics plus the parametrization impose a prior, and that prior happens to favor smooth.

L: So is classical statistical learning theory (SLT) just dead?

E: Far from it. It’s alive in any regime where the model class is fixed and small relative to the data. It governs decision trees, kernel methods with fixed kernel, low-rank logistic regression. In overparameterized deep learning we need a different theoretical regime — and we’re still building it.

L: What’s the most promising line?

E: PAC-Bayes bounds that incorporate the prior implied by SGD plus initialization. Dziugaite and Roy’s 2017 paper on non-vacuous PAC-Bayes bounds for neural nets is the canonical demonstration that this can actually work. Since then there has been steady progress on tighter bounds for specific architectures. These bounds are less universal than classical SLT, but they are non-vacuous, which not long ago seemed out of reach.
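For reference, one standard form of the bound, due to McAllester: fix a prior P over hypotheses before seeing the data; then with probability at least 1 − δ over a sample S of size m, simultaneously for every posterior Q,

\[
\mathbb{E}_{h \sim Q}\!\left[L(h)\right] \;\le\; \mathbb{E}_{h \sim Q}\!\left[\hat{L}_S(h)\right] \;+\; \sqrt{\frac{\mathrm{KL}(Q\,\|\,P) + \ln\!\left(2\sqrt{m}/\delta\right)}{2m}}.
\]

The Dziugaite–Roy move, roughly, is to center P at the random initialization and then optimize a stochastic posterior Q around the trained weights to make the right-hand side small. Because Q is chosen after training, the KL term can be small precisely when SGD did not travel far from where the prior already put its mass.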

L: One more. Where does VC dim still bite us?

E: Adversarial robustness. There the hypothesis class effectively contains nasty things, and the test distribution is allowed to put weight wherever it wants. Capacity-based intuitions reassert themselves. Schmidt et al. 2018 showed that adversarially robust generalization can require substantially more samples than standard generalization, in a very VC-flavored way.
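Their headline separation, in a stylized Gaussian model with input dimension d: standard classification to good accuracy needs O(1) samples, while ℓ∞-robust classification needs on the order of √d. Treat the exact rate as specific to that model, but the shape of the result, sample complexity growing with dimension, is the classical capacity story reasserting itself.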

L: Interesting. So for the same architecture and the same data you can be in two regimes at once: the VC bound is vacuous yet clean test error is fine, while for adversarial test error the capacity-style bounds actually bite.

E: Yes. The structure that makes clean generalization easy doesn’t transfer.

L: So the punchline is: VC dim was right about the worst case, deep learning operates in a structured non-worst case for ordinary generalization, and the theory now is about characterizing that structure rather than abandoning the framework. But the worst case still asserts itself when the test distribution is adversarial.

E: That’s the punchline. And the next question — which I’d push you to ask — is why SGD has the implicit bias it does. We know it does. Whether that’s a property of the optimization landscape, the loss geometry, the architecture, or the initialization is still open.

L: Strong opinion?

E: The most credible current story is that flat minima have higher Bayesian posterior probability under typical priors, and SGD’s noise is doing approximate posterior sampling. Mandt, Hoffman, and Blei (2017) made this concrete. But you’ll find serious people disagreeing — Dinh et al. “Sharp minima can generalize” pushes back on the flatness-causes-generalization story. Sit with the disagreement before adopting any single position.
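The one-dimensional version of the Mandt–Hoffman–Blei picture is easy to simulate. A minimal sketch, with all constants illustrative: constant-step-size SGD on a quadratic loss with additive gradient noise behaves like a sampler whose stationary variance scales with the learning rate.

```python
import numpy as np

rng = np.random.default_rng(0)

# 1-D quadratic loss L(x) = (h / 2) * x^2, with additive gradient noise
# of variance sigma^2 standing in for minibatch noise.
h, sigma, lr = 2.0, 1.0, 0.05

x, samples = 0.0, []
for t in range(200_000):
    grad = h * x + sigma * rng.standard_normal()
    x -= lr * grad
    if t > 10_000:  # discard burn-in
        samples.append(x)

# Exact stationary variance of this linear recursion:
#   V = lr^2 sigma^2 / (1 - (1 - lr * h)^2) = lr sigma^2 / (2h - lr h^2),
# which is approximately lr * sigma^2 / (2h) for small lr.
v_pred = lr * sigma**2 / (2 * h - lr * h**2)
print("empirical variance:", np.var(samples))
print("predicted variance:", v_pred)
```

Instead of settling at the minimum, the iterate hovers in a Gaussian cloud around it, wider where curvature is low. That is the sense in which constant-rate SGD looks like approximate posterior sampling, and one intuition for why its noise should favor flat basins.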

L: That sounds like the part actually worth reading next.

E: It is.

What’s next

Three papers anchor the modern picture: Belkin et al. 2019 (“Reconciling modern machine-learning practice and the classical bias–variance trade-off”) for double descent; Dziugaite & Roy 2017 (“Computing nonvacuous generalization bounds for deep nets”) for PAC-Bayes that actually works; Neyshabur et al. 2017 (“Exploring generalization in deep learning”) for the implicit-bias framing. For a textbook treatment, Telgarsky’s lecture notes on deep learning theory (Illinois, public) are dense and excellent. For the flat-minima debate, read Mandt, Hoffman, Blei (2017) and Dinh et al. (2017) back-to-back.