6/25/2026 at 2:52:20 PM
A related viewpoint is that overparametrization is good because the model is stranded when the Hessian has all positive/zero eigenvalues. If we treat the probability that a particular Hessian eigenvalue turns positive as a Bernoulli process, the chance of all eigenvalues going positive/zero exponentially decreases as the parameter count increases.
by cherryteastain
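To put rough numbers on the Bernoulli argument above, here is a minimal sketch. It assumes each eigenvalue is non-negative independently with the same probability p, which is a simplification (real Hessian eigenvalues are correlated); the function name and the choice p = 0.5 are illustrative, not taken from the comment.

```python
def prob_all_nonnegative(p: float, n: int) -> float:
    """Chance that all n eigenvalues are non-negative, assuming each one
    is non-negative independently with probability p (an assumption)."""
    return p ** n


# Illustrative values only: p = 0.5 is an arbitrary choice.
for n in (1, 10, 100, 1000):
    print(n, prob_all_nonnegative(0.5, n))
```

Under that independence assumption the probability shrinks geometrically with n, so every added parameter multiplies the chance of being stranded by another factor of p.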
6/25/2026 at 3:04:11 PM
You don't need billions of parameters for that, precisely because the risk of being stuck at a local minimum decreases exponentially with the number of parameters. Right?
by david-gpu
6/26/2026 at 11:53:47 AM
I think of it like this. Imagine a network with two inputs and one output. What's happening during training is to orient a set of 2d planes in 3d space. Then for each x and y coordinate you can iterate through those planes and figure out where the vertical line through (x,y) hits each of them on the z axis, take the highest of those results and that's your network output. Sometimes the solution needs one of the planes to make a big change to its orientation, and in the process of doing so the fitness will go down. Now if some other plane happened, by luck, to be in a better position to get to that state without lowering the fitness, then it will generally dominate over the other one, and the first one would start to "evolve" to be suppressed by the other one. The first one then becomes (usually, but not always) pretty much dead. Since the network has a fixed number of weights, it can't decide to just "add another plane". It has to make do with a fixed number of them. If too many of them become useless, your training can't really recover: you're stuck at that minimum. The overparametrization allows for a bigger chance that some other plane's orientation will happen to be in a position to shadow counterproductive ones.
by vrighter
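The "highest plane wins" picture above can be sketched in a few lines. This is only an illustration of that mental model, not a real training setup: each plane is assumed to be stored as (a, b, c) for z = a*x + b*y + c, and the names max_of_planes and active_planes are hypothetical.

```python
def max_of_planes(x, y, planes):
    """Return (output, index of winning plane) at the point (x, y).

    Each plane is a tuple (a, b, c) meaning z = a*x + b*y + c.
    """
    heights = [a * x + b * y + c for a, b, c in planes]
    winner = max(range(len(heights)), key=heights.__getitem__)
    return heights[winner], winner


def active_planes(planes, points):
    """Indices of planes that are the highest one at some sampled point."""
    return {max_of_planes(x, y, planes)[1] for x, y in points}


# Three example planes; the third sits far below the others.
planes = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, -5.0)]
grid = [(x, y) for x in range(-2, 3) for y in range(-2, 3)]

print(max_of_planes(2.0, 1.0, planes))
print(active_planes(planes, grid))
```

On this grid the third plane never ends up on top, which is the "pretty much dead" case: it takes up one of the fixed number of planes without contributing to the output.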