4/22/2025 at 7:01:42 PM
In another post (https://alexshtf.github.io/2025/03/27/Free-Poly.html) the author fits a degree-10000 (ten thousand!) polynomial using the Legendre basis. The polynomial _doesn't overfit_, demonstrating double descent. "What happened to our overfitting from ML 101 textbooks? There is no regularization. No control of the degree. But “magically” our high degree polynomial is not that bad!"
So... are _all_ introductions to machine learning just extremely wrong here? I feel like I've seen dozens of reputable books and courses that introduce overfitting and generalization via the severe overfitting and terrible generalization of high-degree polynomials in the usual basis (1, x, x^2, ...). Seemingly everyone warns of the dangers of high-degree polynomials, yet here the author just says "use another basis" and proves everyone wrong? Mind blown, or is there a catch?
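If I understand the post correctly, the setup is roughly the following (my own sketch, not the author's code; the sample size, noise level, degree, and the plain minimum-norm least-squares fit are my assumptions):

    import numpy as np

    rng = np.random.default_rng(0)
    n, degree = 50, 1000                     # far more coefficients than data points
    x = np.sort(rng.uniform(0, 1, n))
    y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, n)

    # Legendre features; legvander expects inputs in [-1, 1], so map [0, 1] there
    V = np.polynomial.legendre.legvander(2 * x - 1, degree)

    # Underdetermined least squares: lstsq returns the minimum-norm solution,
    # which is commonly credited for the benign (non-spiky) behaviour
    coef, *_ = np.linalg.lstsq(V, y, rcond=None)

    xg = np.linspace(0, 1, 500)
    fit = np.polynomial.legendre.legval(2 * xg - 1, coef)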
by ForceBru
4/22/2025 at 11:50:13 PM
The catch is that orthogonal polynomial bases (like Legendre) implicitly regularize by controlling the Riesz representer norm, effectively implementing a form of spectral filtering that penalizes high-frequency components.
by ethan_smith
4/22/2025 at 9:22:32 PM
https://arxiv.org/pdf/2503.02113
This paper shows that polynomials exhibit most of the features of deep neural nets, including double descent and the ability to memorize the entire dataset.
It connects the dots: the polynomials there are regularized to be as simple as possible, and the author argues that the hundreds of billions of parameters in modern neural networks act as regularizers too, attenuating decisions that are "too risky."
I really enjoyed that paper, a gem that sheds light everywhere.
by thesz
4/22/2025 at 8:50:47 PM
No. All the textbooks know that polynomials of high degree are numerically dangerous and that you need to be careful when handling them.
The article's examples only work because the interval 0 to 1 (or -1 to 1) was chosen. For whatever reason the author does not point that out or even acknowledge the fact that had he chosen a larger interval, the limitations of floating point arithmetic would have ruined the argument he was trying to make.
10^100 is a very large number and numerically difficult to handle. For whatever reason the author pretends this is not a valid reason to be cautious about high degree polynomials.
by constantcrying
4/22/2025 at 9:53:45 PM
He seems reasonably explicit about this:
"This means that when using polynomial features, the data must be normalized to lie in an interval. It can be done using min-max scaling, computing empirical quantiles, or passing the feature through a sigmoid. But we should avoid the use of polynomials on raw un-normalized features."
by heisenzombie
4/22/2025 at 10:22:07 PM
No. This paragraph has nothing to do with numerics. It is about the fact that continuous functions cannot be approximated globally by polynomials, so you need to restrict to intervals for reasons of mathematical theory. This is totally unrelated to the numerical issues, which are nowhere even acknowledged.
by constantcrying
4/23/2025 at 1:02:41 AM
But what's the point in acknowledging numerical issues outside of [-1,1] if polynomials do not even work there, as the author explicitly notes?
by vient
4/23/2025 at 6:05:26 AM
All polynomials "work" globally. That some polynomials form an orthonormal basis over certain intervals is essentially irrelevant.
The author does not address the single most important reason why high-degree polynomials are dangerous. Which is pretty insane, to be honest: you have to at least mention why people tell you to be cautious about high degree polynomials AND point out why your examples circumvent the problem. Anything else is extremely dishonest and misleading.
by constantcrying
4/23/2025 at 2:17:46 AM
Is there a mathematical proof or basis to back what you're saying?
by sheepscreek
4/23/2025 at 6:06:51 AM
That polynomials do not approximate continuous functions globally?? That is pretty obvious. Consider that every non-constant polynomial grows arbitrarily large, but not every continuous function does.
by constantcrying
4/23/2025 at 8:14:23 PM
There is a simple corollary to Stone-Weierstrass that extends to infinite intervals, but requires the use of rational functions.
by gthompson512
4/23/2025 at 9:01:36 AM
Actually they aren't. You never compute high powers of the argument when working with specialized bases. You use the recursive formula that both the Bernstein basis and the orthogonal polynomial bases are endowed with. This is implemented in numpy, so you don't have to do anything yourself. Just call, for example, np.polynomial.legendre.legvander to get the features for the Legendre basis.
And a basis orthogonal over [-1,1] is easily made orthogonal over an arbitrary interval: take p_i to be the i-th Legendre polynomial; then the basis composed of q_i(x) = p_i(2(x-a)/(b-a) - 1) is orthogonal over [a,b]. Each q_i is itself a polynomial of degree i, but you never use its coefficients explicitly.
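Concretely, something like this (a sketch; the interval [a, b] and the degree are made up):

    import numpy as np

    a, b = 3.0, 17.0                    # some arbitrary feature range
    x = np.linspace(a, b, 200)

    # Map [a, b] onto [-1, 1], then let legvander build the Legendre features
    # via the three-term recurrence - no raw powers of x are ever formed
    t = 2 * (x - a) / (b - a) - 1
    features = np.polynomial.legendre.legvander(t, 50)   # shape (200, 51)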
There is an entire library for computing with polynomial approximants of functions over arbitrary intervals using orthogonal polynomials - Chebfun. The entire scientific and spectral differential equations community knows there are no numerical issues working with high degree polynomials over arbitrary intervals.
The ML community just hasn't caught up.
by toxigun
4/23/2025 at 9:28:32 AM
> The entire scientific and spectral differential equations community knows there are no numerical issues working with high degree polynomials over arbitrary intervals.
This is totally wrong. Of course there are enormous numerical problems with high degree polynomials. Computing large powers of large numbers is enormously unstable and needs to be avoided under basically all circumstances; that is what makes dealing with those polynomials difficult, and cautioning people against this is obviously correct.
What you described are the ways to deal with those problems. But this isn't what the article does. My problem with the article is the following:
- Does not mention the most important issue with high degree polynomials, namely their numerical instability.
- Gives examples, but does not explain why they circumvent the numerical problems at all. The most important choice made (the interval of 0 to 1) is portrayed as essentially arbitrary, not an absolute necessity.
- Concludes that there are no problems with high degree polynomials based on the fact that the experiments worked, not on the fact that the actual issues were circumvented, leaving the reader with a totally wrong impression of the issue.
This is terrible scholarship and makes for a terrible article. Not acknowledging the issue is a terrible thing to do, and not explaining why seemingly arbitrary choices are extremely important is another terrible thing to do. The whole article fails at actually portraying the issue at all.
To be clear, I am not saying that this approximation does not work or that with appropriate scaling and the right polynomials these issues can't be mostly circumvented. Or that high degree polynomials "in general" are incalculable. I am saying that this article completely fails to say what the issue is and why the examples work.
by constantcrying
4/23/2025 at 12:02:04 PM
I believe the author assumes it's clear to the reader that there is a distinction between how a mathematical object is defined and how it's used computationally. A polynomial can be defined as a sum of powers of x, but that's not how it is computed with. In this sense, the author was mistaken.
But it's not that the problems are "circumvented", in the sense of a hack or a patch; they are solved, in the sense that there is a systematic way to correctly compute with polynomials.
by toxigun
4/23/2025 at 12:35:00 PM
> But it's not that the problems are "circumvented", in the sense of a hack or a patch; they are solved, in the sense that there is a systematic way to correctly compute with polynomials.
But the author does not even acknowledge that there is a wrong way. He claims that it is just a "MYTH" that polynomials have huge problems.
Read the article, nowhere does the author acknowledge that these numerical issues exist at all. Nowhere does the author talk about why the specific methods he uses work or even acknowledge that had he used a naive approach and written the polynomials as a power series over a large interval everything would have fallen apart.
Surely the "MYTH" that high degree polynomials are dangerous is not a myth. Concrete examples where naive approaches fail are trivially easy to find.
Again. I think you are missing the point entirely. I am not saying that these Fourier type projections are "wrong" or "bad" or "numerically unstable" or that you can't write an article about those or that high degree polynomials are impossible to work with.
I am saying that if you claim that it is a "MYTH" that high degree polynomials are dangerous, you should mention why people think that is the case and why your method works. Everything else seems totally disingenuous.
by constantcrying
4/23/2025 at 1:39:27 PM
> Computing large powers of large numbers is enormously unstable and needs to be avoided under basically all circumstances; that is what makes dealing with those polynomials difficult, and cautioning people against this is obviously correct
But we don't compute large powers of large numbers? Chebyshev is on [-1, 1]. Your large powers go to zero. And your coefficients almost always decrease as the degree goes up. Then to top it off, you usually compute the sum of your terms in descending order, so that the biggest ones are added last.
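For what it's worth, the standard way to sum a Chebyshev series is Clenshaw's recurrence, which never forms explicit powers at all (as far as I know this is essentially what np.polynomial.chebyshev.chebval does). A minimal sketch:

    def clenshaw_chebyshev(c, x):
        # Evaluate sum_k c[k] * T_k(x) for x in [-1, 1] via Clenshaw's recurrence
        b1 = b2 = 0.0
        for ck in reversed(c[1:]):
            b1, b2 = ck + 2.0 * x * b1 - b2, b1
        return c[0] + x * b1 - b2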
by sevensor
4/23/2025 at 1:55:38 PM
Could you please read the post before commenting?
"To be clear, I am not saying that this approximation does not work or that with appropriate scaling and the right polynomials these issues can't be mostly circumvented. Or that high degree polynomials "in general" are incalculable. I am saying that this article completely fails to say what the issue is and why the examples work."
by constantcrying
4/23/2025 at 12:33:36 AM
Neural network training is harder when the input range is allowed to deviate from [-1, 1]. The only reason why it sometimes works for neural networks is because the first layer has a chance to normalize it.
by hervature
4/23/2025 at 5:40:07 AM
If you have a set of points, why can't you just map them to the interval [0, 1] before fitting the polynomial?
by jansan
4/22/2025 at 10:09:01 PM
The degree-10000 polynomial is definitely overfitting - every data point has its own little spike. The truncated curves look kind of nice, but the data points aren't shown on those plots, and the curves aren't very close to them.
There are also some enormous numbers hidden away here, with associated floating point precision problems; the article shows the coefficients are small, but that's because the polynomial basis functions themselves contain gigantic numbers. The Bernstein basis for degree 100 involves 100 choose 50, which is already > 10^29. You have to be careful calculating these polynomials or parts of your calculation exceed 64-bit floating point range, e.g. factorials of 10000, 2^10000. See the formulas and table in this section: https://en.wikipedia.org/wiki/Legendre_polynomials#Rodrigues...
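For the Bernstein case specifically, the standard workaround is de Casteljau's algorithm, which evaluates the polynomial without ever forming binomial coefficients like 100 choose 50. A sketch:

    import numpy as np

    def de_casteljau(coeffs, x):
        # Evaluate a polynomial given by its Bernstein coefficients at x in [0, 1],
        # using only convex combinations - no binomials, no huge intermediates
        b = np.array(coeffs, dtype=float)
        for _ in range(len(b) - 1):
            b = (1.0 - x) * b[:-1] + x * b[1:]
        return b[0]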
by mkl
4/23/2025 at 3:02:51 AM
You can probably calculate the value of P_n(x) using the recurrence:
P_n(x) = (2 - 1/n) x P_{n-1}(x) - (1 - 1/n) P_{n-2}(x)
This is numerically stable if x is in [0,1], but I haven't validated that you can get to very high n using floating point.
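That's the standard three-term (Bonnet) recurrence; tabulating the whole basis with it is straightforward (a sketch - as far as I know this is essentially what numpy's legvander does internally):

    import numpy as np

    def legendre_table(n, x):
        # Values of P_0..P_n at the points x (n >= 1), via the recurrence above
        x = np.asarray(x, dtype=float)
        P = np.empty((n + 1,) + x.shape)
        P[0], P[1] = 1.0, x
        for k in range(2, n + 1):
            P[k] = (2.0 - 1.0 / k) * x * P[k - 1] - (1.0 - 1.0 / k) * P[k - 2]
        return P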
by scythe
4/22/2025 at 7:28:06 PM
Well, one catch is that you're doing degree ten thousand. That alone already increases your computational costs and can run afoul of numerical error as your numbers start losing precision. You'll have to start worrying about overflow and underflow. Horner's method or another representation of your polynomial might start to become important. You'll also start to notice your CPU spiking more. In essence, this is kind of fitting more and more points until you get rid of every oscillation you don't want.
The other catch, which the author mentions, is that extrapolation is still essentially impossible with polynomials. This is easy to see: the highest-degree term will dominate all the other terms by an order of magnitude once you step out of the interval of interest. Every non-constant polynomial grows without bound.
Honestly, though, what the author is doing isn't much different than a Taylor approximation. If you're okay fitting any polynomial in a small neighbourhood of the data, then go whole hog and fit the best possible polynomial: a Taylor polynomial. You wouldn't usually need to go to that high degree to get what you want in a neighbourhood either.
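(For reference, Horner's method mentioned above is just the nested-multiplication form of evaluation - a quick sketch:)

    def horner(coeffs, x):
        # Evaluate c[0] + c[1]*x + ... + c[n]*x**n without forming explicit powers
        acc = 0.0
        for c in reversed(coeffs):
            acc = acc * x + c
        return acc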
by jordigh
4/23/2025 at 2:15:48 AM
> Horner's method or another representation of your polynomial might start to become important.
Horner's is the default way to evaluate a polynomial - and I think you're overstating the cost of evaluation.
> what the author is doing isn't much different than a Taylor approximation
Yes it is. The Taylor approximation is only valid around a point - and it's based on matching the nth order derivative. There are no error metrics. And it requires differentiation to find it.
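A rough illustration (my sketch; it assumes numpy's Chebyshev.interpolate is available): for Runge's function, a truncated Taylor series diverges once you leave the neighbourhood of the expansion point, while an interval-based Chebyshev fit of the same degree stays uniformly accurate.

    import numpy as np
    from numpy.polynomial.chebyshev import Chebyshev

    f = lambda x: 1.0 / (1.0 + 25.0 * x**2)     # Runge's function
    xs = np.linspace(-1, 1, 2001)

    # Degree-20 Taylor polynomial about 0: sum_k (-25)^k x^(2k), radius of convergence 0.2
    taylor = sum((-25.0)**k * xs**(2 * k) for k in range(11))

    # Degree-20 Chebyshev interpolant on [-1, 1]
    cheb = Chebyshev.interpolate(f, 20, domain=[-1, 1])

    print(np.max(np.abs(taylor - f(xs))))       # astronomically large near x = +/-1
    print(np.max(np.abs(cheb(xs) - f(xs))))     # a few percent at worst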
by grandempire
4/22/2025 at 7:42:50 PM
Taylor is for points. For functions on an interval you want Chebyshev interpolation, and you probably want to throw in some poles to do better if L1 rather than L2 matters.
by wbl
4/22/2025 at 8:28:09 PM
The original paper about the bias-variance tradeoff, which the double descent papers targeted, had some specific constraints:
1) Data availability and compute limited training set sizes.
2) They could simulate infinite datasets.
While challenging for our minds, training set sizes today make it highly likely that the patterns in your test set are similar to concept classes in your training set.
This is very different from procedurally or randomly generated test sets, both of which can lead to problems like overfitting with overparameterized networks.
When the chances are that similar patterns exist, the cost of some memorization goes down and is actually somewhat helpful for generalization.
There are obviously more factors at play here, but go look at the double descent papers and their citations to early-90's papers and you will see this.
The low sensitivity of transformers also helps dramatically: UHAT without CoT only has the expressiveness of TC0, and with log-space scratch space it has PTIME expressibility.
You can also view this through autograd requiring a smooth manifold with the ability to approximate the global gradient, if that works better for you.
But yes all intros have to simplify concepts, and there are open questions.
by nyrikki
4/22/2025 at 8:16:51 PM
> So... are _all_ introductions to machine learning just extremely wrong here?
It's more of a heuristic. Most people have their first experience in Excel, where you can fit a polynomial. Cranking up the degree will always improve R^2 (since Excel doesn't do a holdout), so it's a very common mistake for new students to make.
It's much more understandable at the beginner level to say "you'll overfit if you crank up the degree" than it is to explain regularization and bases. Later on you can introduce those, but early on it's confusing and distracting to students who might not even know how to solve an ordinary least squares model.
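The classic demo is easy to reproduce outside Excel too (a sketch with made-up data; numpy may warn that the high-degree fits are poorly conditioned, which is part of the point):

    import numpy as np

    rng = np.random.default_rng(0)
    x_train = np.sort(rng.uniform(0, 1, 20))
    y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.3, 20)
    x_test = np.sort(rng.uniform(0, 1, 200))
    y_test = np.sin(2 * np.pi * x_test) + rng.normal(0, 0.3, 200)

    def r2(y, yhat):
        return 1.0 - np.sum((y - yhat)**2) / np.sum((y - y.mean())**2)

    for deg in (1, 3, 5, 9, 15):
        c = np.polyfit(x_train, y_train, deg)                 # plain 1, x, x^2, ... basis, no holdout
        print(deg,
              round(r2(y_train, np.polyval(c, x_train)), 3),  # keeps climbing with degree
              round(r2(y_test, np.polyval(c, x_test)), 3))    # eventually falls off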
by StableAlkyne
4/23/2025 at 2:05:05 AM
Numerical analysis is ALL about picking the right tool for the job. Unfortunately, there are no super-methods, and you need to know how to classify problems.
by grandempire
4/23/2025 at 1:38:20 AM
It's not like they would sabotage competitors by training them wrong, as a joke.
by neuroelectron