Deep Learning Is Applied Topology

5/20/2025 at 4:32:22 PM

Since this post is based on my 2014 blog post (https://colah.github.io/posts/2014-03-NN-Manifolds-Topology/ ), I thought I might comment.

I tried really hard to use topology as a way to understand neural networks, for example in these follow ups:

- https://colah.github.io/posts/2014-10-Visualizing-MNIST/

- https://colah.github.io/posts/2015-01-Visualizing-Representa...

There are places I've found the topological perspective useful, but after a decade of grappling with trying to understand what goes on inside neural networks, I just haven't gotten that much traction out of it.

I've had a lot more success with:

* The linear representation hypothesis - The idea that "concepts" (features) correspond to directions in neural networks.

* The idea of circuits - networks of such connected concepts.

Some selected related writing:

- https://distill.pub/2020/circuits/zoom-in/

- https://transformer-circuits.pub/2022/mech-interp-essay/inde...

- https://transformer-circuits.pub/2025/attribution-graphs/bio...

by colah3

5/20/2025 at 4:57:27 PM

Related to ways of understanding neural networks, I've seen these views expressed a lot, which to me seem like misconceptions:

- LLMs are basically just slightly better `n-gram` models

- The idea of "just" predicting the next token, as if next-token-prediction implies a model must be dumb

(I wonder if this [1] popular response to Karpathy's RNN [2] post is partly to blame for people equating language neural nets with n-gram models. The stochastic parrot paper [3] also somewhat equates LLMs and n-gram models, e.g. "although she primarily had n-gram models in mind, the conclusions remain apt and relevant". I guess there was a time where they were more equivalent, before the nets got really really good)

[1] https://nbviewer.org/gist/yoavg/d76121dfde2618422139

[2] https://karpathy.github.io/2015/05/21/rnn-effectiveness/

[3] https://dl.acm.org/doi/pdf/10.1145/3442188.3445922

by montebicyclelo

5/20/2025 at 7:07:35 PM

I guess I'll plug my hobby horse:

The whole discourse of "stochastic parrots" and "do models understand" and so on is deeply unhealthy because it should be scientific questions about mechanism, and people don't have a vocabulary for discussing the range of mechanisms which might exist inside a neural network. So instead we have lots of arguments where people project meaning onto very fuzzy ideas and the argument doesn't ground out to scientific, empirical claims.

Our recent paper reverse engineers the computation neural networks use to answer in a number of interesting cases (https://transformer-circuits.pub/2025/attribution-graphs/bio... ). We find computation that one might informally describe as "multi-step inference", "planning", and so on. I think it's maybe clarifying for this, because it grounds out to very specific empirical claims about mechanism (which we test by intervention experiments).

Of course, one can disagree with the informal language we use. I'm happy for people to use whatever language they want! I think in an ideal world, we'd move more towards talking about concrete mechanism, and we need to develop ways to talk about these informally.

There was previous discussion of our paper here: https://news.ycombinator.com/item?id=43505748

by colah3

5/20/2025 at 10:05:06 PM

1) Isn't it unavoidable that a transformer - a sequential multi-layer architecture - is doing multi-step inference ?!

2) There are two aspects to a rhyming poem:

a) It is a poem, so must have a fairly high degree of thematic coherence

b) It rhymes, so must have end-of-line rhyming words

It seems that to learn to predict (hence generate) a rhyming poem, both of these requirements (theme/story continuation+rhyming) would need to be predicted ("planned") at least by the beginning of the line, since they are inter-related.

In contrast, a genre like freestyle rap may also rhyme, but flow is what matters and thematic coherence and rhyming may suffer as a result. In learning to predict (hence generate) freestyle, an LLM might therefore be expected to learn that genre-specific improv is what to expect, and that rhyming is of secondary importance, so one might expect less rhyme-based prediction ("planning") at the start of each bar (line).

by HarHarVeryFunny

5/21/2025 at 4:16:05 PM

> The whole discourse of "stochastic parrots" and "do models understand" and so on is deeply unhealthy [...] So instead we have lots of arguments where people project meaning onto very fuzzy ideas and the argument doesn't ground out to scientific, empirical claims.

I would put it this way: the question "do LLMs, etc understand?" is rooted in a category mistake.

Meaning, I am not claiming that it is premature to answer such questions because we lack a sufficient grasp of neutral networks. I am asserting that LLMs don't understand, because the question of whether they do is like asking whether A-flat is yellow.

by lo_zamoyski

5/20/2025 at 11:00:46 PM

Regardless of the mechanism, the foundational 'conceit' of LLMs is that by dumping enough syntax (and only syntax) into a sufficiently complex system, the semantics can be induced to emerge.

Quite a stretch, in my opinion (cf. Plato's Cave).

by somewhereoutth

5/21/2025 at 6:11:01 AM

> Regardless of the mechanism, the foundational 'conceit' of LLMs is that by dumping enough syntax (and only syntax) into a sufficiently complex system, the semantics can be induced to emerge.

Syntax has dual aspect. It is both content and behavior (code and execution, or data and rules, form and dynamics). This means syntax as behavior can process syntax as data. And this is exactly how neural net training works. Syntax as execution (the model weights and algorithm) processes syntax as data (activations and gradients). In the forward pass the model processes data, producing outputs. In the backward pass it is the weights of the model that become the data to be processed.

When such a self-generative syntactic system is in contact with an environment, in our case the training set, it can encode semantics. Inside the model data is relationally encoded in the latent space. Any new input stands in relation to all past inputs. So data creates its own semantic space with no direct access to the thing in itself. The meaning of a data point is how it stands in relation to all other data points.

Another important aspect is that this process is recursive. A recursive process can't be fully understood from outside. Godel, Turing, Chaitin prove that recursion produces blindspots, that you need to walk the recursive path to know it, you have to be it to know it. Training and inferencing models is such a process

The water carves its banks

The banks channel the water

Which is the true river?

Here, banks = model weights and water = language

by visarga

5/21/2025 at 2:57:33 AM

Anyone who has widely read topics across philosophy, science (physics, biology), economics, politics (policy, power), from practitioners, from original takes, news, etc. ... has managed to understand a tremendous number of relationships due to just words and their syntax.

While many of these relationships are related to things we see and do in trivial ways, the vast majority go far beyond anything that can be seen or felt.

What does economics look like? I don't know, but I know as I puzzle out optimums, or expected outcomes, or whatever, I am moving forms around in my head that I am aware of, can recognize and produce, but couldn't describe with any connection to my senses.

The same when seeking a proof for a conjecture in an idiosyncratic algebra.

Am I really dealing in semantics? Or have I just learned the graph-like latent representation for (statistical or reliable) invariant relationships in a bunch of syntax?

Is there a difference?

Don't we just learn the syntax of the visual world? Learning abstractions such as density, attachment, purpose, dimensions, sizes, that are not what we actually see, which is lots of dot magnitudes of three kinds. And even those abstractions benefit greatly from the words other people use describing those concepts. Because you really don't "see" them.

I would guess that someone who was born without vision, touch, smell or taste, would still develop what we would consider a semantic understanding of the world, just by hearing. Including a non-trivial more-than-syntactic understanding of vision, touch, smell and taste.

Despite making up their own internal "qualia" for them.

Our senses are just neuron firings. The rest is hierarchies of compression and prediction based on their "syntax".

by Nevermark

5/21/2025 at 7:30:46 AM

>Am I really dealing in semantics? Or have I just learned the graph-like latent representation for (statistical or reliable) invariant relationships in a bunch of syntax?

This and the rest of the comment are philosophical skepticism, and Kant blew this apart back when Hume's "bundle of experience" model of human subjects was considered an open problem in epistemology.

by viccis

5/21/2025 at 9:46:04 AM

Can you get into more detail and share some links? Inquiring minds want to know

by EGreg

5/22/2025 at 8:50:33 PM

This gives a good survey: https://plato.stanford.edu/entries/hume/#CopyPrin

>All our simple ideas in their first appearance are deriv’d from simple impressions, which are correspondent to them, and which they exactly represent.

>...he is so confident the correspondence holds that he challenges anyone who doubts it to produce an example of a simple impression without a corresponding simple idea, or a simple idea without a corresponding simple impression...

In other words, Hume thought that your ideas about things are a result, and only a result, of your impression of the thing. Knowledge must be, then, a posteriori. Indeed he reduces our "selves" into "bundles", which is to say nothing more than an accumulation of the various impressions we've received while living.

The problem with this is that it raises the question: How do we come up with novel thoughts that are not just reproducing things we have observed? (This is called synthetic a priori knowledge)

You can see at this point that this question is very similar to the one posed to AI right now. If it's nothing more than a bundle of information related to impressions it has received (by way of either text or image corpora), then can it really ever create anything novel that doesn't draw directly from an impression?

Kant delivered a decisive response to this with his Critique of Pure Reason and his Prolegomena to Any Future Metaphysics. He focused initially on Hume's biggest skepticism (about causality). Hume claimed that when we expect an effect from a cause, it's not because we're truly understanding how an effect can proceed from a cause, but rather because we've just observed it often enough that we expect it out of habit. Kant addresses this and expands it to any synthetic a priori statements.

He does so by dispensing with the idea that we can truly know everything there is to know about concepts. We simply schematize an understanding of the objects after observing them over time and use our sense of reason to generalize that collection of impressions into objects in our mind. From there, we can apply categorical reasoning that can be applied without the need for empirical evidence, and then produce synthetic a priori statements, to include expecting a specific effect from a specific cause. This is opposed to Hume, who said:

>It is far better, Hume concludes, to rely on “the ordinary wisdom of nature”, which ensures that we form beliefs “by some instinct or mechanical tendency”, rather than trusting it to “the fallacious deductions of our reason”

Hume's position was somewhat of a dead end, and Kant rescued philosophy (particularly epistemology and metaphysics) from it in many peoples' estimation.

The big difference between us and LLMs is that (right now), LLMs don't have a thinking component that transcends their empirical data modeling. It is conceivable that someone might produce an "AI" system that uses the LLM as a sensory apparatus and combines it with some kind of pure logic reasoning system (ironically, the kind of thing that old school AI focused on) to give it that kind of reasoning power. Because without something applying reasoning, all we have is some statistical patterns that we hope can give the best answer, but which can't guarantee anything.

by viccis

5/21/2025 at 5:43:35 AM

> Anyone who has widely read topics across philosophy, science (physics, biology), economics, politics (policy, power), from practitioners, from original takes, news, etc. ... has managed to understand a tremendous number of relationships due to just words and their syntax.

You're making a slightly different point from the person you're answering. You're talking about the combination of words (with intelligible content, presumably) and the syntax that enables us to build larger ideas from them. The person you're answering is saying that LLM work on the principle that it's possible for intelligence to emerge (in appearance if not in fact) just by digesting a syntax and reproducing it. I agree with the person you're answering. Please excuse the length of the below, as this is something I've been thinking about a lot lately, so I'm going to do a short brain dump to get it off my chest:

The Chinese Room thought experiment --treated by the Stanford Encyclopedia of Philosophy as possibly the single most discussed and debated thought experiment of the latter half of the 20th century -- argued precisely that no understanding can emerge from syntax, and thus by extension that 'strong AI', that really, actually understands (whatever we mean by that) is impossible. So plenty of people have been debating this.

I'm not a specialist in continental philosophy or social thought, but, similarly, it's my understanding that structuralism argued essentially the one can (or must) make sense of language and culture precisely by mapping their syntax. There aren't structulists anymore, though. Their project failed, because their methods don't work.

And, again, I'm no specialist, so take this with a grain of salt, but poststructuralism was, I think, built partly on the recognition that such syntax is artificial and artifice. The content, the meaning, lives somewhere else.

The 'postmodernism' that supplanted it, in turn, tells us that the structuralists were basically Platonists or Manicheans -- treating ideas as having some ideal (in a philosophical sense) form separate from their rough, ugly, dirty, chaotic embodiments in the real world. Postmodernism, broadly speaking, says that that's nonsense (quite literally) because context is king (and it very much is).

So as far as I'm aware, plenty of well informed people whose very job is to understand these issues still debate whether syntax per se confers any understanding whatsoever, and the course philosophy followed in the 20th century seems to militate, strongly, against it.

by globnomulous

5/21/2025 at 8:12:50 AM

I am using syntax in a general form to mean patterns.

We are talking about LLMs and the debate seems to be around whether learning about non-verbal concepts through verbal patterns (i.e. syntax that includes all the rules of word use, including constraints reflecting relations between words meaning, but not communication any of that meaning in more direct ways) constitutes semantic understanding or not.

In the end, all the meaning we have is constructed from the patterns our senses relay to us. We construct meaning from those patterns.

I.e. LLMs may or may not “understand” as well or deeply as we do. But what they are doing is in the same direction.

by Nevermark

5/21/2025 at 3:25:12 PM

> In the end, all the meaning we have is constructed from the patterns our senses relay to us. We construct meaning from those patterns.

Appears quite bold. What sense-relays inform us about infinity or other mathematical concepts that don't exist physically? Is math-sense its own sense that pulls from something extra-physical?

Doesn't this also go against Chomsky's work, the poverty of stimulus. That it's the recursive nature of language that provides so much linguistic meaning and ability, not sense data, which would be insufficient?

by meroes

5/24/2025 at 2:01:07 AM

What sense-relays inform us about infinity…

A waterfall that never seems to run dry. The ocean. Outer space. Time.

I think infinitude, i.e. a property of something which never ends, is a simplifying abstraction for the many things we come across in reality for which we can’t know where it ends or, in a specific context, we don’t care.

by kevinventullo

5/22/2025 at 5:41:25 PM

> Appears quite bold. What sense-relays inform us about infinity or other mathematical concepts that don't exist physically?

A great point. A fantastic question.

My guess is:

1. We learn useful patterns that are not explicitly in our environment, but are good simpler approximations to work with.

Some of these patterns only mean something in a given context, or are statistical happenstance.

But some of them are actual or approximate abstractions, potentially applicable to many other things.

2. Then we reason about these patterns.

Sometimes we create new patterns that reveal deeper insights about our environment.

Sometimes we create nonsense, which is either obviously nonsense, fools those who don't reason carefully (i.e. bullshit). And some nonsense is so psychologically attractive that it helps some of us pose and believe we are special and connected to higher planes.

And sometimes we create patterns that offer deeper insights into patterns themselves. I.e. abstractions, like counting numbers, arithmetic, logic, and infinity.

It is worth considering, that the progression of abstractions, from unary counting, to more scalable number notations, zero as a number, negative numbers, etc. took a couple hundred thousands years to get going. But once we got going, every small new abstraction helped progress compound faster and faster.

At the level of abstract thinking, I view humans as intelligent as a species, not as individuals. Even the greatest minds, a very small proportion of us, had to stand on innumerable inherited abstractions to make significant progress.

Now many people contribute to new abstractions, but we have inherited powerful abstractions about abstractions to work with.

Erase all that accumulated knowledge for a new generation of humans, and very few would make much or any accumulated progress in explicit abstractions for a very long time.

by Nevermark

5/21/2025 at 11:28:08 AM

Curious what you make of symbolic mathematics, then - in particular, systems like Mathematica which can produce true and novel mathematical facts by pure syntactic manipulation.

The truth is, syntax and semantics are strongly intertwined and not cleanly separable. A "proof" is merely a syntactically valid string in some formal system.

by dTal

5/20/2025 at 8:14:37 PM

Absolutely, the first task should be to understand how and why black boxes with emergent properties actually work, in order to further knowledge - but importantly, in order to improve them and build on the acquired knowledge to surpass them. That implies, curbing «parrot[ing]» and inadequate «understand[ing]».

I.e. those higher concepts are kept in mind as a goal. It is healthy: it keeps the aim alive.

by mdp2021

5/21/2025 at 5:58:28 AM

My favorite argument against SP is zero shot translation. The model learns Japanese-English and Swahili-English and then can translate Japanese-Swahili directly. That shows something more than simple pattern matching happens inside.

Besides all arguments based on model capabilities, there is also an argument from usage - LLMs are more like pianos than parrots. People are playing the LLM on the keyboard, making them 'sing'. Pianos don't make music, but musicians with pianos do. Bender and Gebru talk about LLMs as if they work alone, with no human direction. Pianos are also dumb on their own.

by visarga

5/21/2025 at 8:35:56 AM

The translation happens because of token embeddings. We spent a lot of time developing rich embeddings that capture contextual semantics. Once you learn those, translation is “simply” embedding in one language, and disembedding in another.

This does not show complex thinking behavior, although there are probably better examples. Translation just isn’t really one of them.

by Hendrikto

5/22/2025 at 12:08:15 AM

Furthermore: Learning additional languages fine tunes the embedding.

by spartanatreyu

5/21/2025 at 9:23:32 AM

This is also the problem I have with John Searle’s Chinese room

by EGreg

5/21/2025 at 7:38:29 AM

> The model learns Japanese-English and Swahili-English and then can translate Japanese-Swahili directly. That shows something more than simple pattern matching happens inside.

The "water story" is a pivotal moment in Helen Keller's life, marking the start of her communication journey. It was during this time that she learned the word "water" by having her hand placed under a running pump while her teacher, Anne Sullivan, finger-spelled the word "w-a-t-e-r" into her other hand. This experience helped Keller realize that words had meaning and could represent objects and concepts.

As the above human experience shows, aligning tokens from different modalities is the first step in doing anything useful.

by nthingtohide

5/20/2025 at 7:54:29 PM

1000%. It's really hard to express this to non-engineers who never wasted years of their life trying to work with n-grams and NLTK (even topic models) to make sense of textual data... Projects I dreamed of circa 2012 are now completely trivial. If you do have that comparison ready-at-hand, the problem of understanding what this mind-blowing leap means, to which end I find writing like the OP helpful, is so fascinating and something completely different than complaining that it's a "black box."

I've expressed this on here before, but it feels like the everyday reception of LLMs has been so damaged by the general public having just gotten a basic grasp on the existence of machine learning.

by agentcoops

5/20/2025 at 5:42:57 PM

Thanks for the follow up. I've been following your circuits thread for several years now. I find the linear representation hypothesis very compelling, and I have a draft of a review for Toy Models of Superposition sitting in my notes. Circuits I find less compelling, since the analysis there feels very tied to the transformer architecture in specific, but what do I know.

Re linear representation hypothesis, surely it depends on the architecture? GANs, VAEs, CLIP, etc. seem to explicitly model manifolds. And even simple models will, due to optimization pressure, collapse similar-enough features into the same linear direction. I suppose it's hard to reconcile the manifold hypothesis with the empirical evidence that simple models will place similar-ish features in orthogonal directions, but surely that has more to do with the loss that is being optimized? In Toy Models of Superposition, you're using a MSE which effectively makes the model learn an autoencoder regression / compression task. Makes sense then that the interference patterns between co-occurring features would matter. But in a different setting, say a contrastive loss objective, I suspect you wouldn't see that same interference minimization behavior.

by theahura

5/20/2025 at 6:05:50 PM

> Circuits I find less compelling, since the analysis there feels very tied to the transformer architecture in specific, but what do I know.

I don't think circuits is specific to transformers? Our work in the Transformer Circuits thread often is, but the original circuits work was done on convolutional vision models (https://distill.pub/2020/circuits/ )

> Re linear representation hypothesis, surely it depends on the architecture? GANs, VAEs, CLIP, etc. seem to explicitly model manifolds

(1) There are actually quite a few examples of seemingly linear representations in GANs, VAEs, etc (see discussion in Toy Models for examples).

(2) Linear representations aren't necessarily in tension with the manifold hypothesis.

(3) GANs/VAEs/etc modeling things as a latent gaussian space is actually way more natural if you allow superposition (which requires linear representations) since central limit theorem allows superposition to produce Gaussian-like distributions.

by colah3

5/20/2025 at 6:24:52 PM

> the original circuits work was done on convolutional vision models

O neat, I haven't read that far back. Will add it to the reading list.

To flesh this out a bit, part of why I find circuits less compelling is because it seems intuitive to me that neural networks more or less smoothly blend 'process' and 'state'. As an intuition pump, a vector x matrix matmul in an MLP can be viewed as changing the basis of an input vector (ie the weights act as a process) or as a way to select specific pieces of information from a set of embedding rows (ie the weights act as state).

There are architectures that try to separate these out with varying degrees of success -- LSTMs and ResNets seem to have a more clear throughline of 'state' with various 'operations' that are applied to that state in sequence. But that seems really architecture-dependent.

I will openly admit though that I am very willing to be convinced by the circuits paradigm. I have a background in molecular bio and there's something very 'protein pathways' about it.

> Linear representations aren't necessarily in tension with the manifold hypothesis.

True! I suppose I was thinking about a 'strong' form of linear representations, which is something like: features are represented by linear combinations of neurons that display the same repulsion-geometries as observed in Toy Models, but that's not what you're saying / that's me jumping a step too far.

> GANs/VAEs/etc modeling things as a latent gaussian space is actually way more natural if you allow superposition

Superposition is one of those things that has always been so intuitive to me that I can't imagine it not being a part of neural network learning.

But I want to make sure I'm getting my terminology right -- why does superposition necessarily require the linear representation hypothesis? Or, to be more specific, does [individual neurons being used in combination with other neurons to represent more features than neurons] necessarily require [features are linear compositions of neurons]?

by theahura

5/20/2025 at 7:00:47 PM

> True! I suppose I was thinking about a 'strong' form of linear representations, which is something like: features are represented by linear combinations of neurons that display the same repulsion-geometries as observed in Toy Models, but that's not what you're saying / that's me jumping a step too far.

Note this happens in "uniform superposition". In reality, we're almost certainly in very non-uniform superposition.

One key term to look for is "feature manifolds" or "multi-diemsnional features". Some discussion here: https://transformer-circuits.pub/2024/july-update/index.html...

(Note that the term "strong linear representation" is becoming a term of art in the literature referring to the idea that all features are linear, rather than just most or some.)

> I want to make sure I'm getting my terminology right -- why does superposition necessarily require the linear representation hypothesis? Or, to be more specific, does [individual neurons being used in combination with other neurons to represent more features than neurons] necessarily require [features are linear compositions of neurons]?

When you say "individual neurons being used in combination with other neurons to represent more features than neurons", that's a way one might _informally_ talk about superposition, but doesn't quite capture the technical nuance. So it's hard to know the full scope of what you intend. All kinds of crazy things are possible if you allow non-linear features, and it's not necessarily clear what a feature would mean.

Superposition, in the narrow technical sense of exploiting compressed sensing / high-dimensional spaces, requires linear representations and sparsity.

by colah3

5/20/2025 at 8:16:10 PM

> One key term to look for is "feature manifolds" or "multi-diemsnional features"

I should probably read the updates more. Not enough time in the day. But yea the way you're describing feature manifolds and multidimensional features, especially the importance of linearity-in-properties and not necessarily linearity-in-dimensions, makes a lot of sense and is basically how I default think about these things.

> but doesn't quite capture the technical nuance. So it's hard to know the full scope of what you intend.

Fair, I'm only passingly familiar with compressed sensing so I'm not sure I could offer a more technical definition without, like, a much longer conversation! But it's good to know in the future that in a technical sense linear representations and superposition are dependent.

> all features are linear, rather than just most or some

Potentially a tangent, but compared to what? I suppose the natural answer is "non linear features" but has there been anything to suggest that neural networks represent concepts in this way? I'd be rather surprised if they did within a single layer. (Across layers, sure, but that actually starts to pull me more towards circuits)

by theahura

5/21/2025 at 2:36:27 PM

I was going to comment the same about the Superposition hypothesis [0], when the OP comment (edit: Update: The OP commenter is (as pointed by other HN comments, the cofounder of Anthropic) behind the Superposition research) mentioned about "I've had a lot more success with: * The linear representation hypothesis - The idea that "concepts" (features) correspond to directions in neural networks", as this concept-per-NN-feature idea seems too "basic" to explain some of the learning which NNs can do on datasets. On one of our custom trained neural network models (not LLM, but audio-based and currently proprietary) we noticed the same of the ML model being able to "overfit" on a large amount of data despite not many few parameters relative to the size of the dataset (and that too with dropout in early layers).

[0] https://www.anthropic.com/research/superposition-memorizatio...

by rajnathani

5/20/2025 at 11:29:04 PM

This has mirrored my experience attempting to "apply" topology in real world circumstances, off and on since I first studied topology in 2011.

I even hesitate now at the common refrain "real world data approximates a smooth, low dimensional manifold." I want to spend some time really investigating to what extent this claim actually holds for real world data, and to what extent it is distorted by the dimensionality reduction method we apply to natural data sets in order to promote efficiency. But alas, who has the time?

by j2kun

5/20/2025 at 5:35:39 PM

I think it's interesting that in physics, different global symmetries (topological manifolds) can satisfy the same metric structure (local geometry). For example, the same metric tensor solution to Einstein's field equation can exist on topologically distinct manifolds. Conversely, looking at solutions to the Ising Model, we can say that the same lattice topology can have many different solutions, and when the system is near a critical point, the lattice topology doesn't even matter.

It's only an analogy, but it does suggest at least that the interesting details of the dynamics aren't embedded in the topology of the system. It's more complicated than that.

by riemannzeta

5/20/2025 at 6:06:54 PM

If you like symmetry, you might enjoy how symmetry falls out of circuit analysis of conv nets here:

https://distill.pub/2020/circuits/equivariance/

by colah3

5/20/2025 at 10:47:40 PM

Thanks for this additional link, which really underscores for me at least how you're right about patterns in circuits being a better abstraction layer for capturing interesting patterns than topological manifolds.

I wasn't familiar with the term "equivariance" but I "woke up" to this sort of approach to understanding deep neural networks when I read this paper, which shows how restricted boltzman machines have an exact mapping to the renormalization group approach used to study phase transitions in condensed matter and high energy physics:

https://arxiv.org/abs/1410.3831

At high enough energy, everything is symmetric. As energy begins to drain from the system, eventually every symmetry is broken. All fine structure emerges from the breaking of some symmetries.

I'd love to get more in the weeds on this work. I'm in my own local equilibrium of sorts doing much more mundane stuff.

by riemannzeta

5/20/2025 at 8:38:15 PM

That earlier post had a few small HN discussions (for those interested):

Neural Networks, Manifolds, and Topology (2014) - https://news.ycombinator.com/item?id=19132702 - Feb 2019 (25 comments)

Neural Networks, Manifolds, and Topology (2014) - https://news.ycombinator.com/item?id=9814114 - July 2015 (7 comments)

Neural Networks, Manifolds, and Topology - https://news.ycombinator.com/item?id=7557964 - April 2014 (29 comments)

by dang

5/20/2025 at 7:03:01 PM

Loved these posts and they inspired a lot of my research and directions during my PhDs.

For anyone interested in these may I also suggest learning about normalizing flows? (They are the broader class to flow matching) They are learnable networks that learn coordinate changes. So the connection to geometry/topology is much more obvious. Of course the down side of flows is you're stuck with a constant dimension (well... sorta) but I still think they can help you understand a lot more of what's going on because you are working in a more interpretable environment

by godelski

5/20/2025 at 5:05:58 PM

hey chris, I found your posts quite inspiring back then, with very poetic ideas. cool to see you follow up here!

by winwang

5/21/2025 at 4:35:24 PM

Consider looking into fields related to machine learning to see how topology is used there. The main problem is that some of the cool math did not survive the transition to CS, e.g. the math for control theory is not quite present in RL.

In terms of topology, control theory has some very cool topological interpretations, e.g. toruses appear quite a bit in control theory.

by adamnemecek

5/20/2025 at 5:52:49 PM

My guess is that the linear representation hypothesis is only approximately right in the sense that my expectation is that it is more like a Lie Group. Locally flat, but the concept breaks at some point. Note that I am a mathematician who knows very little about machine learning apart from taking a few classes at uni

by iNic

5/20/2025 at 11:39:40 PM

The linear representation hypothesis is rather quite intreguing, I am curious what was the intuition behind it.

by 3abiton

5/20/2025 at 11:47:41 PM

See https://transformer-circuits.pub/2022/toy_model/index.html#m...

If you're new to this, I'd mostly just look at all the empirical examples.

The slightly harder thing is to consider the fact that neural networks are made of linear functions with non-linearities between them, and to try to think about when linear directions will be computationally natural as a result.

by colah3

5/20/2025 at 2:06:53 PM

If it was topology we wouldn't bother to warp the manifold so we can do similarity search. No, it's geometry, with a metric. Just as in real life, we want to be able to compare things.

Topological transformation of the manifold happens during training too. That makes me wonder: how does the topology evolve during training? I imagine it violently changing at first before stabilizing, followed by geometric refinement. Here are some relevant papers:

* Topology and geometry of data manifold in deep learning (https://arxiv.org/abs/2204.08624)

* Topology of Deep Neural Networks (https://jmlr.org/papers/v21/20-345.html)

* Persistent Topological Features in Large Language Models (https://arxiv.org/abs/2410.11042)

* Deep learning as Ricci flow (https://www.nature.com/articles/s41598-024-74045-9)

by esafak

5/20/2025 at 4:03:13 PM

> Topological transformation of the manifold happens during training too. That makes me wonder: how does the topology evolve during training?

If you've ever played with GANs or VAEs, you can actually answer this question! And the answer is more or less 'yes'. You can look at GANs at various checkpoints during training and see how different points in the high dimensional space move around (using tools like UMAP / TSNE).

> I imagine it violently changing at first before stabilizing, followed by geometric refinement

Also correct, though the violent changing at the beginning is also influenced the learning rate and the choice of optimizer.

by theahura

5/20/2025 at 4:06:53 PM

And crucially, the initialization algorithm.

by esafak

5/20/2025 at 2:35:43 PM

Agree, if anything it's Applied Linear Algebra...but that sounds less exotic.

by profchemai

5/20/2025 at 3:17:13 PM

Well, we know it is non-linear. More like differential equations.

by lostmsu

5/20/2025 at 2:34:40 PM

I really liked this article, though I don't know why the author is calling the idea of finding a separating surface between two classes of points "topology." For instance, they write

"If you are trying to learn a translation task — say, English to Spanish, or Images to Text — your model will learn a topology where bread is close to pan, or where that picture of a cat is close to the word cat."

This is everything that topology is not about: a notion of points being "close" or "far." If we have some topological space in which two points are "close," we can stretch the space so as to get the same topological space, but with the two points now "far". That's the whole point of the joke that the coffee cup and the donut are the same thing.

Instead, the entire thing seems to be a real-world application of something like algebraic geometry. We want to look for something like an algebraic variety the points are near. It's all about geometry and all about metrics between points. That's what it seems like to me, anyway.

by ComplexSystems

5/20/2025 at 3:06:23 PM

> This is everything that topology is not about

100 percent true.

I can only hope that in an article that is about two things, i) topology and ii) deep learning, the evident confusions are contained within one of them -- topology, only.

by srean

5/20/2025 at 4:07:25 PM

fair, I was using 'topology' more colloquially in that sentence. Should have said 'surface'.

by theahura

5/20/2025 at 5:28:01 PM

Ah! That clears it up.

You then mean Deep Learning has a lot in common with differential geometry and manifolds in general. That I will definitely agree with. DG and manifolds have far richer and informative structure than topology.

by srean

5/20/2025 at 11:43:48 PM

If I had to give a loose definition of topology, I would say that it is actually about studying spaces which have some notion of what is close and far, even if no metric exists. The core idea of neighborhoods in point set topology captures the idea of points being nearby another point, and allows defining things like continuity and sequence convergence which require a notion of closeness. From Wikipedia [0] for example

The terms 'nearby', 'arbitrarily small', and 'far apart' can all be made precise by using the concept of open sets. If we change the definition of 'open set', we change what continuous functions, compact sets, and connected sets are. Each choice of definition for 'open set' is called a topology. A set with a topology is called a topological space.

Metric spaces are an important class of topological spaces where a real, non-negative distance, also called a metric, can be defined on pairs of points in the set. Having a metric simplifies many proofs, and many of the most common topological spaces are metric spaces.

That's not to say that topology is necessarily the best lens for understanding neural networks, and the article's author has shown up in the comments to state he's moved on in his thinking. I'm just trying to clear up a misconception.

[0] https://en.wikipedia.org/wiki/General_topology

by steppi

5/20/2025 at 2:55:06 PM

The title, as it stands, is trite and wrong. More about that a little later. The article on the other hand is a pleasant read.

Topology is whatever little structure that remains in geometry after you throwaway distances, angles, orientations and all sorts of non tearing stretchings. It's that bare minimum that still remains valid after such violent deformations.

While notion of topology is definitely useful in machine learning, -- scale, distance, angles etc., all usually provide lots of essential information about the data.

If you want to distinguish between a tabby cat and a tiger it would be an act of stupidity to ignore scale.

Topology is useful especially when you cannot trust lengths, distances angles and arbitrary deformations. That happens, but to claim deep learning is applied topology is absurd, almost stupid.

by srean

5/20/2025 at 4:05:33 PM

> Topology is useful especially when you cannot trust lengths, distances angles and arbitrary deformations

But...you can't. The input data lives on a manifold that you cannot 'trust'. It doesn't mean anything apriori that an image of a coca-cola can and an image of a stopsign live close to each other in pixel space. The neural network applies all of those violent transformations you are talking about

by theahura

5/20/2025 at 4:18:41 PM

> But...you can't.

Only in a desperate sales pitch or a desparate research grants. There are of course some situations were certain measurements are untrustworthy, but to claim that is the common case is very snake oily.

When certain measurements become untrustworthy, that it does so only because of some unknown smooth transformation, is not very frequent (this is what purely topological methods will deal with). Random noise will also do that for you.

Not disputing the fact that sometimes metrics cannot be trusted entirely, but to go to a topological approach seems extreme. One should use as much of the relevant non-topological information as possible.

As the hackneyed example goes a topological methods would not be able to distinguish between a cup and a donut. For that you would need to trust non-topological features such as distances and angles. Deep learning methods can indeed differentiate between cop-nip and coffee mugs.

BTW I am completely on-board with the idea that data often looks as if it has been sampled from an unknown, potentially smooth, possibly non-Euclidean manifold and then corrupted by noise. In such cases recovering that manifold from noisy data is a very worthy cause.

In fact that is what most of your blogpost is about. But that's differential geometry and manifolds, they have structure far richer than a topology. For example they may have tangent planes, a Reimann metric or a symplectic form etc. A topological method would throw all of that away and focus on topology.

by srean

5/20/2025 at 4:35:37 PM

I don't think that was their point, I think their point was that neural networks 'create' their optimization space by using lengths, distances, and angles. You can't reframe it from a topological standpoint, otherwise optimization spaces of some similar neural networks on similar problems would topologically comparable, which is not true.

by kentuckyrobby

5/20/2025 at 5:45:10 PM

Well, sorta. There is some evidence to suggest that neural networks learn 'universal' features (cf Anthropic's circuits thread). But I'll openly admit to being out of my depth here, and maybe I just don't understand OPs point

by theahura

5/20/2025 at 4:54:19 PM

once you get into the nitty gritty, a lot of things that wouldn't matter if it were pure topology, do, like number of layers all the way to quantization/fp resolution

by throwawaymaths

5/20/2025 at 5:33:21 PM

The word "topology" has a legitimate dictionary definition, that has none of the requirements that you're asserting. I think what you're missing is that it has two definitions.

by quantadev

5/20/2025 at 5:40:12 PM

In blog posts about specialised and technical topics it is expected that in-domain technical keywords that have long established definitions and meanings be used in the same technical sense. Otherwise it can become quite confusing. Gravity means gravity when we are talking Newtonian mechanics. Similarly, in math and ML 'topology' has a specific meaning.

by srean

5/20/2025 at 6:18:21 PM

The word "topology" is quite commonly used in all kinds of books, papers, and technical materials any time they're discussing geometric characteristics of surfaces. The term is probably used 1000000 times more commonly in this more generic way than it's ever used in the strict pedantic way you're asserting that it must.

by quantadev

5/20/2025 at 6:57:17 PM

Surfaces certainly have a topology (potentially more than one), surfaces are examples of one kind of a topological space, in fact the next interesting one after a curve. So I will not be surprised at all with co-occurrences of 'surface' and 'topology'. But surfaces and topologies mean different things.

Dogs have fur. Dogs are an example of a furry animal. But dogs and furs are not the same thing although they may appear in the same text often.

Topology is a traditional as well as an active branch of applied and pure mathematics, well, Physics too.

It has tons of text books printed on it, has several active journals and conferences dealing with it. https://www.amazon.com/s?k=Topology&sprefix=topology+%2Caps%...

Surprise, surprise ...not ...has an extensive Wikipedia page.

https://en.m.wikipedia.org/wiki/Topology

Math magazines for high schoolers have articles on it. Colleges offer multiple courses on it. Some of those courses would be mandatory for a degree in even undergrad mathematics.

If one wants to do graduate studies then one can do a Masters or a PhD in Topology, well in one of it's many branches.

It's also not a new kid on the block. It goes back to ... analysis situs ... further back to Leibniz, although it began to crystalize formally after Poincare.

If someone wants to use the phrase 'differential calculus' to mean something else in their love letters and sweet nothings, that's absolutely fine :) but in Maths (and Machine Learning, well, with quality of peer reviewing this might soon be iffy) it has a well established and unambiguous meaning.

Note because of its shared beginning at the feet of Leibniz, comparing it with calculus is not an unfair comparison.

by srean

5/20/2025 at 7:41:21 PM

The most common uses of "topology", whenever used to convey a geometry-related idea, is in the more general sense meaning "surfaces". Only one out of a million times is anyone ever referring to the specific mathematical field of the same name to which you refer.

by quantadev

5/20/2025 at 7:53:14 PM

LOL the very first line of your own link

(that now you seem to have deleted after my comment https://en.m.wikipedia.org/wiki/Topology_(disambiguation) )

"Topology is a branch of mathematics concerned with geometric properties preserved under continuous deformation (stretching without tearing or gluing)"

That is indeed the established meaning of topology, more so in mathematics and the blog post was on applied mathematics. That it may mean something else in other contexts is irrelevant.

I rest my case.

> The most common uses of "topology", whenever used to convey a geometry-related idea, is in the more general sense meaning "surfaces"

Erm, citation please.

I included a search on Amazon on topology https://www.amazon.com/s?k=Topology&sprefix=topology+%2Caps%... (without even adding the keyword maths. None of the results seem to be about surfaces. Shouldn't there have been a few ? Wouldn't Amazon search results reflect the general sense meaning ?).

If it were true, wouldn't the Wikipedia pages have talked about that general sense meaning first ?

Alternatively, I would say, take a breath. Is this hill really the one worth dying on ? There are better ones. Have a good day and if work permits, get yourself a juicy topology book, it can be interesting, if presented well.

by srean

5/20/2025 at 8:19:58 PM

The question was never "Is topology a field of mathematics". The question was, is that term most often used to refer to surfaces in general, and the answer to that is still 'yes'.

by quantadev

5/20/2025 at 8:45:00 PM

https://www.amazon.com/s?k=Topology&sprefix=topology+%2Caps%...

Could you drop Amazon a message. They should really change the search results if that's what topology means in general. I am sure they would be delighted to receive the bug report.

For laughs, I asked ChatGPT to use 'topology' in place of 'surface'. Here's what it wrote:

    The topology of the lake was so calm it reflected the mountains perfectly.

    She wiped the kitchen topology clean after cooking dinner.

    After years of silence, the truth began to topology.

    The spacecraft landed safely on the topology of Mars.

    A thin layer of dust had settled on the topology of the old table.

    He barely scratched the topology of the topic in his presentation.

    As the submarine ascended, it broke through the topology of the ocean.

    The topology of the road was slick with ice.

    Despite her calm topology, she was extremely nervous inside.

    The paint bubbled and peeled off the topology due to the heat.

Our disagreement aside, I think these are hilarious. We should agree on that.

The best was

    The surface of a doughnut resembles that of a coffee mug due to their similar structure.

by srean

5/20/2025 at 9:41:18 PM

> Alternatively, I would say, take a breath. Is this hill really the one worth dying on ?

This applies both ways. Maybe you could relax with the facetiousness? It doesn’t help your argument and makes you look like an ass.

by TimorousBestie

5/20/2025 at 6:14:28 PM

The phrase "applied X" invokes the technical, scientific, or academic meaning of X. So for example, "applied chemistry" does not refer to one's experience on a dating app.

by cvoss

5/20/2025 at 6:22:41 PM

The word "topology" is _much_ more commonly used as a general synonym for "surfaces" than in any other way.

by quantadev

5/20/2025 at 2:49:42 PM

Thanks for sharing. I also tend to view learning in terms of manifolds. It's a powerful representation.

> I'm personally pretty convinced that, in a high enough dimensional space, this is indistinguishable from reasoning

I actually have journaled extensively about this and even written some on Hacker News about it with respect to what I've been calling probabilistic reasoning manifolds:

> This manifold is constructed via learning a decontextualized pattern space on a given set of inputs. Given the inherent probabilistic nature of sampling, true reasoning is expressed in terms of probabilities, not axioms. It may be possible to discover axioms by locating fixed points or attractors on the manifold, but ultimately you're looking at a probabilistic manifold constructed from your input set.

> But I don't think you can untie this "reasoning" from your input data. It's possible you will find "meta-reasoning", or similar structures found in any sufficiently advanced reasoning manifold, but these highly decontextualized structures might be entirely useless without proper recontextualization, necessitating that a reasoning manifold is trained on input whose patterns follow learnable underlying rules, if the manifold is to be useful for processing input of that kind.

> Decontextualization is learning, decomposing aspects of an input into context-agnostic relationships. But recontextualization is the other half of that, knowing how to take highly abstract, sometimes inexpressible, context-agnostic relationships and transform them into useful analysis in novel domains

Full comment: https://news.ycombinator.com/item?id=42871894

by soulofmischief

5/20/2025 at 3:05:04 PM

Are you talking about reasoning in general, reasoning qua that mental process which operates on (representations of) propositions?

In which case, I cannot understand " true reasoning is expressed in terms of probabilities, not axioms "

One of the features of reasoning is that it does not operate in this way. It's highly implausible animals would have been endowed with no ability to operate non-probabilistically on propositions represented by them, since this is essential for correct reasoning -- and a relatively trivial capability to provide.

Eg., "if the spider is in boxA, then it is not everywhere else" and so on

by mjburgess

5/20/2025 at 3:32:20 PM

Propositions are just predictions, they all come with some level of uncertainty even if we ignore that uncertainty for practical purposes.

Any validation of a theory is inherently statistical, as you must sample your environment with some level of precision across spacetime, and that level of precision correlates to the known accuracy of hypotheses. In other words, we can create axiomatic systems of logic, but ultimately any attempt to compare them to reality involves empirical sampling.

Unlike classical physics, our current understanding of quantum physics essentially allows for anything to be "possible" at large enough spacetime scales, even if it is never actually "probable". For example, quantum tunneling, where a quantum system might suddenly overcome an energy barrier despite lacking the required energy.

Every day when I walk outside my door and step onto the ground, I am operating on a belief that gravity will work the same way every time, that I won't suddenly pass through the Earth's crust or float into the sky. We often take such things for granted, as axiomatic, but ultimately all of our reasoning is based on statistical correlations. There is the ever-minute possibility that gravity suddenly stops working as expected.

> if the spider is in boxA, then it is not everywhere else

We can't even physically prove that. There's always some level of uncertainty which introduces probability into your reasoning. It's just convenient for us to say, "it's exceedingly unlikely in the entire age of the universe that a macroscopic spider will tunnel from Box A to Box B", and apply non-probabilistic heuristics.

It doesn't remove the probability, we just don't bother to consider it when making decisions because the energy required for accounting for such improbabilities outweighs the energy saved by not accounting for them.

As mentioned in my comment, there's also the possibility that universal axioms may be recoverable as fixed points in a reasoning manifold, or in some other transformation. If you view these probabilities as attractors on some surface, fixed points may represent "axioms" that are true or false under any contextual transformation.

by soulofmischief

5/20/2025 at 6:22:41 PM

This response doesn't fill me with confidence. You aren't really engaging with any of the actual issues your position entails.

A proposition is not a prediction. A prediction is either an estimate of the value of some quantity ("the dumb ML meaning of prediction") or a proposition which describes a future scenario. We can trivially enumerate propositions that do not describe future scenarios, eg., 2 + 2 = 4.

Uncertainty is a property of belief attitudes towards propositions, it isn't a feature of their semantic content. A person doesnt mean anything different by "2 + 2 = 4" if they are 80 or 90% sure of it.

> We can't even physically prove that.

Irrelevant. Our minds are not constrained by physical possibility, necessarily so, as we know very little about what is physically possible. I can imagine abitary number of cases, arising out of logical manipulation of propositons, that are not physically possible. (Eg., "Superman can lift any building. The empire state building is so-and-so a kind of building. Imagine(Superman lifting the empire state building)").

The infinite variety of our imagination is a trivial consequence of non-probabilistic operations on propositions, it's incomprehensibly implausible as a consequence of merely probabilistic ones.

That nature seems to have endowed minds with discrete operations, that these are empirical in operation across very wide classes of reasoning, including imagination, that these seem trivial for nature to provide (etc.) render the notion that they don't exist highly highly implausible.

There is nothing lacking explanation here. The relevant mental processes we have to hand are fairly obvious and fairly easy to explain.

Its an obvious act of credulity to try and find some way to make the latest trinkets of the recent rich some sort of miracle. All of these projects of "incredible abstraction" follow around these hype cycles, turning lead into gold: if x "is really" y, and y "is really" z, and ..., then x is amazin! This piles towers of every more general hollowed-out words on top of each other until the most trivial thing sounds like a wonder.

by mjburgess

5/20/2025 at 5:51:50 PM

> It's highly implausible animals would have been endowed with no ability to operate non-probabilistically on propositions represented by them, since this is essential for correct reasoning

Why would animals need to evolve 100% correct reasoning if probabilistically correct reasoning suffices? If probabilistic reasoning is cheaper in terms of energy then correct reasoning is a disadvantage.

by naasking

5/20/2025 at 6:08:48 PM

It doesnt suffice. It's also vastly energetically cheaper just to have (algorithmic) negation. Compressing (A, not A) into a probability function is extremely incomprehensibly expensive.

by mjburgess

5/20/2025 at 10:03:26 PM

> It's also vastly energetically cheaper just to have (algorithmic) negation.

Even if true, that's an argument that it's cheaper to have something, not that it's cheaper to develop it through natural selection. Training time and energy for LLMs shows how energy intensive training to get to the point of grokking/circuit generalization.

by naasking

5/20/2025 at 10:38:23 PM

It is a matter of empirical fact that we can reason with logical relationships. Thus taking an LLM and it's training as a model of conginition is empriically false.

It should be obviously doubly so, since as a model -- as you point out -- it makes trivial aspects of our cognition impossibly expensive to acqurie.

by mjburgess

5/21/2025 at 2:30:31 PM

> It is a matter of empirical fact that we can reason with logical relationships

It is a matter of empirical fact that our ability to correctly reason with logical relationships only has high statistical likelihood, not certainty. This looks less like actual logic and more like a probabilistic model of logic.

by naasking

5/20/2025 at 3:14:37 PM

I suspect, as a layperson who watches people make decisions all the time, that somewhere in our mind is a "certainty checker".

We don't do logic itself, we just create logic from certainty as part of verbal reasoning. It's our messy internal inference of likelihoods that causes us to pause and think, or dash forward with confidence, and convincing others is the only place we need things like "theorems".

This is the only way I can square things like intuition, writing to formalize thoughts, verbal argument, etc, with the fact that people are just so mushy all the time.

by jvanderbot

5/20/2025 at 6:26:35 PM

People are only mushy in their verbalised reasoning, because its the nature of such reasoning to handle hard cases. Animal cognition, at its basic levels, is incredibly refined and makes necessary use of logic, flawlessly, frequently.

This naive cynicism about our mental capacities is a product of this credulity about statistical AI. If one beings with an earnest study of animal intelligence, in order to describe it, it disappears. It's exactly and only a project of the child playing with his lego, certain that great engineering projects have little use for any more than stacking bricks.

by mjburgess

5/20/2025 at 6:59:53 PM

Well, we disagree fundamentally. And, I applaud the heavy handed use of condescension.

Logical propositions ("2+2=4 regardless of my certainty about it") seem a long way from necessary or sufficient to survival for animals. A fuzzy heatmap of "where is prey going" or "How many prey over there" is much closer to necessary and sufficient. The fact that measurements or senses can update those estimates is a long way from a logical deduction.

Something more like probability factor graph can do it, without the pernicious use of "concepts" or too much need for implication, which is sticky and overly rigorous.

That's all I have to say, and I doubt we'll find middle ground.

by jvanderbot

5/20/2025 at 10:36:55 PM

You can enumerate, all you wish, all the fuzzy judgements we need to make. This confirms a capacity for uncertain reasoning. It says nothing about the trivial and innumerable ways concepts compose both in content (imagine that A and-also B) and in logical relation (eg., imagine that not A and B).

The point of my "condescension" was to point out that people of your position are arguing from ignorance, with confirmation bias -- ie., having no study of animal intelligence, and only ever repeating what they know about their own study of irrelevant systems.

Your reply evidences this exactly. Zero engagement with any facts on the ground about actual animal intelligence. Are you really actually trying to account for animal intelligence, or as I have claimed twice now, are you only really wishing to maintain your ignorance of it, dismiss any analysis of it, and instead "confirm" that whatever you are aware of "must, presumably, apply".

Imagine being faced with such bad faith over and over and over again. It is like arguing with people who insist the world is flat, and when challenged, point to euclidean geometry and the flatness of the pavement under their feet. If I begin by anticipating such behaviour, you can see why.

by mjburgess

5/21/2025 at 2:05:22 PM

Hey it's possible you're absolutely right, but our arguments have about equal value (none) because they are both presented as personal opinions with zero external support. I just had the decency to admit that up front.

by jvanderbot

5/20/2025 at 2:12:35 PM

Data doesn't actually live on a manifold. It's an approximation used for thinking about data. Near total majority, if not 100%, of the useful things done in deep learning have come from not thinking about topology in any way. Deep learning is not applied anything, it's an empirical field advanced mostly by trial and error and, sure, a few intuitions coming from theory (that was not topology).

by umutisik

5/20/2025 at 2:23:45 PM

I disagree with this wholeheartedly. Sure, there is lots of trial and error, but it’s more an amalgamation of theory from many areas of mathematics including but not limited to: topology, geometry, game theory, calculus, and statistics. The very foundations (i.e. back-propagation) is just the chain rule applied to the weights. The difference is that deep learning has become such an accessible (sic profitable) field that many practitioners have the luxury of learning the subject without having to learn the origins of the formalisms. Ultimately allowing them to utilize or “reinvent” theories and techniques often without knowing they have been around in other fields for much longer.

by sota_pop

5/20/2025 at 3:20:35 PM

None of the major aspects of deep learning came from manifolds though.

It is primarily linear algebra, calculus, probability theory and statistics, secondarily you could add something like information theory for ideas like entropy, loss functions etc.

But really, if "manifolds" had never been invented/conceptualized, we would still have deep learning now, it really made zero impact on the actual practical technology we are all using every day now.

by saberience

5/20/2025 at 3:58:37 PM

Loss landscapes can be viewed as manifolds. Adagrad/ADAM adjust SGD to better fit the local geometry and are widely used in practice.

by qbit42

5/20/2025 at 2:53:05 PM

Can you give an example where theories and techniques from other fields are reinvented? I would be genuinely interested for concrete examples. Such "reinventions" happen quite often in science, so to some degree this would be expected.

by kwertzzz

5/20/2025 at 3:02:48 PM

Bethe ansatz is one. It took a toure de force by Yedidia to recognize that loopy belief propagation is computing the stationary point of Bethe's approximation to Free Energy.

Many statistical thermodynamics ideas were reinvented in ML.

Same is true for mirror descent. It was independently discovered by Warmuth and his students as Bregman divergence proximal minimization, or as a special case would have it, exponential gradient algorithms.

One can keep going.

by srean

5/20/2025 at 3:11:41 PM

The connections of deep learning to stat-mech and thermodynamics are really cool.

It's led me to wonder about the origin of the probability distributions in stat-mech. Physical randomness is mostly a fiction (outside maybe quantum mechanics) so probability theory must be a convenient fiction. But objectively speaking, where then do the probabilities in stat-mech come from? So far, I've noticed that the (generalised) Boltzmann distribution serves as the bridge between probability theory and thermodynamics: It lets us take non-probabilistic physics and invent probabilities in a useful way.

by ogogmad

5/20/2025 at 3:21:09 PM

In Boltzmann's formulation of stat-mech it comes from the assumption that when a system is in "equilibrium", then all the micro-states that are consistent with the macro-state are equally occupied. That's the basis of the theory. A prime mover is thermal agitation.

It can be circular if one defines equilibrium to be that situation when all the micro-states are equally occupied. One way out is to define equilibrium in temporal terms - when the macro-states are not changing with time.

by srean

5/20/2025 at 4:23:54 PM

The Bayesian reframing of that would be that when all you have measured is the macrostate, and you have no further information by which to assign a higher probability to any compatible microstate than any other, you follow the principle of indifference and assign a uniform distribution.

by mitthrowaway2

5/20/2025 at 5:03:37 PM

Yes indeed, thanks for pointing this out. There are strong relationships between max-ent and Bayesian formulations.

For example one can use a non-uniform prior over the micro-states. If that prior happens to be in the Darmois-Koopman family that implicitly means that there are some non explicitly stated constraints that bind the micro-state statistics.

by srean

5/20/2025 at 3:08:47 PM

One might add 8-16-bit training and quantization. Also, computing semi-unreliable values with error correction. Such tricks have been used in embedded, software development on MCU's for some time.

by nickpsecurity

5/20/2025 at 3:55:36 PM

I mean the entire domain of systems control is being reinvented by deep RL. System identification, stability, robustness etc

by whatever1

5/20/2025 at 5:31:59 PM

Good one. Slightly different focus but they really are the same topic. Historically, Control Theory has focused on stability and smooth dynamics while RL has traditionally focused on convergence of learning algorithms in discrete spaces.

by srean

5/20/2025 at 2:22:01 PM

> a few intuitions coming from theory (that was not topology).

I think these 'intuitions' are an after-the-fact thing, meaning AFTER deep learning comes up with a method, researchers in other fields of science notice the similarities between the deep learning approach and their (possibly decades old) methods. Here's an example where the author discovers that GPT is really the same computational problems he has solved in physics before:

https://ondrejcertik.com/blog/2023/03/fastgpt-faster-than-py...

by behnamoh

5/20/2025 at 3:00:22 PM

I beg to differ. It's complete hyperbole to suggest that the article said "it's the same problem as something in physics", given this statement:

     It seems that the bottleneck algorithm in GPT-2 inference is matrix-matrix multiplication. For physicists like us, matrix-matrix multiplication is very familiar, *unlike other aspects of AI and ML* [emphasis mine]. Finding this familiar ground inspired us to approach GPT-2 like any other numerical computing problem.

Note: Matrix-matrix multiplication is basic mathematics, and not remotely interesting as physics. It's not physically interesting.

by ogogmad

5/20/2025 at 3:26:50 PM

Agreed.

Although, to try to see it from the author’s perspective, it is pulling tools out of the same (extremely well developed and studied in it’s own right) toolbox as computational physics does. It is a little funny although not too surprising that a computational physics guy would look at some linear algebra code and immediately see the similarity.

Edit: actually, thinking a little more, it is basically absurd to believe that somebody has had a career in computational physics without knowing they are relying heavily on the HPC/scientific computing/numerical linear algebra toolbox. So, I think they are just using that to help with the narrative for the blog post.

by bee_rider

5/20/2025 at 4:25:05 PM

You are exactly right, after deep learning researchers had invented Adam for SGD, numerical analysts finally discovered Gradient descent. And after the first neural net was discovered, finally the matrix was invented in the novel field of linear algebra.

by constantcrying

5/20/2025 at 3:07:36 PM

I say this as someone who has been in deep learning for over a decade now: this is pretty wrong, both on the merits (data obviously lives on a manifold) and on its applications to deep learning (cf chris olah's blog as an example from 2014, which is linked in my post -- https://colah.github.io/posts/2014-03-NN-Manifolds-Topology/). Embedding spaces are called 'spaces' for a reason. GANs, VAEs, contrastive losses -- all of these are about constructing vector manifolds that you can 'walk' to produce different kinds of data.

by theahura

5/20/2025 at 6:55:44 PM

If data did live on a manifold contained, e.g. images in R^{n^2}, then it wouldn't have thickness or branching, which it does. It's an imperfect approximation to help think about it. The use of mathematical language is not the same as an application of mathematics (and the use of the word 'space' there is not about topology).

by umutisik

5/20/2025 at 3:40:34 PM

You're citing a guy that never went to college (has no math or physics degree), has never published a paper, etc. I guess that actually tracks pretty well with how strong the whole "it's deep theory" claim is.

by almostgotcaught

5/20/2025 at 3:50:43 PM

Chris Olah has never published a paper? ... https://scholar.google.com/citations?user=6dskOSUAAAAJ&hl=en...

by smsx

5/20/2025 at 4:22:31 PM

[dead]

by wetpaws

5/20/2025 at 3:58:27 PM

Chris Olah? One of the founders of Anthropic and the head of their interpretability team?

by theahura

5/20/2025 at 2:29:42 PM

It’s alchemy.

Deep learning in its current form relates to a hypothetical underlying theory as alchemy does to chemistry.

In a few hundred years the Inuktitut speaking high schoolers of the civilisation that comes after us will learn that this strange word “deep learning” is a left over from the lingua franca of yore.

by niemandhier

5/20/2025 at 2:58:05 PM

Not really, most of the current approaches are some approximations of the partition function.

by adamnemecek

5/20/2025 at 4:17:05 PM

The reason deep learning is alchemy is that none of these deep theories have predictive ability.

Essentially all practical models are discovered by trial and error and then "explained" after the fact. In many papers you read a few paragraphs of derivation followed by a simpler formulation that "works better in practice". E.g., diffusion models: here's how to invert the forward diffusion process, but actually we don't use this, because gradient descent on the inverse log likelihood works better. For bonus points the paper might come up with an impressive name for the simple thing.

In most other fields you would not get away with this. Your reviewers would point this out and you'd have to reformulate the paper as an experience report, perhaps with a section about "preliminary progress towards theoretical understanding". If your theory doesn't match what you do in practice - and indeed many random approaches will kind of work (!) - then it's not a good theory.

by fmap

5/20/2025 at 4:23:20 PM

It's true that there is no directly predictive model of deep learning, and it's also true that there is some trial and error, but it is wrong to say that therefore there is no operating theory at all. I recommend reading Ilyas 30 papers (here's my review of that set: https://open.substack.com/pub/theahura/p/ilyas-30-papers-to-...) to see how shared intuitions and common threads are clearly developed over the last decade+

by theahura

5/21/2025 at 8:33:02 AM

That is a great list, do you know of something similar that is more recent?

by niemandhier

5/20/2025 at 3:11:34 PM

It does if you relax your definition to accommodate approximation error, cf. e.g., Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning (https://aclanthology.org/2021.acl-long.568.pdf)

by esafak

5/20/2025 at 2:32:10 PM

> Data doesn't actually live on a manifold.

Often, they do (and then they are called "sheaves").

by Koshkin

5/20/2025 at 3:29:29 PM

Many types of data don’t. Disconnected spaces like integer spaces don’t sit on a manifold (they are lattices). Spiky noisy fragmented data don’t sit on a (smooth) manifold.

In fact not all ML models treat data as manifolds. Nearest neighbors, decision trees don’t require the manifold assumption and actually work better without it.

by wenc

5/20/2025 at 4:04:12 PM

Any reasonable statistical explanation of deep learning requires there to be some sort of low dimensional latent structure in the data. Otherwise, we would not have enough training data to learn good models, given how high the ambient dimensions are for most problems.

by qbit42

5/20/2025 at 4:46:12 PM

Deep learning specifically yes. It needs a manifold assumption. But not data in general which was what I was responding to.

by wenc

5/20/2025 at 4:25:50 PM

It turns out a lot of disconnected spaces can be approximated by smooth ones that have really sharp boundaries, which more or less seems to be how neural networks will approximate something like discrete tokens

by theahura

5/20/2025 at 4:51:22 PM

Can be approximated yes. Approximated well? No, but you can get away with it sometimes with saturation functions like softmax. But badly. It’s like trying to solve an integer program as a linear program. You end up with a relaxation that is not integral and not the real answer.

An integer lattice can only be a manifold in a trivial sense (dimension 0). But not for any positive dimensions.

by wenc

5/20/2025 at 2:21:41 PM

Your comment sits in the nice gradient between not seeing at all the obvious relationships between deep learning and topology and thinking that deep learning is applied topology.

See? Everything lives in the manifold.

Now for a great visualization about the Manifold Hypothesis I cannot recommend more this video: https://www.youtube.com/watch?v=pdNYw6qwuNc

That helps to visualize how the activation functions, bias and weights (linear transformations) serve to stretch the high dimensional space so that data go into extremes and become easy to put in a high dimension, low dimensional object (the manifold) where is trivial to classify or separate.

Gaining an intuition about this process will make some deep learning practices so much easy to understand.

by motoboi

5/20/2025 at 2:21:03 PM

I cannot understand this prideful resentment of theory common among self-described practitioners.

Even if existing theory is inadequate, would an operating theory not be beneficial?

Or is the mystique combined with guess&check drudgery job security?

by thuuuomas

5/20/2025 at 3:08:47 PM

If there were theory that led to directly useful results (like, telling you the right hyperparameters to use for your data in a simple way, or giving you a new kind of regularization that you can drop in to dramatically improve learning) then deep learning practitioners would love it. As it currently stands, such theories don't really exist.

by canjobear

5/20/2025 at 4:00:55 PM

This is way too rigorous. You can absolutely have theories that lead to useful results even if they aren't as predictive as you describe. Theory of evolution for an obvious counterpoint.

by theahura

5/20/2025 at 3:54:48 PM

Useful theories only come to exist because someone started by saying they must exist and then spent years or lifetimes discovering them.

by fiddlerwoaroof

5/20/2025 at 2:24:43 PM

There are strong incentives to leave theory as technical debt and keep charging forward. I don't think it's resentment of theory, everyone would love a theory if one were available but very few are willing to forgoe the near term rewards to pursue theory. Also it's really hard.

by jebarker

5/20/2025 at 3:18:31 PM

There are many reasons to believe a theory may not be forthcoming, or that if it is available may not be useful.

For instance, we do not have consensus on what a theory should accomplish - should it provide convergence bounds/capability bounds? Should it predict optimal parameter counts/shapes? Should it allow more efficient calculation of optimal weights? Does it need to do these tasks in linear time?

Even materials science in metals is still cycling through theoretical models after thousands of years of making steel and other alloys.

by lumost

5/20/2025 at 2:23:54 PM

Maybe a little less with the ad hominems? The OP is providing an accurate description of an extremely immature field.

by hiddencost

5/20/2025 at 2:47:50 PM

Many mathematicians are (rightly, IMO) allergic to assertions that certain branches are not useful (explicit in OP) and especially so if they are dismissive of attempts to understand complicated real world phenomema (implicit in OP, if you ask me).

by cnity

5/20/2025 at 3:49:56 PM

Who is proud? What you are seeing in some cases is eye rolling. And it's fair eye rolling.

There is an enormous amount of theory used in the various parts of building models, there just isn't an overarching theory at the very most convenient level of abstraction.

It almost has to be this way. If there was some neat theory, people would use it and build even more complex things on top of it in an experimental way and then so on.

by danielmarkbruce

5/20/2025 at 3:52:22 PM

Just a side comment to your observation: the principle is called reductionism and has been tried on many fields.

Physics is just applied mathematics

Chemistry is just applied physics

Biology is just applied chemistry

It doesn’t work very well.

by baxtr

5/20/2025 at 4:22:00 PM

>it's an empirical field advanced mostly by trial and error and, sure, a few intuitions coming from theory (that was not topology).

Neural Networks consist almost exclusively of two parts, numerical linear algebra and numerical optimization.

Even if you reject the abstract topological description. Numerical linear algebra and optimization couldn't be any more directly applicable.

by constantcrying

5/20/2025 at 3:57:51 PM

> Near total majority, if not 100%, of the useful things done in deep learning have come from not thinking about topology in any way.

Of course. Now, to actually deeply understand what is happening with these constructs, we will use topology. Topoligical insights will without doubt then inform the next generations of this technology.

by yubblegum

5/20/2025 at 4:01:35 PM

May I ask you to give examples of insights from topology which improved existing models, or at least improved our understanding of them? arxiv papers are preferred.

by solomatov

5/20/2025 at 3:49:23 PM

I feel like the fact that ML has no good explanation why it works this well gives a lot of people room to invent their head-canon, usually from their field of expertise. I've seen this from exceptionally intelligent individuals too. If you only have a hammer...

by Regic

5/20/2025 at 4:58:24 PM

I think it would be more unusual, and concerning, if an intelligent individual didn't attempt to apply their expertise for a head-canon of something unknown.

Coming up with an idea for how something works, by applying your expertise, is the fundamental foundation of intelligence, learning, and was behind every single advancement of human understanding.

People thinking is always a good thing. Thinking about the unknown is better. Thinking with others is best, and sharing those thoughts isn't somehow bad, even if they're not complete.

by nomel

5/20/2025 at 5:17:03 PM

When you say ML, I assume you really mean LLMs?

Even with LLMs, there's no real mystery about why they work so well - they produce human-like input continuations (aka "answers") because they are trained to predict continuations of human-generated training data. Maybe we should be a bit surprised that the continuation signal is there in the first place, but given that it evidentially is, it's no mystery that LLMs are able to use it - just testimony to the power of the Transformer as a predictive architecture, and of course to gradient descent as a cold unthinking way of finding an error minimum.

Perhaps you meant how LLMs work, rather than why they work, but I'm not sure there's any real mystery there either - the transformer itself is all about key-based attention, and we now know that training a transformer seems to consistently cause it to leverage attention to learn "induction heads" (using pairs of adjacent attention heads) that are the main data finding/copying primitive they use to operate.

Of course knowing how an LLM works in broad strokes isn't the same as knowing specifically how it is working in any given case, how is it transforming a specific input layer by layer to create the given output, but that seems a bit like saying that because I can't describe - precisely - why you had pancakes for breakfast, that we don't know how the brains works.

by HarHarVeryFunny

5/20/2025 at 3:13:44 PM

"All models are wrong, but some are useful" -George Box

by csimon80

5/20/2025 at 3:37:10 PM

I don't agree with your first sentence, but I agree with the rest of this post.

by woopwoop

5/20/2025 at 2:21:34 PM

Once I read "This has been enough to get us to AGI.", credibility took a nose dive.

In general it's a nice idea, but the blogpost is very fluffy, especially once it connects it to reasoning, there is serious technical work in this area (i.g. https://arxiv.org/abs/1402.1869) that has expanded this idea and made it more concrete.

by profchemai

5/20/2025 at 3:39:39 PM

Another type of topology you’ll encounter in deep neural networks (DNNs) is network topology. This refers to the structure of the network — how the nodes are connected and how data flows between them. We already have several well-known examples, such as auto-encoders, convolutional neural networks (CNNs), and generative adversarial networks (GANs), all of which are bio-inspired.

However, we still have much to learn about the topology of the brain and its functional connectivity. In the coming years, we are likely to discover new architectures — both internal within individual layers/nodes and in the ways specialized networks connect and interact with each other.

Additionally, the brain doesn’t rely on a single network, but rather on several ones — often referred to as the "Big 7" — that operate in parallel and are deeply interconnected. Some of these include the Default Mode Network (DMN), the Central Executive Network (CEN) or the Limbic Network, among others. In fact, a single neuron can be part of multiple networks, each serving different functions.

We have not yet been able to fully replicate this complexity in artificial systems, and there is still much to be learned and inspired by from this "network topologies".

So, "Topology is all you need" :-)

by vayllon

5/25/2025 at 9:07:35 AM

>as long as we can separate good from bad we can train a neural network to sort out the topology for us.

10-ish years ago, I saw a project training networks to guess biological sex from face photos. They carefully removed makeup, moustache, hair, etc, so the model would be unbiased, yet they only reached 70 to 80% correct guesses. Yet it seemed like a great result, and they were trying to reach 99%.

First thing I did after reading their paper, was seek a paper where people would try and guess the biological sex from similar photos. And people weren't that much better at it. The difference between people and machine guessing was 1 or 2 percent.

I asked the guys that run the project, how they proved that such a division, based only on a photo, was even possible. They didn't understand the question, they just assumed that you can do it.

They couldn't improve their results in the end. Maybe they sucked at teaching neural networks, or maybe a lot of faces just are unisex if you remove gender markers.

I bring this anecdote because this guys, in my eyes, made a reasonable assumption. An assumption that since they in most situations can guess what's in someone pants by seeing someones face, the face has this information.

The assumption that we could somehow separate good from bad, when we rewrite school books every year, when we try to calculate "half-life of knowledge", when philosophy as a science isn't over, and every day there are political and ideological debates about what's best, is a very-very unreasonable assumption.

by lesostep

5/25/2025 at 9:09:39 AM

I forgot to conclude:

In the end, it's not even reasonable to assume that such a divide between "good" and "bad" exists at all.

by lesostep

5/20/2025 at 3:13:59 PM

I'm confused by the author's diagram claiming that AGI/ASI are points on the same manifold as next token prediction, chat models, and CoT models. While the latter three are provably part of the same manifold, what justifies placing AGI/ASI there too?

What if the models capable of CoT aren't and will never be, regardless of topological manipulation, capable of processes that could be considered AGI? For example, human intelligence (the closest thing we know to AGI) requires extremely complex sensory and internal feedback loops and continuous processing unlike autoregressive models' discrete processing.

As a layman, this matches my intuition that LLMs are not at all in the same family of systems as the ones capable of generating intelligence or consciousness.

by terabytest

5/20/2025 at 4:12:21 PM

Possible. AGI/ASI are poorly defined. I tend to think we're already at AGI, obviously many disagree.

> For example, human intelligence (the closest thing we know to AGI) requires extremely complex sensory and internal feedback loops and continuous processing unlike autoregressive models' discrete processing.

I've done a fair bit of connectomics research and I think that this framing elides the ways in which neural networks and biological networks are actually quite similar. For example, in mice olfactory systems there is something akin to a 'feature vector' that appears based on which neurons light up. Specific sets of neurons lighting up means 'chocolate' or 'lemon' or whatever. More generally, it seems like neuronal representations are somewhat similar to embedding representations, and you could imagine constructing an embedding space based on what neurons light up where. Everything on top of the embeddings is 'just' processing.

by theahura

5/20/2025 at 5:00:46 PM

I believe we already have the technology required for AGI. It perhaps is analogous to a lunar manned station or a 2 mile tall skyscrapper. We have the technology required to build it, but we don't for various reasons.

by fusionadvocate

5/20/2025 at 5:20:40 PM

For the last few years, I've been "seeing maps" whenever I think about LLMs. It's always felt like the most natural way to understand what is going on.

It's also for this reason that I think new knowledge is discoverable from with in LLMs.

I imagine having a topographic map of some island that has only been explored partially by humans. But if I know the surrounding topography, I can make pretty accurate guesses about the areas I haven't been. And I think the same thing can be applied to certain areas of human knowledge, especially when represented as text or symbolically.

by ada1981

5/20/2025 at 3:22:52 PM

The question is not so much whether this is true—we can certainly represent any data as points on a manifold. Rather, it’s the extent to which this point of view is useful. In my experience, it’s not the most powerful perspective.

In short, direct manifold learning is not really tractable as an algorithmic approach. The most powerful set of tools and theoretical basis for AI has sprung from statistical optimization theory (SGD, information-theoretical loss minimization, etc.). The fact that data is on a manifold is a tautological footnote to this approach.

by _alternator_

5/20/2025 at 7:43:33 PM

I want to share one more related observation: by definition, topology math refers to geometrical objects and transformations. But there exists another, more computer-esque definition of topology that defines relations between abstract objects.

For example, let's take a look at graph data structure. A graph has a set of stored objects (vertices) and a set of stored relations between the vertices (edges). In this way, graph defines a topology in discrete form.

Let's take a look at network data structure which is closely related to the graph. It is very much the same idea, but it additionally has a value stored in every edge. A network has a set of objects (vertices) and a set of relations between the objects (edges), while edges also hold edge values. So it is also a form of topology because the network defines the relations between the abstract objects.

In this light, you can view a graph as a neural network with {0, 1} weights. The graph edge is either present or absent, hence {0, 1} values only. The network structure, however, can hold any assigned value in every edge, so every connection between objects (neurons) can be characterized not only by its presence, but also by edge-assigned values (weights). Now we get the full model of a neural network. And yes, it is built upon topology in its discrete form.

by garganzol

5/20/2025 at 2:16:05 PM

I’ve always enjoyed this framing of the subject, the idea of mapping anything as hyperplanes existing in a solution space is one of the ideas that really blew my hair back during my academic studies. I would nitpick at your “dots in a circle example - with the stoner reference joke” I could be mistaken, but common practice isn’t to “move to a higher dimension”, but use a kernel (i.e. parameterize the points into the polar |r,theta> basis). All things considered, nice article.

by sota_pop

5/20/2025 at 4:15:25 PM

I'm pulling directly from Chris Olah's blog post with that example. But I will say that in practice, its always surprising how increasing the dimensionality of a neural network magically solves all sorts of problems. You could use a kernel if you don't have more computation available, but given more computation adding a dimension is strictly more flexible (and is capable of separating a much wider range of datasets)

by theahura

5/20/2025 at 5:33:42 PM

Your explanation of finding a surface to separate good reasoning traces from bad reasoning traces in a high dimensional space worked as a great framing of the problem. It seems though that the surface will be fractal - the distance between a good trace and a bad trace could be arbitrarily small. If so then the work required to find and compute better and better surfaces will grow arbitrarily large. I wonder if there is a rigorous way to determine if the surface is fractal or not.

by benl

5/20/2025 at 4:37:05 PM

> One way to think about neural networks, especially really large neural networks, is that they are topology generators. That is, they will take in a set of data and figure out a topology where the data has certain properties. Those properties are in turn defined by the loss function.

Latent spaces may or may not have useful topology, so this idea is inherently wrong, and builds the wrong type of intuition. Different neural nets will result in different feature space understanding of the same data, so I think it's incorrect to believe you're determining intrinsic geometric properties from a given neural net. I don't think people should throw around words carelessly because all that does is increase misunderstanding of concepts.

In general, manifolds can help discern useful characteristics about the feature space, and may have useful topological structures, but trying to impose an idea of "topology" on this is a stretch. Moreover, the kind of basics examples used in this blog post don't help prove the author's point. Maybe I am misunderstanding this author's description of what they mean, but this idea of manifold learning is nothing new.

by nis0s

5/20/2025 at 2:17:50 PM

This is also how I've often thought about deep learning -- focusing on the geometry of the data at each layer rather than the weights and biases is far more revealing.

I've always been hopeful that some algebraic topology master would dig into this question and it'd provide some better design principles for neural nets. which activation functions? how much to fan in/out? how many layers?

by parpfish

5/20/2025 at 4:01:12 PM

I was one of the people that was super excited after reading the Chris Olah blogpost from 2014, and over the past decade I've seen the insight go exactly nowhere. It's neat but it hasn't driven any interesting results, though Ayasdi did some interesting stuff with TDA and Gunnar Carlson has been playing around with neural nets recently.

by Graviscalar

5/20/2025 at 4:30:41 PM

I think it's incorrect that the insight has gone nowhere. See, for eg, contrastive loss / clip, or vqgan image generation. Arguably also diffusion models.

More generally, in my experience as an AI researcher, understandings of the geometry of data leads directly to changes in model architecture. Though people disparage that as "trial and error" it is far more directed than people on the outside give credit for.

by theahura

5/20/2025 at 4:39:28 PM

The geometric intuition is solid, but actually applying topology has been less fruitful in spite of a lot of people trying their best, as Chris Olah himself has said elsewhere in this thread.

by Graviscalar

5/20/2025 at 9:26:14 PM

Ayasdi immediately came to mind too seeing this post. I haven't thought of them in a long time, looks like they got bought out in 2019, prepandemic too which was probably best since mid pandemic had a lot of poor valuations

https://www.symphonyai.com/news/financial-services/ayasdi-jo...

by jxramos

5/20/2025 at 4:07:00 PM

What would you have expected to happen?

Advances and insights sometimes lie dormant for decades or more before someone else picks them up and does something new.

by rubitxxx8

5/20/2025 at 4:21:40 PM

I would expect model/algorithm improvements from using topological concepts to analyze the manifolds in question or concrete results in model interpretability. Gunnar has studied some toy examples, but they were barely a step up from the ones Olah constructed for the sake of explanation and they haven't borne any further fruit.

You can say any advance or insight is just lying dormant, it doesn't mean anything unless you can specifically articulate why it still has potential. I haven't made any claims on the future of the intersection of deep learning and topology, I was pointing out that it's been anything but dormant given the interest in it but it hasn't lead anywhere.

by Graviscalar

5/22/2025 at 1:24:50 AM

Huge fan of the idea in fiction of Applied Topology is magic https://en.wikipedia.org/wiki/The_Laundry_Files

> Howard is recruited to work for the Q-Division of SOE, otherwise known as "the Laundry", the British government agency which deals with occult threats. "Magic" is described as being a branch of applied computation (mathematics), therefore computers and equations are just as useful, and perhaps more potent, than classic spellbooks, pentagrams, and sigils for the purpose of influencing ancient powers and opening gates to other dimensions.

by iFire

5/20/2025 at 4:22:30 PM

Isn't Deep Learning more like Graph Theory? I shared yesterday that Google published a paper called CRISP (https://arxiv.org/pdf/2505.11471) that carefully avoids any reference to the word "Graph".

So then the question becomes what's the difference between Graph Theory and Applied Topology? Graphs operate on discrete structures and topology is about a continuous space. Otherwise they're very closely related.

But the higher order bit is that AI/ML and Deep Learning in particular could do a better job of learning from and acknowledging prior art from related fields. Reusing older terminology instead of inventing new.

by adsharma

5/20/2025 at 4:40:18 PM

This whole article is just a nothingburger. Saying something is applied topology is only one step more advanced than saying something is maths - duh. These mathematical abstractions are incredibly general and and you can pretty much draw up anything in terms of anything, the challenging part is being able to turn around and use the model/abstraction to say things about the thing you're abstracting. I don't think scholars have been very successful in that regard, less so this article.

Yeah deep learning is applied topology, it's also applied geometry, and probably applied algebra and I wouldn't be surprised if it was also applied number theory.

by deepburner

5/21/2025 at 4:35:09 PM

This is an insightful read but at a place where a prompt churns out a whole model - that's yet far far away and extremely compute intensive.

Philosophical point being - if human brain is also all planes all the way down with trillion dimensional planes separating concepts.

I don't think so that is the case but yes - we might mimic something like that and such a thing will always malfunction more than their human counterparts in a given subject area and would always hit a wall where a human won't.

by wg0

5/20/2025 at 2:16:20 PM

Ok, how do transformers fit into this understanding of deep learning?

by polotics

5/20/2025 at 3:19:07 PM

Transformers learn embedding representations of tokens, which are easily mapped into a space. Similar tokens are mapped to similar places on the space. The fully connected layer at the end of each transformer block defines a transformation of a set of points in a space to another point in that space, not unlike the example of adding colors together to get a new color

by theahura

5/20/2025 at 3:09:19 PM

Transformers don't feel differentiable (because of the attention mechanism), but they actually are (as being back-propagation based forces it to be).

The attention mechanism is not a stretching of the manifold, but is trained to be able to measure distances in the manifold surface, which is stretched and deformed (or transformed?) in the feed-forward layers.

by motoboi

5/20/2025 at 4:19:02 PM

Transformers (with self-attention being the key operation) are kernel smoothers which fits easily into this view of the world. See here: http://bactra.org/notebooks/nn-attention-and-transformers.ht...

by jebarker

5/20/2025 at 2:34:49 PM

Speaking of: https://www2.math.upenn.edu/~ghrist/notes.html

by Koshkin

5/20/2025 at 6:31:38 PM

Applied topology. Might a Klein bottle actually be a useful?

by CliffStoll

5/20/2025 at 9:40:32 PM

It's a bit different than what's discussed here, but color-contrast detectors in neural networks can be thought of as forming a Klein bottle: https://distill.pub/2020/circuits/equivariance/#hue-rotation...

(This is, in some sense, for similar reason to Gunnar Carlson et al finding a Klein bottle when looking at high-contrast image patches, except one level more abstract, since it's about features rather than data points.)

by colah3

5/21/2025 at 2:19:48 AM

Aren't manifolds generally task-dependent?

I've been debating whether the data lies on a manifold, or whether the data attributes that are task-relevant (and of our interest) lie on a manifold?

I suspect it is the latter, but I've seen Platonic Representation Hypothesis that seems to hint it is the former.

by paraschopra

5/20/2025 at 2:50:58 PM

Isn't it more differential geometry?

by maxiepoo

5/20/2025 at 3:39:36 PM

Interesting read. Reminded me of the Trinity 3D manifold visualization tool which (among other things) let's you explore the hyperspace of neural networks: https://github.com/trinity-xai/Trinity

by crgi

5/21/2025 at 1:01:40 AM

One of the most successful quant traders in history, Jim Simons studied topology in the 60s. It’s rumored he used deep neural networks in his trading before they were cool. This post really brought the two together for me in a way I didn’t understand before.

by strimp099

5/20/2025 at 5:57:26 PM

On the TDA in general, I found this series of articles by Brandon Brown helpful:

https://github.com/outlace/outlace.github.io

by Koshkin

5/20/2025 at 4:32:10 PM

> This has been enough to get us to AGI.1

Hard disagree.

by kookamamie

5/20/2025 at 2:21:54 PM

Just because manifold looks a bit like burrito if you squint doesn't mean it is a burrito.

by mirekrusin

5/20/2025 at 2:42:52 PM

What if you don't have to squint very much?

by ComplexSystems

5/20/2025 at 7:34:31 PM

Same amount of squinting needed as for claim that deep learning is just a bunch of matrices. Or 0s and 1s. Cool. Now what?

by mirekrusin

5/21/2025 at 9:29:07 AM

The deep learning you are talking about is better understood as a regression.

by xchip

5/20/2025 at 2:28:51 PM

Interesting read. This seems hard to prove:

"Everything lives on a manifold"

by fedeb95

5/20/2025 at 3:50:12 PM

To a topologist, everything is topology.

by mbowcut2

5/20/2025 at 4:35:17 PM

To a man with only a hammer, everything looks like a nail.

by kookamamie

5/20/2025 at 5:31:45 PM

So ml is about drawing the line.

by revskill

5/20/2025 at 4:39:47 PM

Cool post! Thanks

by khoangothe

5/21/2025 at 3:54:37 PM

[dead]

by techlatest_net

5/20/2025 at 2:20:47 PM

[flagged]

by amelius

5/20/2025 at 3:00:57 PM

Life is just applied category theory

by adverbly

5/20/2025 at 2:28:34 PM

[flagged]

by Koshkin

5/20/2025 at 4:11:09 PM

[flagged]

by k__

5/20/2025 at 11:32:42 PM

[flagged]

by fasdfdsa