alt.hn

3/29/2026 at 8:18:55 AM

What if AI doesn't need more RAM but better math?

https://adlrocha.substack.com/p/adlrocha-what-if-ai-doesnt-need-more

by adlrocha

3/29/2026 at 1:54:39 PM

"The TurboQuant paper (ICLR 2026) contains serious issues in how it describes RaBitQ, including incorrect technical claims and misleading theory/experiment comparisons.

We flagged these issues to the authors before submission. They acknowledged them, but chose not to fix them. The paper was later accepted and widely promoted by Google, reaching tens of millions of views.

We’re speaking up now because once a misleading narrative spreads, it becomes much harder to correct. We’ve written a public comment on OpenReview (https://openreview.net/forum?id=tO3ASKZlok ).

We would greatly appreciate your attention and help in sharing it."

https://x.com/gaoj0017/status/2037532673812443214

by imjonse

3/29/2026 at 5:31:50 PM

I guess I'm trying to understand. I'm hearing this paper has been around for a year -- I would think that many companies would have already implemented and measured its performance in production by now... is that not the case?

by zug_zug

3/29/2026 at 8:23:33 PM

Okay, I spent about half an hour reading about this and asking Gemini. My best understanding is this:

The main breakthrough [rotating by an orthogonal matrix so that important outliers are averaged across more dimensions] comes from RaBitQ. It sounds like the RaBitQ team was much more involved, and earlier, and the TurboQuant paper very deliberately tries to avoid crediting and acknowledging RaBitQ.

My understanding is that the efficacy of these methods isn't in dispute. What TurboQuant did was take the method already used in vector databases, adapt it for transformers, and pass it off more as a new invention than an adaptation.
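To make the rotation trick concrete, here is a toy numerical sketch (illustrative only, not the actual RaBitQ or TurboQuant algorithm): a random orthogonal rotation spreads a single outlier's energy across every coordinate, so a crude shared-scale sign quantizer loses far less information.

```python
import numpy as np

rng = np.random.default_rng(0)

# A vector with one large outlier coordinate -- hard to quantize directly,
# because a single per-vector scale must stretch to cover the outlier.
x = np.zeros(64)
x[0] = 10.0

# Random orthogonal matrix via QR decomposition of a Gaussian matrix.
Q, _ = np.linalg.qr(rng.normal(size=(64, 64)))
y = Q @ x  # same norm as x, but the outlier's energy is spread across all dims

print(x.max() / np.abs(x).mean())  # very peaky before rotation
print(y.max() / np.abs(y).mean())  # much flatter after rotation

# Crude 1-bit quantization (sign + one shared scale), then invert the rotation.
scale = np.abs(y).mean()
x_hat = Q.T @ (np.sign(y) * scale)
print(np.linalg.norm(x - x_hat))  # bounded error, because y was "flat"
```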

by zug_zug

3/29/2026 at 12:22:49 PM

> applying this compression algorithm at scale may significantly relax the memory bottleneck issue.

I don’t think they’re going to downsize though, I think the big players are just going to use the freed up memory for more workflows or larger models because the big players want to scale up. It’s a cat and mouse race for the best models.

by konaraddi

3/29/2026 at 1:51:42 PM

It will also help with local inference, making AI without big players possible.

by miohtama

3/29/2026 at 4:24:55 PM

It's already possible. Post-training is vastly more important than model size. (There's bigtime diminishing returns with increasing model size.)

by otabdeveloper4

3/29/2026 at 4:59:15 PM

Is there a size cutoff you would say where diminishing returns really kick in?

My experience doesn't disagree, at least. I've been using Qwen for coding locally a bit. It is much better than I thought it would be. But also still falls short in some obvious ways compared to the frontiers.

by plagiarist

3/30/2026 at 7:31:15 AM

> Is there a size cutoff you would say where diminishing returns really kick in?

No idea yet. But also it's obvious that making LLMs without MoE is stupid.

by otabdeveloper4

3/29/2026 at 12:46:31 PM

Known in the business as 'pulling a jevons'

by Verdex

3/29/2026 at 12:39:31 PM

The drop in memory stocks seems counterintuitive to me.

The demand for memory isn't going to go down, we'll just be able to do more with the same amount of memory.

by mustyoshi

3/29/2026 at 8:24:47 PM

Well, when companies have hundred-billion-dollar incentives to make discoveries like this, I don't know if we should assume this is the only optimization that will happen.

Given that increasing model size doesn't yield proportional increases in intelligence, there is a world where these datacenters don't have a positive ROI if we make these models even a fraction as effective as the human brain.

by zug_zug

3/29/2026 at 5:09:27 PM

It especially doesn't make sense considering that TurboQuant has been public on arXiv for almost a year: https://arxiv.org/abs/2504.19874 So it predates the late-2025 RAM price surge! https://pcpartpicker.com/trends/price/memory/

I think that either investors were extremely skittish that the stocks might crash and jumped at the first sign of trouble (creating a self-fulfilling prophecy) or they were trading on non-public information and analysts who don't have access to said information are reading too much into the temporal coincidence of the Google Research blog highlighting this paper.

by yorwba

3/30/2026 at 6:27:19 AM

Well, considering basically the entire market was down these past few days, Google included, it's unlikely attributable to this paper alone. It's most likely correlated with general war/trade-route restrictions/potential recession fears, or at least more correlated with those than with this paper.

This paper was released a year ago and was probably part of how Google got to 1M context before other labs.

by boshalfoshal

3/29/2026 at 4:18:09 PM

The stock drop isn't about demand volume, it's about pricing power. HBM vendors have been charging huge premiums because AI buyers had no alternative to buying more memory. A 6x compression result means per-GB willingness to pay drops even if total shipments hold. Flat volume at lower margins is a worse business than growing volume at premium margins.

by clawfund

3/29/2026 at 4:56:08 PM

It could also reduce the total cost of AI to the point where it becomes feasible for more tasks, increasing demand, in case Jevons kicks in.

by aljgz

3/29/2026 at 10:24:48 AM

Despite the shortage, RAM is still cheaper than mathematicians.

by fph

3/29/2026 at 12:51:40 PM

It's also less frustrating to organize worldwide RAM production and logistics than to deal with a single mathematician.

Constantly sitting around trying to solve problems that nobody has made headway on for hundreds of years. Or inventing theorems around 15th century mysticism that won't be applicable for hundreds of years.

Now if you'll excuse me I need to multiply some numbers by 3 and divide them by 2 ... I'm so close guys.

by Verdex

3/29/2026 at 1:13:41 PM

The comment feels a bit like Verdex may have dated a mathematician at some point and it went sour.

by Eddy_Viscosity2

3/29/2026 at 11:57:42 AM

I don't know, I think if you weighed up the costs of AI related datacentre spend vs. the average mathematics academic's salary you could come to a different conclusion.

by captainbland

3/29/2026 at 9:06:48 PM

Raising, nurturing, training, and mentoring an expert mathematician is not cheap; it never was. This is perhaps the first time in history we may witness that rule change: spinning up a bunch of math-savvy agents, each maybe smarter than Ramanujan, could get very cheap.

by iLemming

3/30/2026 at 12:10:03 PM

You don't have to raise them; someone already did. You just have to hire them.

by high_na_euv

3/30/2026 at 6:30:43 PM

You're oversimplifying the message I'm trying to convey. "you just hire them, someone already raised them" - treats mathematicians as a commodity stock rather than a flow. The conversation frames it as "mathematicians vs. RAM" - a cost comparison. But that's like comparing the cost of a GPS unit vs. a ship captain. The captain isn't expensive because they can calculate routes; they're expensive because they know when the route is wrong. AI makes the math cheaper but makes the mathematician more valuable, at least until true AGI genuinely surpasses human mathematical creativity - at which point we have much bigger economic questions than mathematician salaries.

The topic in itself is quite interesting, and far more complex than simple supply/demand. Even before AI, there both was and wasn't a shortage of mathematicians. Academic pure mathematics: there's a glut. High school teachers: the people exist, but they won't work for teacher salaries. Applied math (quant finance, ML research, cryptography, pharmaceutical modeling): acute shortage; we don't have enough. The NSA has always struggled to hire, since private-sector salaries pull people away. Interdisciplinary work (mathematical biology, climate modeling, materials science), domains where math is the bottleneck but the job title isn't really "mathematician": acute shortage.

by iLemming

3/29/2026 at 12:46:39 PM

Doubt it. You have to pay these mathematicians once and then you can deploy to millions of sites.

by _fizz_buzz_

3/29/2026 at 12:32:33 PM

But not everyone has to pay mathematicians, like RAM :-)

by mandeepj

3/29/2026 at 12:31:38 PM

At the same time, processing is much cheaper than memory

by Almondsetat

3/29/2026 at 4:51:34 PM

Without memory you have no data to compute on. Memory and compute scaling only makes sense in tandem.

by gunalx

3/29/2026 at 10:47:47 AM

[dead]

by 3yr-i-frew-up

3/29/2026 at 10:50:05 AM

The same could be said about other IT domains... When you see single web pages that weigh tens of MB, you wonder how we came to this.

by abdelhousni

3/29/2026 at 12:32:59 PM

Detachment from reality. Code elegance is more important than anything else. As simple as that.

by Yokohiii

3/30/2026 at 9:46:31 PM

You've never seen the sources for such pages I presume?

by LtWorf

3/29/2026 at 5:19:10 PM

> The obvious one outside of KV caches as mentioned above is vector databases. Any RAG pipeline that stores embedding vectors for retrieval benefits from the same compression. TurboQuant reduces indexing time to “virtually zero” on vector search tasks and outperforms product quantisation and RaBitQ on recall benchmarks using GloVe vectors.

This part sounds especially cool. I did not think about this application when reading the other articles about TurboQuant. It would be cool to have access to this performance optimization for local RAG.

by mxmlnkn

3/29/2026 at 9:02:35 AM

We will not see memory demand decrease because this will simply allow AI companies to run more instances. They still want an infinite amount of memory at the moment, no matter how AI improves.

by LoganDark

3/29/2026 at 9:43:35 AM

If models become more efficient we will move more of the work to local devices instead of using SaaS models. We’re still in the mainframe era of LLM.

by jurgenburgen

3/29/2026 at 2:21:49 PM

We moved from the mainframe era to desktops and smaller servers because computers got fast enough to do what we needed them to do locally. Centralized computing resources are still vastly more powerful than what's under your desk or in a laptop, but it doesn't matter because people generally don't need that much power for their daily tasks.

The problem with AI is that it's not obvious what the upper limit of capability demand might be. And until or if we get there, there will always be demand for the more capable models that run on centralized computing resources. Even if at some point I'm able to run a model on my local desktop that's equivalent to current Claude Opus, if what Anthropic is offering as a service is significantly better in a way that matters to my use case, I will still want to use the SaaS one.

by rainsford

3/29/2026 at 2:53:31 PM

> Even if at some point I'm able to run a model on my local desktop that's equivalent to current Claude Opus, if what Anthropic is offering as a service is significantly better in a way that matters to my use case, I will still want to use the SaaS one.

Only if it's competitively priced. You wouldn't want to use the SaaS if the breakeven in investment on local instances is a matter of months.

Right now people are shelling out for Claude Code and similar because for $200/m they can consume $10k/m of tokens. If you were actually paying $10k/m, then it makes sense to splurge $20k-$30k on a local instance.
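That breakeven arithmetic, sketched with the comment's own illustrative figures (not real prices):

```python
# Illustrative numbers from the comment above, not actual pricing.
subscription_per_month = 200        # what a heavy user pays today
true_token_cost_per_month = 10_000  # what those tokens would cost unsubsidized
local_rig_cost = 25_000             # midpoint of the $20k-$30k guess

# If the subsidy ended and you paid full price, a local rig breaks even in:
months_unsubsidized = local_rig_cost / true_token_cost_per_month
print(months_unsubsidized)  # 2.5 months

# At today's subsidized price, the same rig takes over a decade to pay off:
months_subsidized = local_rig_cost / subscription_per_month
print(months_subsidized)  # 125.0 months
```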

by lelanthran

3/29/2026 at 3:09:25 PM

The underlying advantage of local inference is that you're repurposing your existing hardware for free. You don't need your token spend to pay a share of the capex cost for datacenters that are large enough to draw gigawatts in power, you can just pay for your own energy use. Even though the raw energy cost per operation will probably be higher for local inference, the overall savings in hardware costs can still be quite real.

by zozbot234

3/29/2026 at 11:22:38 AM

The hyperscalers do not want us running models at the edge and they will spend infinite amounts of circular fake money to ensure hardware remains prohibitively expensive forever.

by throwatdem12311

3/29/2026 at 11:46:18 AM

> of circular fake money

Oh, it gets worse than that. The money that caused all of this, from OpenAI, was borrowed from Japanese banks at cheap interest rates (by SoftBank, for the Stargate project), and the Japanese banks can do it because of Japanese people and Japanese companies; the collateral is stock whose value is inflated by people who invest their hard-earned money into the markets.

So in a way they are using real, hard-earned money to fund all of this; they are basically using your money to attack you behind your back.

I once wrote a really long comment about the shaky finances of Stargate; I feel like sharing it here: https://news.ycombinator.com/item?id=47297428

by Imustaskforhelp

3/29/2026 at 7:32:51 PM

What is the difference between "hard earned" and not?

by joquarky

3/29/2026 at 8:55:10 PM

Well, cartel money for example. It depends on the definition of hard-earned, but I don't quite imagine, say, the Japanese Yakuza depositing into banks or stock markets; I am not sure, but I imagine something like gold or cash being used.

Maybe you can argue that the Yakuza are making hard-earned money, but IMO they are doing illegal activities and something much closer to extortion.

Ironically, what AI did is, in a sense, also extortion.

One is just legal (barely; I am not even sure how or why), the other isn't. That was what I intended to highlight when I said hard-earned money.

by Imustaskforhelp

3/29/2026 at 11:54:48 AM

> they will spend infinite amounts of circular fake money [...] forever

If that's the plan (there is no plan) then it expires at some point, because it's a spiral and such spirals always bottom out.

by topspin

3/29/2026 at 12:17:45 PM

And when that happens people STILL won’t be able to afford the hardware.

by throwatdem12311

3/29/2026 at 2:57:18 PM

> And when that happens people STILL won’t be able to afford the hardware.

Of course they will - if that happens all these AI token providers won't have a use for all that hardware they bought. You'll be buying used H100s and H200s off eBay for pennies on the dollar.

by lelanthran

3/29/2026 at 4:07:00 PM

No they won’t; they’re just going to get absorbed into Azure and AWS and used for generic GPU compute that you rent until they’re burned-out trash.

by throwatdem12311

3/30/2026 at 4:08:50 PM

Then those datacenters will barely need any new GPUs, so the companies making them will be desperate to get gamers to buy cards and set very competitive prices.

by Dylan16807

3/29/2026 at 12:42:40 PM

> and they will spend infinite amounts of circular fake money to ensure hardware remains prohibitively expensive forever.

That's ridiculous, "infinite money" isn't a thing. They will spend as much as they can not because they want to keep local solutions out, but because it enables them to provide cheaper services and capture more of the market. We all eventually benefit from that.

by naasking

3/29/2026 at 3:01:55 PM

> That's ridiculous, "infinite money" isn't a thing.

My reading of GP is that he was being sarcastic - "infinite amounts of circular fake money" is probably a reference to these circular deals going on.

If A hands B investment of $100, then B hands A $100 for purchase of hardware, A's equity in B, on paper, is $100, plus A has revenue of $100 (from B), which gives A total assets of $200.

Obviously it has to be shuffled more thoroughly, but that's the basic idea that I thought GP was referring to.
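The round trip can be written out as a toy ledger (purely illustrative):

```python
# Toy ledger for the circular deal described above: A invests $100 in B,
# and B immediately spends that $100 buying hardware from A.
a_cash, a_equity_in_b, a_revenue = 100, 0, 0
b_cash = 0

# Step 1: A invests in B.
a_cash -= 100
a_equity_in_b += 100  # A books $100 of equity in B
b_cash += 100

# Step 2: B buys $100 of hardware from A.
b_cash -= 100
a_cash += 100
a_revenue += 100      # A books $100 of revenue

# A's cash is back where it started, yet on paper A now shows $100 of
# equity plus $100 of revenue -- $200 of "assets" from a round trip.
print(a_cash, a_equity_in_b, a_revenue, b_cash)  # 100 100 100 0
```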

by lelanthran

3/29/2026 at 2:18:54 PM

Cheaper for who? For them maybe but certainly not for you or me.

by throwatdem12311

3/29/2026 at 9:59:21 AM

I don't think we are there yet. Models running in data centers will still be noticeably better as efficiency will allow them to build and run better models.

Not many people today would want models comparable to what was SOTA two years ago.

To run models locally with results as good as the models running in data centers, we need both efficiency gains and a wall in AI improvement.

Neither of those two conditions seems likely to become true in the near future.

by DeathArrow

3/29/2026 at 1:17:20 PM

As I understand this advancement, this doesn't let you run bigger models, it lets you maintain more chat context. So Anthropic and OpenAI won't need as much hardware running inference to serve their users, but it doesn't do much to make bigger models work on smaller hardware.

Though I'm not an expert, maybe my understanding of the memory allocation is wrong.

by delecti

3/29/2026 at 2:10:11 PM

Seems to me if the model and the KV cache are competing for the same pool of memory, then massively compressing the cache necessarily means more RAM available for a larger model (if it fits), no?

by dd8601fn

3/29/2026 at 2:22:35 PM

Yes, but the context is a comparatively smaller part of how much memory is used when running it locally for a single user, vs when running it on a server for public... serving.

by delecti

3/29/2026 at 12:42:10 PM

I don't see how we'll ever get to widespread local LLM.

The power efficiency alone is a strong enough pressure to use centralized model providers.

My 3090 running 24b or 32b models is fun, but I know I'm paying way more per token in electricity, on top of lower quality tokens.

It's fun to run them locally, but for anything actually useful it's cheaper to just pay API prices currently.

by mustyoshi

3/29/2026 at 4:19:35 PM

AI is not cheap to run no matter where it is running. The price we get charged today for AI is a loss-leader. The actual cost is much higher, so much higher that the average paying user today would balk at what it actually costs to run. These AI companies are trying to get people hooked on their product, to get it integrated into every business and workflow that they can, then start raising prices.

by leptons

3/29/2026 at 12:59:22 PM

Until you put up your solar and then power is almost free...

by singpolyma3

3/29/2026 at 1:36:21 PM

The amortised cost including the panels and labour is nowhere near "almost free".

by vidarh

3/29/2026 at 4:24:31 PM

It is over a couple of years

by boredatoms

3/29/2026 at 8:34:36 PM

Even if you live somewhere where it does, that is not remotely "almost free", and lots of places the payback period is more in the range of 10-15 years even with subsidies.

by vidarh

3/29/2026 at 11:37:59 AM

> If models become more efficient

Then we can make them even bigger.

by Ray20

3/29/2026 at 11:41:44 AM

> Then we can make them even bigger.

But what if it becomes "good enough", that for most intents and purposes, small models can be "good enough"

There are some people here/on r/localllama who I have seen run some small models and sometimes even run multiple of them to solve/iterate quickly and have a larger model plug into it and fix anything remaining.

This would still mean that larger/SOTA models have some demand, but I don't think the demand would be nearly as large as people think. We all still kind of feel like there are different models which are good for different tasks, and a good recommendation is to benchmark different models for your own use cases; sometimes there are small models that are good within your particular domain and worth having in your toolset.

by Imustaskforhelp

3/29/2026 at 12:44:00 PM

> But what if it becomes "good enough", that for most intents and purposes, small models can be "good enough"

It's simple: then we'll make our intents and purposes bigger.

by Ray20

3/29/2026 at 12:33:56 PM

Because the true goal is AGI, not just nice little tools to solve subsets of problems. The first company which can achieve human level intelligence will just be able to self-improve at such a rate as to create a gigantic moat

by Almondsetat

3/30/2026 at 4:15:07 PM

There's no particular reason to assume a human level AI would be able to improve itself any better than the thousands of human level humans that designed it.

by Dylan16807

3/30/2026 at 5:54:17 PM

Sure, but: that single human with the intelligence of a top-tier engineer or scientist will have immediate access to all human knowledge. Plus, what do you think happens the moment it optimizes itself to run in 2, 4, 8, 16, etc. parallel instances?

by Almondsetat

3/30/2026 at 6:50:55 PM

Well, A) "top tier engineer/scientist" is a significant step above generic human, B) the human engineers/scientists also have immediate access to the same database, C) The humans have been optimizing it for even longer, so what makes us think the AI can optimize itself even a couple percent?

For example, if the number of AIs you can run per petaflop started to scale with the cube root of researcher-years, then even if your researcher AIs are quite fast and you can double your density in a couple years, hitting 5x will take a decade and hitting 10x will approach half a century.

by Dylan16807

3/29/2026 at 1:00:31 PM

> The first company which can achieve human level intelligence will just be able to...

They say prostitution is the oldest industry of all. We know how to achieve human-level intelligence quite well. The outstanding challenge is figuring out how to produce an energy efficient human-level intelligence.

by 9rx

3/29/2026 at 8:05:58 PM

But what about The Jevons Paradox?

by acuozzo

3/29/2026 at 11:04:39 AM

[flagged]

by ssyhape

3/29/2026 at 2:20:33 PM

MoE feels a lot more like engineering to me. You're routing around the problem rather than actually solving it. The real math gains are things like quantization schemes that change how information is actually represented. Whether that distinction matters long term probably will depend on whether we hit a capability wall first or an efficiency ceiling first.

by lucasfin000

3/29/2026 at 2:16:15 PM

I'm not sure that's infinitely true as long as AI costs to the user are proportional to the cost it takes to run the model. Even if user costs are heavily subsidized by investment, as long as they are non-zero and go up when models cost more, there will be at least some pressure for cheaper models and not just more capable ones and that pressure will go up with costs. AI is a crazy industry, but it's not totally immune to the law of supply and demand.

The real question though is how close are we to the point where the pressure is more for efficiency rather than capability. Anecdotally I think it's a ways off. Right now the general vibe I get is that people feel AI is very impressive for how cheap it is to use, which suggests to me that a lot of users would be very willing to pay more for more capable models. So the tipping point where AI hardware demand might slow down seems a ways off.

by rainsford

3/30/2026 at 1:18:07 AM

It has yet to be seen when hyperscalers will change their tune.

by LoganDark

3/29/2026 at 9:53:51 AM

I disagree. I think a sharp drop in memory requirements of at least an order of magnitude will cause demand to adjust accordingly.

by redrove

3/29/2026 at 11:34:21 AM

Department of Transportation always thinks adding more lanes will reduce traffic.

It doesn't, it induces demand. Why? Because there's always too many people with cars who will fill those lanes.

by cyanydeez

3/29/2026 at 11:42:58 AM

Citation needed. I've heard this quite often, but so far, I haven't seen proof of the stated causality.

PS: This doesn't mean that better public transportation couldn't deliver more bang for the buck than the n-th additional car lane. But never ever have I heard anybody say they chose to buy a car, or to use an existing car more often, because an additional lane had been built.

by nkmnz

3/29/2026 at 1:47:35 PM

You've never heard anyone choose to take side streets instead of the highway because of traffic jams? No one ever goes out of their way to avoid heavily trafficked areas?

by cyanydeez

3/31/2026 at 3:52:41 PM

I don't understand the point you're trying to make. When people at t0 take detours because of traffic jams on the direct route, and then at t1 there are fewer traffic jams on the direct route due to additional lanes, so they decide to take the direct route, then total traffic is down, because they no longer take a detour. Even if they are still part of a newly induced traffic jam.

by nkmnz

3/29/2026 at 10:49:03 AM

[dead]

by 3yr-i-frew-up

3/30/2026 at 1:11:07 AM

There's a bunch of research showing that more/better information doesn't reliably improve judgement, but better feedback on your existing predictions does. Makes me think of Soros and his whole thing about reflexivity.

by convexly

3/29/2026 at 1:21:09 PM

Sure, we need better math; that much is obvious.

Unfortunately, nobody at the big companies knows exactly which math will win, so the competition won't end.

So researchers will try one solution, then another, and so on, until they find something that works, or until semiconductor production (Moore's Law) yields enough chips to run current models fast enough.

I believe somebody already has the silver bullet, the ideal AI algorithm that will lead us all to AGI once scaled at some big company, but that knowledge is not obvious at the moment.

by simne

3/29/2026 at 8:33:17 AM

This is one of the basic avenues for advancement.

Compute, bytes of ram used, bytes in model, bytes accessed per iteration, bytes of data used for training.

You can trade the balance if you can find another way to do things; extreme quantisation is but one direction to try. KANs were aiming for more compute and fewer parameters. The recent optimisation projects have been pushing at these various properties. Sometimes gains in one come at the cost of another, but that needn't always be the case.

by Lerc

3/29/2026 at 4:58:48 PM

There are techniques which already achieve great compression of the cache at 4 bit, eg using hadamard transforms. Going from 4 bit to 3 bit isn’t the great leap people expect this to be. It’s actually slower to run and is generally worse in practice.
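A toy sketch of that rotate-then-quantize idea (illustrative only, not any particular paper's implementation): a Hadamard rotation flattens outlier channels before a plain 4-bit uniform quantizer, shrinking the reconstruction error versus quantizing directly.

```python
import numpy as np

def hadamard(n):
    # Sylvester construction of an orthonormal n x n Hadamard matrix
    # (n must be a power of two).
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def quantize(v, bits):
    # Symmetric uniform quantization with one per-vector scale.
    levels = 2 ** (bits - 1) - 1
    scale = np.abs(v).max() / levels
    return np.round(v / scale) * scale

rng = np.random.default_rng(1)
d = 128
key = rng.normal(size=d)
key[:4] = [30.0, -25.0, 20.0, -15.0]  # outlier channels, as seen in KV activations

H = hadamard(d)
rotated = H @ key

# Quantize to 4 bits after the rotation, then rotate back.
recovered = H.T @ quantize(rotated, bits=4)
naive = quantize(key, bits=4)

err_rotated = np.linalg.norm(key - recovered)
err_naive = np.linalg.norm(key - naive)
print(err_rotated < err_naive)  # rotation spreads the outliers, shrinking error
```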

by am17an

3/29/2026 at 2:15:43 PM

I mean, since GPT-4, I don't believe RAM is still producing the miracle of LLM performance scaling directly with model size. At least ChatGPT itself convinced me that any decent-sized company can create a GPT-4 equivalent in terms of model size, limited instead by service options like memory cache and hallucination handling. Companies buy RAM simply to ride the stock hype.

I am no expert, so this is a shallow take, but I think the global LLM has already reached its limit, and general AGI could only be possible if it's living in the moment, i.e., retraining every minute or so, paired with a much smaller device that can observe the surroundings, like a robot.

Instead of a KV cache, I have an idea of using LoRAs instead: a central LLM unchanged by learning, surrounded by dozens or thousands of LoRAs made orthogonal to each other, each competing via weights to be trained every minute or so. The LLM, since it's an RNN anyway, provides a "summarize what your state and goal is at this moment" and trains the LoRAs with that summary along with all the observations and inputs from the users. The output of the LoRAs feeds back into the LLM, which decides the weights for further LoRA training.

Anyways, I am just thinking there needs to be a structure change of some kind.

by SphericalCowww

3/30/2026 at 7:18:29 AM

Re continuous fine-tuning: how do you avoid catastrophic forgetting in your proposal?

by fittingopposite

3/31/2026 at 9:21:48 AM

My understanding is that this is what the LoRAs are for; my belief is that they serve as "memory" of their live observations (a more NN-like cache, say), while the main LLM remains unchanged. The LoRAs are also weighted, so that LoRAs irrelevant to the current task are not trained, while relevant LoRAs are reinforced.

But I never built it, so I am not sure if such an emergent state will appear or not.

by SphericalCowww

3/29/2026 at 4:23:19 PM

share it on gh and make a show hn post about it, maybe you're right

the models are still very stupid atm something needs to change

by redanddead

3/29/2026 at 12:26:59 PM

I've thought for a while that the real gains now will come not from throwing more hardware at the problem, but from advances in mathematical techniques that make things far more efficient.

by alienbaby

3/29/2026 at 1:59:29 PM

I think the biggest issue isn’t the tool itself, but access and stability. I had more trouble finding reliable AI accounts than using them tbh

by PaddyLena

3/29/2026 at 1:54:31 PM

Is this something that will show up in Ollama any time soon to increase context size of local models?

by chr15m

3/29/2026 at 2:56:18 PM

KV quantization has long been available in llama.cpp

by zozbot234

3/29/2026 at 11:37:07 PM

Yes, but the optimisation described here hasn't, right?

by chr15m

3/29/2026 at 4:02:17 PM

The TurboQuant paper is from April 2025. I’m sure the major labs knew about it on, or even before, the day it was published. Any impact it had would have happened a year ago. Yet I keep seeing these posts and discussions completely ignoring this.

Can we please start talking about this in that context? We already know what TurboQuant will do to DRAM demand. We already know what it will do to context windows. There is no need to speculate. There is no need to panic-sell stocks.

by Skunkleton

3/29/2026 at 1:44:49 PM

I was thinking it needs speciality hardware. Sort of like how GPUs were born…

by exabrial

3/29/2026 at 1:13:07 PM

Does the KV cache really grow to use more memory than the model weights? The reduction in overall RAM relies on the KV cache being a substantial proportion of the memory usage but with very large models I can't see how that holds true.

by barbegal

3/29/2026 at 2:57:54 PM

For long context, yes this is at least plausible. And the latest models are reaching context lengths of 1M tokens or perhaps more.
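A back-of-the-envelope sizing sketch supports that; the architecture numbers below are illustrative (roughly 7B-class with dense multi-head attention), not any specific production model:

```python
# Rough KV cache sizing; all architecture numbers are illustrative.
layers = 32
kv_heads = 32        # dense MHA; grouped-query attention would shrink this
head_dim = 128
bytes_per_value = 2  # fp16

def kv_cache_bytes(seq_len, batch=1):
    # Keys and values (the factor of 2), per layer, per head, per token.
    return 2 * layers * kv_heads * head_dim * bytes_per_value * seq_len * batch

weights_bytes = 7e9 * bytes_per_value  # ~7B parameters in fp16: ~14 GB

for seq_len in (4_096, 128_000, 1_000_000):
    gb = kv_cache_bytes(seq_len) / 1e9
    print(f"{seq_len:>9} tokens -> {gb:7.1f} GB KV cache "
          f"({gb * 1e9 / weights_bytes:.1f}x the weights)")
```

At these illustrative numbers the cache costs about 0.5 MB per token and overtakes the weights around 27k tokens of context.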

by zozbot234

3/29/2026 at 1:43:45 PM

And maverick 2

by Bydgoszczo

3/29/2026 at 3:37:10 PM

this is exactly correct.

by effnorwood

3/29/2026 at 11:32:50 AM

Can we say something about the compression factor for pure knowledge of these models?

by amelius

3/29/2026 at 11:25:35 AM

Sigh. Don't make me tap the sign [1]

[1] http://www.incompleteideas.net/IncIdeas/BitterLesson.html

by tornikeo

3/29/2026 at 12:17:57 PM

Doesn't seem relevant here. TurboQuant isn't a domain-specific technique like the BL is talking about, it's a general optimisation for transformers that helps leverage computation more effectively.

by staminade

3/29/2026 at 2:29:10 PM

[dead]

by aaron695

3/29/2026 at 12:41:41 PM

> If I were Google, I wouldn’t release research that exposes a competitive advantage.

Isn't that a classic tit-for-tat decision, heading for a loss?

Excellence and prestige are valuable too. You get those expensive ML researchers at a small discount, plus public/professional goodwill, etc. Judging by Google's public communication, which isn't completely sociopathic, they know this war isn't won in one night, and they are the only sustainably funded company in the competition. Surely their business is at risk, but they can either go rampant or focus. They decided to focus.

by Yokohiii

3/29/2026 at 2:09:28 PM

why not, you know, just use LLMs to do this job ?

by signa11