alt.hn

3/28/2026 at 10:42:23 PM

From 300KB to 69KB per Token: How LLM Architectures Solve the KV Cache Problem

https://news.future-shock.ai/the-weight-of-remembering/

by future-shock-ai

3/31/2026 at 7:51:25 PM

There are also interesting approaches that more directly compress a large document or an entire codebase into a smaller set of tokens, rather than having the LLM wing it. For example, Cartridges: <https://hazyresearch.stanford.edu/blog/2025-06-08-cartridges>

They essentially use gradient descent to optimize the KV cache directly while keeping the network weights frozen.
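A toy sketch of that idea, assuming nothing about the actual Cartridges code: single-head numpy attention, finite-difference gradients standing in for real autograd, and a small trainable "cache" of key/value vectors fitted so that attention over the compressed cache reproduces attention over the full one. All names and shapes here are illustrative.

```python
import numpy as np

def attention(q, K, V):
    # single-query softmax attention (the frozen "model")
    scores = q @ K.T / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V

rng = np.random.default_rng(0)
d = 4
# "full" cache: 16 entries standing in for a long document
K_full, V_full = rng.normal(size=(16, d)), rng.normal(size=(16, d))
queries = rng.normal(size=(8, d))
targets = np.stack([attention(q, K_full, V_full) for q in queries])

# compressed cache: 4 trainable entries; model weights stay untouched
K_c, V_c = rng.normal(size=(4, d)), rng.normal(size=(4, d))

def loss(Kc, Vc):
    outs = np.stack([attention(q, Kc, Vc) for q in queries])
    return float(((outs - targets) ** 2).mean())

def grad(f, x, eps=1e-4):
    # finite-difference gradient, an autograd stand-in for a toy demo
    g = np.zeros_like(x)
    for i in np.ndindex(x.shape):
        xp, xm = x.copy(), x.copy()
        xp[i] += eps
        xm[i] -= eps
        g[i] = (f(xp) - f(xm)) / (2 * eps)
    return g

loss0 = loss(K_c, V_c)
lr = 0.1
for _ in range(200):
    K_c = K_c - lr * grad(lambda K: loss(K, V_c), K_c)
    V_c = V_c - lr * grad(lambda V: loss(K_c, V), V_c)

print(f"loss before: {loss0:.4f}, after: {loss(K_c, V_c):.4f}")
```

The point is only that the cache entries are the optimization variables: gradient descent squeezes the information from 16 KV pairs into 4, with the attention mechanism itself left frozen.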

by coppsilgold

3/31/2026 at 6:09:19 PM

Unrelated, but 69KB is how much RAM Voyager 1 has.

by az09mugen

3/31/2026 at 7:21:35 PM

Voyager as a token of curiosity

by gregman1

3/31/2026 at 7:29:28 PM

Good overview of the architecture side, but worth mentioning there's another axis that stacks on top of all of this: you can quantize the KV cache itself at inference time. In llama.cpp you can run Q8 for keys and Q4 for values, which cuts cache memory roughly in half again on top of whatever GQA or MLA already saves you. I run Qwen 70B 4-bit on an M2 Max with 96GB, and KV quantization is what actually made longer contexts fit without running out of unified memory. Keys need more precision because they drive the attention scores, but values are far more tolerant of lossy compression, so the asymmetry works out.
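Rough arithmetic for why the mixed K/V setup roughly halves the cache again. The bits-per-element figures follow llama.cpp's block formats (Q8_0 and Q4_0 each carry one fp16 scale per 32-element block, hence the extra 0.5 bits); the model shape below is an assumed 70B-class GQA config, not any specific model's exact numbers:

```python
# Back-of-envelope KV cache size per token.
# Assumed 70B-class GQA shape (illustrative, not a real model card):
n_layers   = 80
n_kv_heads = 8
head_dim   = 128
elems = n_layers * n_kv_heads * head_dim  # per token, per K or per V

def kib(bits_per_elem):
    # convert elems at a given precision into KiB
    return elems * bits_per_elem / 8 / 1024

fp16  = kib(16) + kib(16)     # K + V both fp16
mixed = kib(8.5) + kib(4.5)   # Q8_0 keys + Q4_0 values (scale overhead included)
print(f"fp16:  {fp16:.0f} KiB/token")
print(f"mixed: {mixed:.0f} KiB/token  ({fp16 / mixed:.1f}x smaller)")
```

With these assumed dimensions, fp16 comes out near the ~300KB-per-token figure in the article title, and the Q8/Q4 mix lands around a 2.5x reduction on top of whatever GQA already bought.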

by LuxBennu

3/31/2026 at 7:47:26 PM

Some models really suffer badly from KV quantisation. You can also take a speed hit when using dissimilar K and V types.

TurboQuant seems to be the next big thing in context memory usage: it uses polar coordinates to achieve a ~5x reduction in memory with minimal to no quality loss, and even a slight speedup in some cases.

by suprjami