Don't trust large context windows

6/14/2026 at 8:25:37 AM

I've been able to avoid context size issues by applying one simple constraint to my agent loop. What I do is prevent all tool calling in the user's top-level conversation thread. Anything that needs to tool call must happen in a recursive invoke of the agent, which returns whatever results to caller.

I can keep the same high level conversation going for an entire day over a million LOC+ codebase without ever hitting meaningful token limits. No compaction or summarization tricks needed. I can burn 50 million tokens in recursive calls and still not touch 100k tokens in my root conversation thread.

There is some rework needed to "bootstrap" the agent each time it has to descend back into Narnia, but this is still far more efficient than carrying around one big flat context that tries to cover everything all the time.

Recursion is very effective at controlling token use, but it can only go so far. I've not observed any uplift for recursive depth beyond 1. I have seen the agent attempt it a few times, but the practical performance is simply not there. External symbolic recursion does not appear to be something the frontier models have been trained for. They are fantastic at emulating recursion in context, but we don't want that if we are trying to achieve a reduction in token use.

by bob1029

6/14/2026 at 10:29:36 AM

I have a different way, but still trying to figure out how well it works. Instead of going into recursion, the agent is allowed to restart the thread by doing the summarize/debrief/reflect pass, writing key findings into persistent memory and rewriting the prompt whenever the context goes too large or it gets stuck. Recursion with TCO if you may.

In a way it's a generalization of the spec-driver approach, but in addition to the the formal spec the carryover buffer lives in the memory.

by Muromec

6/14/2026 at 11:00:38 AM

Kiro does this automatically from what I can tell using it

by stogot

6/14/2026 at 9:44:48 AM

This makes intuitive sense. Can I ask what harness you're using that allows you to configure the constraint and how?

by gbro3n

6/14/2026 at 10:22:56 AM

For anyone using Claude Code, ask it to do all the work in workflows (it has a tool for that), they released that feature together with Opus 4.8 and it also seems a bit better at doing long tasks as well. The main conversation just orchestrates the work at that point.

by KronisLV

6/14/2026 at 10:03:35 AM

This is interesting to me because reducing context & token usage is in the user's best interest but not in the financial interest of AI vendors. I am not an expert but it sounds like your "one simple trick" would fix context issues and allow much tighter control over token usage. Thanks for being willing to share this tip in an HN comment, changing how those in the know use AI agents going forward -- it's hard to keep up!

by password4321

6/14/2026 at 9:06:46 AM

How do you get the agent to stick to it without constantly rejecting tool calls with the same description? I've tried a similar setup a number of times and it tends to forget about this constraint very quickly.

by Etheryte

6/14/2026 at 9:31:35 AM

The tool itself enforces the constraint. This is deterministic. If an agent tries to read a big fat file in root, it gets an error from that tool's implementation that reiterates the requirement.

I don't bother warning it in the system prompt anymore. It's pointless. I let it bump its head as required. A few hundred tokens and the agent is back on track each time.

by bob1029

6/14/2026 at 9:20:15 AM

If the model isn't following the system/developer prompts easily, you might want to try a bigger/better model, tends to mostly be about model quality if it doesn't follow what you tell it to. Besides that, conflicting directions in the system/developer prompts can lead to the model seemingly ignoring instructions too.

by embedding-shape

6/14/2026 at 9:16:57 AM

Which tools? Even file reads and writes?

by WithinReason

6/14/2026 at 9:26:54 AM

Especially these things.

The only tools permissible to root in my scheme are call() and return().

by bob1029

6/14/2026 at 9:42:49 AM

Is it in pi.dev? Don't thinking tokens still take up context?

by WithinReason

6/14/2026 at 7:36:44 AM

This has not been my experience with Opus since Anthropic released the 1M token context window for use under the subscription plans. I routinely push past 500k tokens, even sometimes up to around 800k tokens, and don't see this problem. I've seen it to some extent when getting truly near the limit, up around and above 900k tokens, though what I see isn't as severe as the author seems to see.

(And I rarely fill the context window that far anyway when working on a single task, or a series of tasks that are related enough to warrant the same context; more typical is anywhere between 200k and 600k or so.)

I'm not saying that no one ever has this experience, but it's odd to me that some people see it so often that it warrants giving it a name.

by kelnos

6/14/2026 at 8:01:24 AM

I see this said often and find it insane given how many times I find opus models making basic recall mistakes at <100k tokens.

Personally I consider < 60k to be the smart zone for opus. This is worse for opus 4.7 and 4.8 cause of the more granular tokenizer

by Bolwin

6/14/2026 at 8:06:08 AM

60k is tiny, if it's making recall mistakes that early then you might have some false memories or incorrect instructions in your CLAUDE.md.

60k isn't much bigger than the system prompt.

by eterm

6/14/2026 at 8:16:12 AM

Yeah 60k is ludicrous, I've barely seeded the context at that point and I don't see context related degradation until well into the 600-700k.

by danielbln

6/14/2026 at 9:23:34 AM

> I've barely seeded the context at that point

I think that's issue, rather than 60K being small.

Most of the actual edits/changes I request to codex are solved within 100-150K tokens, beyond 200K I'd definitively try to restart the session as soon as I could as all models are horrible once you get across ~20% of the total context size. And this is while working on +million LOC codebases.

Problem I guess is that there is no solid and concrete evidence of this (to me [and others seemingly] obvious) degradation, but should be easy to prove, yet no one has time to sit down and show it :)

But the likelihood of a model getting minor details wrong once you're above some magical threshold between 15-20%, seems to skyrocket, and I hit that issue sufficient amount of times that now my workflow is trying to prevent that.

by embedding-shape

6/14/2026 at 9:31:31 AM

In this thread: People tossing coins independently and fighting over the result they got.

by qsera

6/14/2026 at 8:18:48 AM

>you might have some false memories or incorrect instructions in your CLAUDE.md

    "YOU'RE HOLDING IT WRONG!"

by da_grift_shift

6/14/2026 at 8:41:54 AM

did you internalize what was wrong with that quote when it was said? does it apply here?

by RugnirViking

6/14/2026 at 8:56:26 AM

[dead]

by perching_aix

6/14/2026 at 8:44:39 AM

I'm always a bit confused when people say things like this. 60k token is often more than the initial context I feed the model with. And I don't think I ever had a productive session that began under 150k tokens.

by CjHuber

6/14/2026 at 9:25:19 AM

Bit of what makes it so fun, our experiences seem to wildly differ! On one hand, you have experiences like yours, but then my own experience is that I never had a productive session when the scope grows beyond 150K tokens! If I needed 60K just as a starting context, I'd take that to mean the suggested change is way to large, and if the model cannot solve the entire thing within maybe 15-20% of the total context size, divide and conquer is needed otherwise there will be a lot of time wasted to patch things up when things are "completed".

by embedding-shape

6/14/2026 at 10:41:57 AM

Yeah indeed it's very interesting. And the 60k initial context don't even contain the suggested change yet. For me if I don't do this the current models tend to fixate and local patches instead of tracing symbols and making a holistic model of what a change interacts with in the codebase

by CjHuber

6/14/2026 at 8:12:22 AM

Not specific to Opus but yes it would make mistakes. I usually try to keep context window under 10%

by wg0

6/14/2026 at 8:13:27 AM

I hate to do the "you're holding it wrong" trope, but I think you might have something misconfigured somewhere unless you missed a 0, because just past 60k tokens is such a small context window to be seeing issue in.

Do you have any old documentation that it's picking up and referencing? If you set all claude settings back to default do you see the same issue?

by properbrew

6/14/2026 at 7:52:45 AM

Opus 4.6 was on drugs past 200k, I skipped 4.7, 4.8 did good up to ~350k, and Fable did great beyond 400k, in my limited testing. The quality does appear to be trending upwards.

by arcanemachiner

6/14/2026 at 10:34:26 AM

> Opus 4.6 was on drugs past 200k

Which drugs?

by throwaway314155

6/14/2026 at 11:02:04 AM

The way it hallucinates stuff, it'd probably be something in the LSD family. ;)

by justinclift

6/14/2026 at 7:51:31 AM

Thats another problem of this post, the author mentions Claude but not explicitely what models...

100k tokens "by lunch" is also not my finding, the newer models will hit that already right in the initial exploratory phase

by fullstackchris

6/14/2026 at 7:53:09 AM

Really depends on the project.

by arcanemachiner

6/14/2026 at 8:19:29 AM

I found "by lunch" odd too, but considering that Claude wrote the article, it's not going to know specifics.

by stavros

6/14/2026 at 7:45:23 AM

I’ve had similar experiences with Fable. 70%+ context used out of 1M, still sharp and no memory issues.

by asd88

6/14/2026 at 8:19:19 AM

I have a custom build command for a rust project (yarn build:lib) and my experience is 120k for GLM and roughly 200-300k for Opus. After that, they default to cargo build.

by csomar

6/14/2026 at 8:24:31 AM

My projects have specific build/verify steps as well, and after a certain point Claude forgets to run them. I’m going to try a “No brown M&Ms” hook to halt Claude if it tries to run the default command instead of the instructed commands from CLAUDE.md. Perhaps this will be a good signal that a compacted or fresh session is needed at that point to avoid mistakes.

by trapexit

6/14/2026 at 8:02:51 AM

As the gamblers say at the poker table: If you can't figure out who the mark is when you site down...

by cyanydeez

6/14/2026 at 8:38:50 AM

Opus in recent versions is fine beyond 100k, but I usually do try to keep it under 200k.

But, this is also why so-called "memory" systems are usually a mistake that make the models dumber. They don't have memory, they only have context, and every irrelevant fact you shove into the context is less context for the problem. Less distractions, better results.

The way to have the agent remember things is to have it document its work, like a human developer would do if they wanted their project to be friendly to other developers working on it. Good developer docs with an index page and a good plan with checklists, in concise Markdown files, checked in to the repo is the ideal memory for models and the ideal docs you need to figure out WTF the model has been up to. Helps with code review, too, whether by humans or another model. There's no down side.

by SwellJoe

6/14/2026 at 8:54:48 AM

At least for me, Opus keeps writing stuff to memories, only to consistently forget checking those memories before doing the same mistake again. This ("remember to check memories!") is of course then again written as a memory... Clearly not a very well working system, yep.

by endless1234

6/14/2026 at 11:05:20 AM

I've explicitly banned Opus from creating memories unprompted, as it would often save info that's incorrect and which would then be propagated to future sessions until caught. Ugh x 10.

by justinclift

6/14/2026 at 9:57:24 AM

Yeah, I see it write stuff to memory pretty regularly, maybe it works sometimes, but for things I want it to stop doing or always do, I make it impossible to do otherwise via lint or some style enforcement, or via a test that fails if code shows up that violates the constraint.

But, it does a good job following existing conventions in a codebase, as long as they're really consistent. So the more actively you enforce that consistency the more likely it is to do the right thing without memories or prompting.

I don't like "never do" or "always do" type rules in AGENTS.md or in memory, as it often over-interprets them and ties itself in knots trying to satisfy an impossible set of goals.

by SwellJoe

6/14/2026 at 9:05:01 AM

In my own multi agent framework I use cheap models to check the responses of the expensive models, as well as using multiple expensive models adversarially in debate. The cheap models are great at spotting eg the model getting stuck in the alternate between two broken ideas or not following code conventions or missing a step in the skill and so on. I’m currently working on making them detect user corrections and police that going forward to intervene when the expensive models forget the thing you just corrected them about etc.

by wood_spirit

6/14/2026 at 7:22:21 AM

I'm getting a lot of mileage out of basically acting like the AI's Product Manager, and insisting that it writes up short PRDs for every feature we propose to build. That gives it a reference over time of everything that has been built, but also makes it less liable to drift with each one. Each one gets its own conversation. For me this is a happy medium between stopping it going off the rails but also making sure it can reference past decisions when it needs to. The one thing I dislike about Pocock's method (not to use PRDs so much but to have an in depth discussion to get alignment) first is it wastes a lot of the best window on that initial back and forth.

by kristianc

6/14/2026 at 7:40:14 AM

Is it adhoc or you use more structured approaches like openspec? I also tend to work on a plan first, but it stays as in-session todo, which is hard to reference later.

by nopurpose

6/14/2026 at 7:48:02 AM

It's ad hoc / my own framework, just found something which works for me. The exact structure is

- Work Mode - HITL/AFK

- Problem Statement

- Who It Affects - Primary / Secondary User

- User Stories

- Business Case

- Why Now

- Success Critera

- In Scope/Out of Scope [Out of Scope v. important)

- Thinnest Slice (This I've found super valuable, means you max out the amount of 'product' for your buck and avoid diminishing marginal returns or overbuilding. Often I will build this)

- Eigenfeature - What is the larger feature we _could_ (but probably won't) which would solve for this use case and other stuff I might not have thought of

- Technical Notes

- Deps

- Schema Changes

- Risks

- Final Recommendation [go / no go, including on scope]

There's a note in my Claude / Agents MD which says no net new feature gets introduced without this and I get it to move through a pipeline of folders (active, approved, shipped, proposed etc). All runs in a system of MD files and have even created a little MD Kanban from the metadata!

by kristianc

6/14/2026 at 8:49:34 AM

I guess I've stumbled into something similar. Though I don't have a fixed format like yours. I first do a lot of back and forth to generate what I call a design document also includes rationales for various points or decisions. I use both Claude and Codex to iterate on this until I'm happy. The end result includes a lot of what you mention.

I then start a fresh conversation, make it analyze the design document and code, and for larger changes, generate a high-level implementation document which includes concrete phases or steps. I review this plan and iterate if necessary.

Then for each phase I make it generate a detailed plan for that phase and save it along side the other documents. Once the phase is over, I make it write a summary of what was done, decisions made and reasons for it. And typically a good point to compact the model's context.

These documents gives additional context for when I make another model do code review, and help illuminate drift or gaps from the main design document.

by magicalhippo

6/14/2026 at 10:25:50 AM

I found myself in a similar workflow. Depending on the task at hand (starting a new project, enhancement, maintenance), I let the agent create/read the markdown files that I keep updated (AGENT, STATE, ROADMAP, DESIGN, ARCHITECTURE, (CODESTYLE if I plan to modi it myself)). Then I choose the various roles that I need in this session and and have a planning phase. After that, the agent is starting implement the changes and I have a manual correction phase.

This flow works for my needs, building idea demos, prototypes or tools for my own sake. I don't let agent code in our main code base where everything is still hand tailored. That's a conscious decision.

I noticed that the cheaper models (flash, ...) are quite hard to hold back changing files. A question for possible options sometimes results in "yes, I'll go with option A" without asking back. Frontier models on the other hand love to plan and ask you deliberately for your consent.

I use pi.dev with almost no skills at all to understand how models really work and "feel" to work with.

by SeriousM

6/14/2026 at 8:22:27 AM

Is there back-and-forth? How long do these get? Can you share an example?

by da_grift_shift

6/14/2026 at 9:43:55 AM

I'm actually doing a big refactoring in a project where if everything gets loaded (code / docs), the context gets like 750k filled (Opus 4.8), and then the agent has the remaining ~200k to do actual coding, until I have to reset. I haven't finished the work but I'm like 80% there, and it seems the progress is good and the quality is also good, verified by doing some performance tests and a lot of comparisons between outputs between the original code and the new one.

Maybe I could achieve better and quicker results with keeping the context in the proper zone, but trying it will have to wait until the next project.

by faeyanpiraat

6/14/2026 at 10:48:02 AM

Can anybody explain me why just not limit the context window to something smaller instead of all that context engineering? It forces things to be constrained.

by amunozo

6/14/2026 at 7:37:27 AM

Considerations about what goes on in agents internally will probably not be part of software development for long.

Personally, I already see LLMs and agents as blackboxes. I give each feature request to multiple LLMs and then compare the results. I don't manually use "sessions" at all. I just look at the outcome. When I dislike it, I "git reset --hard", change my prompts and restart the feature request.

To have an ongoing sense of which agents perform best, I keep a log and calculate an ELO score of which agents meet my demands best. This score is imporant to me, not so much how the agent achieves it.

by mg

6/14/2026 at 7:45:17 AM

This is an absolutely crazy wasteful thing to do considering the actual cost of all that inference and nothing to be proud of.

by hypfer

6/14/2026 at 8:30:20 AM

Unless we do our own benchmarks, we have to take all the marketing fluff from the frontier labs at face value, and all public benchmarks degrade eventually as labs optimize towards them. OP’s approach is wasteful because it is brute force, but post says that an ELO is kept, so this is also an experiment, and I don‘t see what‘s wrong with that. You learn which model performs well in which settings which may save resources later. It‘s also wasteful to keep working with the wrong model/harness/tools for too long.

by loehnsberg

6/14/2026 at 8:17:03 AM

It is the other way round.

In an interactive session, adding "Fine, but make the button red" after the model generated a first solution more than doubles the tokens used. As the model now not only gets the original code and the feature request but also the updated code plus the change request as input tokens.

Sending a feature request to an LLM and then sending the feature request again with "The button shall be red" only doubles the tokens used.

by mg

6/14/2026 at 8:30:23 AM

The cost is far from linear though. Because of prompt caching and the fact that generally output tokens are a lot more expensive than input tokens.

by jgilias

6/14/2026 at 8:48:07 AM

Agreed that it is not linear.

I wrote my own agent, and it sends data to LLMs in this order: "General Prompts (How to write good code)" + "The Code" + "The Feature Request". This means the KV cache will be used even when the feature request changes.

And output tokens are usually way less than the input tokens.

So I think that my approach is very lightweight on token usage compared to an interactive session.

It would be interesting to measure it for the other agents out there. Sending a feature request two times vs an interactive session.

by mg

6/14/2026 at 8:35:23 AM

"Make the button red" probably doesn't need an LLM at all.

by ryan_glass

6/14/2026 at 10:16:13 AM

One tends to use LLMs for everything in practice. It‘s inconvenient to switch mode of operation

by Tepix

6/14/2026 at 8:32:17 AM

That’s usually not true due to caching. It may be true if you leave a large gap in between, but if you send “make it red” right after, then it’s purely incremental

by Chirono

6/14/2026 at 8:18:40 AM

Probably like 1% of the energy an average person spends on driving.

by redox99

6/14/2026 at 8:25:33 AM

Average american is what you mean

by Raphael_Amiard

6/14/2026 at 8:05:08 AM

come on now, we can't just not escape the permanent underclass by using our brains, we've also got to use up all the resources while doing it.

by cyanydeez

6/14/2026 at 8:54:25 AM

Which model is leading the pack for you?

by perching_aix

6/14/2026 at 9:21:27 AM

From the SOTA model providers, I only use OpenAI and Google. And between gpt-5.5 and gemini-3.1-pro-preview, gpt-5.5 is currently leading.

by mg

6/14/2026 at 8:59:12 AM

Yes context management is key.

I do my own framework and spend a lot of time trying to debug this and it’s not so much the context size in hard numbers but rather the probability that there is debris or wrong directions in the window that are drowning out the things the user thinks are important.

This manifests in the llm that keeps going back to doing the thing that failed when they tried it just before the last approach etc. The frequency of things in the context window give weight even if they are the wrong things.

I have a lot of tricks like not giving the llm lots of tools but rather giving it a tool it can use to search for tools etc.

But the bigger solution is in process where you use something like superpowers to force the llm through stages and you control the context that carries forward.

by wood_spirit

6/14/2026 at 9:34:31 AM

Considering how expensive context is in terms of compute, I wonder why (and if ) vendors don't invest more into context engineering.

When it comes to source code, I feel like LLMs could just as well work with something like minified source code, if an LLM is trained on programming well, I think there's no reason why something like a variable should be represented by something more than a single token. Comments can be discarded, etc. In fact considering embeddings for LLMs are very rich, I think common ops could be reduced to a single token.

Imo that's why LLMs are soo good at reverse engineering. A lot of the time, assembly (with symbols) is pretty close to the source code, but compressed and encoded, and if you're familiar with the patterns of your compiler, reversing it is not that difficult.

Anyways, context engineering could be huge boon to input token curation imo (and maybe it already is)

by torginus

6/14/2026 at 9:45:46 AM

Why is it surprising that, at some point, more information will lead to worse performance?

It seems obvious. Moreover, in a simple model, it seems like whatever tokens you do add have to have MORE information than the average in the existing window.

In a non-trivial model (and this is the model I would choose), since you are adding them to the end, they likely have to have MUCH more information.

Proof as always is an exercise to the reader.

by RandyRanderson

6/14/2026 at 7:42:14 AM

I've had no problem with Claude Code Opus 4.8 effort max using 20% token context (200k) on software development tasks (all stages). I aways load core source files and the ones we are working on up front. Around 20%, I make it autoprepare for a new session and clear.

Admittedly I have been doing this precautiously, based on anecdotal evidence, not because I had bad experiences with longer context deterioration myself.

In the brief time I had access to Fable 5, it went on long running tasks (>45 mins) into the 30-40% zone without apparent context coherence problems.

by PeterStuer

6/14/2026 at 9:23:41 AM

I built a very small personal extension for Pi [1] that gives me a /last command. It clears the entire session, only retaining the agent's last output message. This allows me to do manual "compaction". Basically I tell the agent something like "state the plan as discussed with references to files that should be edited", and call /last, then tell it to implement.

[1] https://pi.dev/

by WilcoKruijer

6/14/2026 at 7:22:15 AM

The approach we're taking to deal with this very real context rot is using a bunch of related techniques which we call transposing the agent loop: https://alejo.ch/3jt

In essence, we run many short agent loops, generating their prompts dynamically from structured data. Each loop advances the state in a small step towards the final goal.

by afc

6/14/2026 at 8:49:05 AM

I wonder how much this depends on the quality and consistency of the context?

For example, it may be the case that a long context full of useful information relevant to the task is completely fine, perhaps even beneficial. And if the context contains a bunch of unrelated tangents and conflicting instructions, then it will be detrimental.

Have there been studies on what makes models get dumber? To what extent is context length to blame vs context quality?

by steveridout

6/14/2026 at 9:49:56 AM

> The number on the box gets bigger every release.

Not really tho right? Since we got to 1m context in mid 2025 nearly no one has gone higher.

by rsanek

6/14/2026 at 7:18:32 AM

I /clear all the time out of habit. I want to be able to get the thing done with minimal context. It also means you can do it again slightly different if needed, you know the seed conditions for the task.

by mcapodici

6/14/2026 at 9:38:48 AM

It is a lot like giving a person instructions, the more you tell them, the more they will forget the specifics.

by dalemhurley

6/14/2026 at 8:56:26 AM

Evaluating the Sensitivity of LLMs to Prior Context

https://arxiv.org/abs/2506.00069

by cowang

6/14/2026 at 6:19:02 AM

Perhaps compacting the context can be made in multiple requests over smaller and overlapping chunks to avoid using the 'dumb zone', and for yielding a better result.

by da-x

6/14/2026 at 7:49:42 AM

Even taking the author's criticism about large context windows for granted, which in my experience are exaggerated, they are still a huge UX improvement over short windows. That reason alone is enough for me to support them.

by mightyham

6/14/2026 at 8:17:32 AM

In my own testing I have seen peak performance happen usually within 15-20% of the intended context limit, albeit there are a few optimizations depending on the task quality.

by jackxlau

6/14/2026 at 8:06:50 AM

There's an env var you can set in Claude Code to bring the autocompact threshold down, effectively setting your own max context window. I have it at 400k.

by walthamstow

6/14/2026 at 8:36:27 AM

i let the main loop spawn sub terminal via tmux to prevent large contexts. it's great to divide tasks in small patterns and consolidate it step by step.

by Febriss33

6/14/2026 at 8:42:59 AM

Even better, don't trust LLMs at all.

by BrenBarn

6/14/2026 at 9:46:56 AM

aka Softmax context rot

by woadwarrior01

6/14/2026 at 8:03:04 AM

Is there any chance that this is because training corpus largely consists of documents shorter than the advertised context windows?

by petesergeant

6/14/2026 at 7:47:45 AM

Hasn’t been my experience at all - 1M window is a very clear upgrade working with Claude code.

by mock-possum

6/14/2026 at 10:28:25 AM

[flagged]

by ashish296

6/14/2026 at 8:09:01 AM

[flagged]

by 3vo-ai

6/14/2026 at 7:39:22 AM

[flagged]

by Dollarland

6/14/2026 at 6:18:32 AM

[flagged]

by breakthematrix