1M context is now generally available for Opus 4.6 and Sonnet 4.6

3/14/2026 at 4:33:29 PM

Interesting, I’ve never needed 1M, or even 250k+ context. I’m usually under 100k per request.

About 80% of my code is AI-generated, with a controlled workflow using dev-chat.md and spec.md. I use Flash for code maps and auto-context, and GPT-4.5 or Opus for coding, all via API with a custom tool.

Gemini Pro and Flash have had 1M context for a long time, but even though I use Flash 3 a lot, and it’s awesome, I’ve never needed more than 200k.

For production coding, I use

- a code map strategy on a big repo. Per file: summary, when_to_use, public_types, public_functions. This is done per file and saved until the file changes. With a concurrency of 32, I can usually code-map a huge repo in minutes. (Typically Flash, cheap, fast, and with very good results)

- Then, auto context, but based on code lensing. Meaning auto context takes some globs that narrow the visibility of what the AI can see, and it uses the code map intersection to ask the AI for the proper files to put in context. (Typically Flash, cheap, relatively fast, and very good)

- Then, use a bigger model, GPT 5.4 or Opus 4.6, to do the work. At this point, context is typically between 30k and 80k max.

What I’ve found is that this process is surprisingly effective at getting a high-quality response in one shot. It keeps everything focused on what’s needed for the job.

Higher precision on the input typically leads to higher precision on the output. That’s still true with AI.

For context, 75% of my code is Rust, and the other 25% is TS/CSS for web UI.

Anyway, it’s always interesting to learn about different approaches. I’d love to understand the use case where 1M context is really useful.

by jeremychone

3/14/2026 at 5:28:15 PM

Yeah this is the simpler and also effective strategy. A lot of people are building sophisticated AST RAG models. But you really just need to ask Claude to generally build a semantic index for each large-ish piece of code and re-use it when getting context.

You have to make sure the semantic summary takes up significantly less tokens than just reading the code or its just a waste of token/time.

Then have a skill that uses git version logs to perform lazy summary cache when needed.

by daemonk

3/14/2026 at 5:05:25 PM

It seems like a very good use of LLMs. You should write a blog post with detail of your process with examples for people who are not into all AI tools as much. I only use Web UI. Lots of what you are saying is beyond me, but it does sound like clever strategy.

by smusamashah

3/14/2026 at 6:48:50 PM

Yeah we all converge to the same workflow, in my ai coding agent I'm working on now, I've added an "index" tool that uses tree-sitter to compress and show the AI a skeleton of a code file.

Here's the implementation for the interested: https://github.com/tontinton/maki/blob/main/maki-code-index%...

by tontinton

3/14/2026 at 7:08:48 PM

Oh, that's great.

I've always wanted to explore how to fit tree-sitter into this workflow. It's great to know that this works well too.

Thanks for sharing the code.

(Here is the AIPack runtime I built, MIT: https://github.com/aipack-ai/aipack), and here is the code for pro@coder (https://github.com/aipack-ai/packs-pro/tree/main/pro/coder) (AIPack is in Rust, and AI Packs are in md / lua)

by jeremychone

3/14/2026 at 4:35:52 PM

whenever I see post like this

i said well yeah, but its too sophiscated to be practical

by firemelt

3/14/2026 at 5:35:46 PM

Fair point, but because I spent a year building and refining my custom tool, this is now the reality for all of my AI requests.

I prompt, press run, and then I get this flow: dev setup (dev-chat or plan) code-map (incremental 0s 2m for initial) auto-context (~20s to 40s) final AI query (~30s to 2m)

For example, just now, in my Rust code (about 60k LOC), I wanted to change the data model and brainstorm with the AI to find the right design, and here is the auto-context it gave me:

- Reducing 381 context files ( 1.62 MB)

- Now 5 context files ( 27.90 KB)

- Reducing 11 knowledge files ( 30.16 KB)

- Now 3 knowledge files ( 5.62 KB)

The knowledge files are my "rust10x" best practices, and the context files are the source files.

(edited to fix formatting)

by jeremychone

3/15/2026 at 10:40:13 AM

How do you re-evaluate your approach? I'm asking because the landscape, at least from my lens, was completely different a year ago. So I fear that as the foundation shifts whatever learnings, approaches and mental models I have risk being obsolete and starts to work against me.

The problem of evaluating is hard enough as it is without layers of indirection built on top of it.

by tjoff

3/14/2026 at 5:23:25 PM

It's not sophisticated at all, he just uses a model to make some documentation before asking another model to work using the documentation

by adammarples

3/14/2026 at 7:52:00 PM

I built myself an AST based solution for that during the last 6 months roughly. I always wondered whether grep and agent-based discovery will be the end of it and thought it just has to be better with a more deterministic approach.

In the end it's hard to measure but personally I feel that my agent rarely misses any context for a given task, so I'm pretty happy with it.

I used a different approach than tree-sitter because I thought I found a nice way to get around having to write language-specific code. I basically use VSCode as a language backend and wrote some logic to basically rebuild the AST tree from VSCode's symbol data and other API.

That allows me to just install the correct language extension and thus enable support for that specific language. The extension has to provide symbol information which most do through LSP.

In the end it was way more effort than just using tree-sitter, however, and I'm thinking of doing a slow migration to that approach sooner or later.

Anyways, I created an extension that spins up an mcp server and provides several tools that basically replace the vanilla discovery tools in my workflow.

The approach is similar to yours, I have an overview tool which runs different centrality ranking metrics over the whole codebase to get the most important symbols and presents that as an architectural overview to the LLM.

Then I have a "get-symbol-context" tool which allows the AI to get all the information that the AST holds about a single symbol, including a parameter to include source code which completely replaces grepping and file reading for me.

The tool also specifies which other symbols call the one in question and which others it calls, respectively.

But yeah, sorry for this being already a quite long comment, if you want to give it a try, I published it on the VSCode marketplace a couple of days ago, and it's basically free right now, although I have to admit that I still want to try to earn a little bit of money with it at some point.

Right now, the daily usage limit is 2000 tool calls per day, which should be enough for anybody.

Would love to hear what you think :)

<https://marketplace.visualstudio.com/items?itemName=LuGoSoft...>

by lukeundtrug

3/15/2026 at 5:37:53 PM

I looked at your solution and extension README, and it's very interesting and well thought out.

The fact that you've been using it for six months and that it performs well says a lot. At the end of the day, that's what counts.

I like your idea of piggybacking on top of the LSP services, and I can imagine that this was quite a bit of work. Doing it as an MCP server makes it usable across different tools.

I also really like the name "Context Master."

In my case, it's much more niche since it's for the tool I built. Though it's open source, the key difference is that the "indexing" is only agentic at this point.

I can see value in mixing the two. LSP integration scares me because of the amount of work involved, and tree-sitter seems like a good path.

In that case, in the code map, for each item, there could be both the LLM response info and some deterministic info, for example, from tree-sitter.

That being said, the current approach works so well that I think I am going to keep using and fine-tuning it for a while, and bring in deterministic context only when or if I need it.

Anyway, what you built looks great. If it works, that's great.

by jeremychone

3/16/2026 at 9:38:36 PM

Thanks for taking the time to check it out and for the kind words! I really appreciate it.

I totally get sticking with your current approach. Your workflow sounds very intriguing as well. A combination of both approaches might really be very interesting :) Adding an LLM interpretation layer on top of my graph is also something I'm actively considering.

Thanks for the great discussion, and best of luck with your tool and workflow!

by lukeundtrug

3/14/2026 at 4:56:50 PM

This is really interesting; ive done very high level code maps but the entire project seems wild, it works?

So, small model figures out which files to use based on the code map, and then enriches with snippets, so big model ideally gets preloaded with relevant context / snippets up front?

Where does code map live? Is it one big file?

by cloverich

3/14/2026 at 5:45:33 PM

So, I have a pro@coder/.cache/code-map/context-code-map.json.

I also have a `.tmpl-code-map.jsonl` in the same folder so all of my tasks can add to it, and then it gets merged into context-code-map.json.

I keep mtime, but I also compute a blake3 hash, so if mtime does not match, but it is just a "git restore," I do not redo the code map for that file. So it is very incremental.

Then the trick is, when sending the code map to AI, I serialize it in a nice, simple markdown format.

- path/to/file.rs - summary: ... - when to use: ... - public types: .., .., .. - public functions: .., .., ..

- ...

So the AI does not have to interpret JSON, just clean, structured markdown.

Funny, I worked on this addition to my tool for a week, planning everything, but even today, I am surprised by how well it works.

I have zero sed/grep in my workflow. Just this.

My prompt is pro@coder/coder-prompt.md, the first part is YAML for the globs, and the second part is my prompt.

There is a TUI, but all input and output are files, and the TUI is just there to run it and see the status.

by jeremychone

3/14/2026 at 5:46:27 PM

1M context is super useful with Gemini, not so much for coding, but for data analysis.

by CuriouslyC

3/14/2026 at 7:10:39 PM

Even there, I use AI to augment rows and build the code to put data into Json or Polars and create a quick UI to query the data.

by jeremychone

3/14/2026 at 7:57:48 PM

  > - a code map strategy on a big repo. Per file: summary, when_to_use, public_types, public_functions. This is done per file and saved until the file changes. With a concurrency of 32, I can usually code-map a huge repo in minutes. (Typically Flash, cheap, fast, and with very good results)

Thanks, but why use any AI to generate this? I would say: you document your functions-in-code, types are provided from the compiler service, so it should all be deterministically available in seconds iso minutes, without burning tokens. Am I missing something?

by exceptione

3/14/2026 at 8:44:50 PM

Very good point. I had two options:

1) Deterministic

  - Using a tree-sitter/AST-like approach, I could extract types, functions, and perhaps comments, and put them into an index map.

  - Cons:

    - The tricky part of this approach is that what I extract can be pretty large per file, for example, comments.

    - Then, I would probably need an agentic synthesis step for those comments anyway.

2) Agentic

  - Since Flash is dirt cheap, I wanted to experiment and skip #1, and go directly to #2.

  - Because my tool is built for concurrency, when set to 32, it's super fast.

  - The price is relatively low, perhaps $1 or $2 for 50k LOC, and 60 to 90 seconds, about 30 to 45 minutes of AI work.

  - What I get back is relatively consistent by file, size-wise, and it's just one trip per file.

So, this is why I started with #2.

And then, the results in real coding scenarios have been astonishing.

Way above what I expected.

The way those indexes get combined with the user prompt gets the right files 95% of the time, and with surprisingly high quality.

So, I might add deterministic aspects to it, but since I think I will need the agentic step anyway, I have deprioritized it.

by jeremychone

3/14/2026 at 5:43:38 PM

I think you've kind of hit on the more successful point here, which is that you should be keeping things focused in a sufficiently focused area to have better success and not necessarily needing more context.

by speakbits

3/14/2026 at 7:06:14 PM

Well, out of all the workflows I have seen, this one is rather nice, might give it a try.

I imagine if the context were being commited and kept up-to-date with CI would work for others to use as well.

However, I'm a little confused on the autocontext/globs narrowing part. Do you, the developer, provide them? Or you feed the full code map to flash + your prompt so it returns the globs based on your prompt?

Also, in general, is your map of a file relatively smaller than the file itself, even for very small files?

by rafael-lua

3/14/2026 at 7:50:58 PM

- The ..-code-map.json files are per "developer folder," which would create too many conflicts if they were kept in Git.

- I have two main globs, which are lists of globs: knowledge_globs and context_globs. Knowledge can be absolute and should be relatively static. context_globs have to be relative to the workspace, since they are the working files.

- As a dev, you provide them in the top YAML section of the coder-prompt.md.

- The auto-context sub-agent calls the code-map sub-agent. Sub-agents can add to or narrow the given globs, and that is the goal of the auto-context agent.

It looks complicated, but it actually works like a charm.

Hopefully, I answered some of your questions.

I need to make a video about it.

But regardless, I really think it's not about the tools, it's about the techniques. This is where the true value is.

by jeremychone

3/14/2026 at 8:02:02 PM

  > I need to make a video about it.

My 2ct, I think writing and reading an article is easier.

by exceptione

3/14/2026 at 8:25:27 PM

point taken.

by jeremychone

3/15/2026 at 7:51:20 AM

Looking forward to some article/video

by akrauss

3/14/2026 at 5:57:51 PM

Your code map compresses signal on the context side. Same principle applies on the prompt side: prompts that front-load specifics (file, error, expected behavior) resolve in 1-2 turns. Vague ones spiral into 5-6. 1M context doesn't change that — it just gives you more room for the spiral.

by LuxBennu

3/15/2026 at 1:23:20 AM

This is interesting but don't you worry that you're competing with entire companies (e.g. Anthropic) and thus it's a losing battle? Since you're re-implementing a bunch of stuff they either do in their harness or have decided it was better not to do?

by Myrmornis

3/15/2026 at 3:25:42 AM

I think it's worth remembering that for any offering like that, it necessarily needs to be ~one-size-fits-all, while what you come up with.. doesn't.

They're solving a different problem than you. So I think it's very plausible that you could come up with something that, for your use case, performs considerably better than their "defaults".

by mh-

3/16/2026 at 11:50:47 AM

Personally I don't see aipack's pro@coder and other approaches (claude code, cursor, copilot, etc...) as competitors anymore. I use both approaches to solve different problems. I keep using the agentic solutions (claude code style) for more operational tasks, a bit like "smart interfaces to terminal", and pro@coder for coding / engineering tasks where I need a much tighter control over long running work sessions.

by sphilipakis

3/14/2026 at 10:36:10 PM

This is fascinating. I feel like this is converging into the concept of a traditional "IDE". So much of your setup reminds me of IDEs indexing, doing static analysis, building ASTs, etc. before a developer starts writing code.

by ra7

3/15/2026 at 5:24:25 PM

Yes, there is a parallel here. Now, some of those "indexing" steps can be performed by an LLM.

And that does not prevent mixing and matching the two, as some comments in this thread suggest.

Anyway, it's a great time for production coding.

by jeremychone

3/15/2026 at 8:50:50 AM

My approach has been using static analysis to produce a Mermaid diagram of all Classes:Methods and their caller/callees.

by Weryj

3/14/2026 at 6:32:21 PM

very interested in this approach and many other people are for sure. Please do a blog post.

by make_it_sure

3/13/2026 at 7:30:59 PM

The big change here is:

> Standard pricing now applies across the full 1M window for both models, with no long-context premium. Media limits expand to 600 images or PDF pages.

For Claude Code users this is huge - assuming coherence remains strong past 200k tok.

by dimitri-vs

3/14/2026 at 12:58:49 AM

Is it ever useful to have a context window that full? I try to keep usage under 40%, or about 80k tokens, to avoid what Dex Horthy calls the dumb zone in his research-plan-implement approach. Works well for me so far.

No vibes allowed: https://youtu.be/rmvDxxNubIg?is=adMmmKdVxraYO2yQ

by MikeNotThePope

3/14/2026 at 2:08:39 AM

I'd been on Codex for a while and with Codex 5.2 I:

1) No longer found the dumb zone

2) No longer feared compaction

Switching to Opus for stupid political reasons, I still have not had the dumb zone - but I'm back to disliking compaction events and so the smaller context window it has, has really hurt.

I hope they copy OpenAI's compaction magic soon, but I am also very excited to try the longer context window.

by furyofantares

3/14/2026 at 8:48:43 AM

If you use OpenCode (open source Claude Code implementation), you can configure compaction yourself : https://opencode.ai/docs/en/config/#compaction

by pjerem

3/14/2026 at 12:39:03 PM

OpenAI has some magic they do on their standalone endpoint (/responses/compact) just for compaction, where they keep all the user messages and replace the agent messages or reasoning with embeddings.

> This list includes a special type=compaction item with an opaque encrypted_content item that preserves the model’s latent understanding of the original conversation.

Some prior discussion here https://news.ycombinator.com/item?id=46737630#46739209 regarding an article here https://openai.com/index/unrolling-the-codex-agent-loop/

by furyofantares

3/14/2026 at 10:53:55 AM

Not sure if it's a common knowledge but I've learned not that long ago that you can do "/compact your instructions here", if you just say what you are working on or what to keep explicitly it's much less painful.

In general LLMs for some reason are really bad at designing prompts for themselves. I tested it heavily on some data where there was a clear optimization function and ability to evaluate the results, and I easily beat opus every time with my chaotic full of typos prompts vs its methodological ones when it is writing instructions for itself or for other LLMs.

by comboy

3/14/2026 at 11:52:16 AM

You can also put guidance for when to compact and with what instructions into Claude.md. The model itself can run /compact, and while I try to remember to use it manually, I find it useful to have “If I ask for a totally different task and the current context won’t be useful, run /compact with a short summary of the new focus”

by brookst

3/14/2026 at 4:13:19 PM

I ofter wonder if I'm missing something, but shouldn't we be able to edit the context manually???

In that way we could erase prompts and responses that didn't yield anything useful or derailed the model.

Why can't we do that?

by copperx

3/14/2026 at 10:56:35 AM

so you have to garbage collect manually for the AI?

also, i don't want to make a full parent post

1M tokens sounds real expensive if you're constantly at that threshold. There's codebases larger in LOC; i read somewhere that Carmack has "given to humanity" over 1 million lines of his code. Perhaps something to dwell on

by genewitch

3/14/2026 at 4:51:19 AM

This is true.

When I am using codex, compaction isn’t something I fear, it feels like you save your gaming progress and move on.

For Claude Code compaction feels disastrous, also much longer

by karmasimida

3/14/2026 at 2:33:45 AM

1m context in OpenAI and Gemini is just marketing. Opus is the only model to provide real usable bug context.

by mgambati

3/14/2026 at 3:23:19 AM

I'm directly conveying my actual experience to you. I have tasks that fill up Opus context very quickly (at the 200k context) and which took MUCH longer to fill up Codex since 5.2 (which I think had 400k context at the time).

This is direct comparison. I spent months subscribed to both of their $200/mo plans. I would try both and Opus always filled up fast while Codex continued working great. It's also direct experience that Codex continues working great post-compaction since 5.2.

I don't know about Gemini but you're just wrong about Codex. And I say this as someone who hates reporting these facts because I'd like people to stop giving OpenAI money.

by furyofantares

3/14/2026 at 11:16:15 AM

I agree even though I used to be a die hard Claude fan I recently switched back to ChatGPT and codex to try it out again and they’ve clearly pulled into the lead for consistency, context length and management as well as speed. Claude Code instilled a dread in me about keeping an eye on context but I’m slowly learning to let that go with codex.

by throwthrowuknow

3/14/2026 at 4:46:36 PM

Surely compaction is down to the agent rather than the model, so are you comparing Claude Code to Codex CLI?

by HarHarVeryFunny

3/14/2026 at 8:45:03 PM

It's both.

by alex_sf

3/14/2026 at 9:37:16 AM

This has been my experience too.

by sagarpatil

3/14/2026 at 10:58:00 AM

Have any of you heard of map reduce

by genewitch

3/14/2026 at 3:39:47 AM

[flagged]

by dotancohen

3/14/2026 at 4:01:59 AM

When Anthropic said they wouldn't sell LLMs to the government for mass surveillance or autonomous killing machines, and got labeled a supply chain risk as a result, OpenAI told the public they have the same policy as Anthropic while inking a deal with the government that clearly means "actually we will sell you LLMs for mass surveillance or autonomous killing machines but only if you tell us it's legal".

If you already knew all that I'm not interested in an argument, but if you didn't know any of that, you might be interested in looking it up.

edit: Your post history has tons of posts on the topic so clearly I just responded to flambait, and regret giving my time and energy.

by furyofantares

3/14/2026 at 4:37:48 AM

I appreciate both your taking an ethical stance on openai, and the way you're engaging in this thread. The parent was probably flame bait as you say, but other people in the thread might be genuinely curious.

by igor47

3/14/2026 at 4:41:43 AM

I'm not some kind of OpenAI or Pentagon fanboy, but it's pretty easy to for me to understand why a buyer of a critical technology wants to be free to use it however they want, within the law, and not subject to veto from another entity's political opinions. It sounds perfectly reasonable to me for the military to want to decide its uses of technologies it purchases itself.

It's not like the military was specifically asking for mass surveillance, they just wanted "any legal use". Anthropic's made a lot of hay posturing as the moral defender here, but they would have known the military would never agree to their terms, which makes the whole thing smell like a bit of a PR stunt.

The supply chain risk designation is of course stupid and vindictive but that's more of an administration thing as far as I can tell.

by sho

3/14/2026 at 8:12:14 AM

As long as it's within the law? What if they politically control the law-making system? What if they've shown themselves to operate brazenly outside the law?

by lifeformed

3/14/2026 at 6:19:02 AM

Why downplay the mass surveillance aspect by saying it's a request by "the military". It's a request by the department of defense, the parent organization of the NSA.

From what has been shared publicly, they absolutely did ask for contractual limits on domestic mass surveillance to be removed, and to my read, likely technical/software restrictions to be removed as well.

What the department of defense is legally allowed to do is irrelevant and a red herring.

by stahtops

3/14/2026 at 6:56:38 AM

“Any legal use” is an exceptionally broad framework, and after the FISA “warrants,” it would appear it is incumbent on private companies to prevent breaches of the US constitution, as the government will often do almost anything in the name of “national security,” inalienable rights against search and seizure be damned.

If it isn’t written in the contract, it can and will be worked around. You learn that very quickly in your first sale to a large enterprise or government customer.

Anthropic was defending the US constitution against the whims of the government, which has shown that it is happy to break the law when convenient and whenever it deems necessary.

Note: I used to work in the IC. I have absolutely nothing against the government. I am a patriot. It is precisely for those reasons, though, that I think Anthropic did the right thing here by sticking to their guns. And the idiotic “supply chain risk” designation will be thrown out in court trivially.

by borski

3/14/2026 at 7:52:01 AM

[flagged]

by injidup

3/14/2026 at 8:16:14 AM

I hope you don't get this the wrong way. I sincerely mean it. Please, get some psychological help. Seek out a professional therapist and talk to them about your life.

by shafyy

3/14/2026 at 10:11:44 AM

I'm totally aware it's just a machine with no internal monologue. It's just a stateless text processing machine. That is not the point. The machine is able to simulate moral reasoning to an undefined level. It's not necessary to repeat this all the time. The simulation of moral reasoning and internal monologue is deep, unpredictable, not controllable and may or may not align with the interests of anyone who gives it "arms and legs" and full autonomy. If you are just interested in using these tools for glorified auto complete then you are naïve with regards to the usages other actors, including state actors are attempting to use them. Understanding and being curious about the behaviour without completely anthropomorphising them is reasonable science.

by injidup

3/14/2026 at 2:41:56 AM

Source? I ask because I use 500k+ context on these on a daily basis.

Big refactorings guided by automated tests eat context window for breakfast.

by hu3

3/14/2026 at 2:58:26 AM

i find gemini gets real real bad when you get far into the context - gets into loops, forgets how to call tools, etc

by 8note

3/14/2026 at 10:26:59 AM

yeah gemini is dumb when you tell it to do stuff - but the things it finds (and critically confirms, including doing tool calls while validating hypotheses) in reviews absolutely destroy both gpt and opus.

if you're a one-model shop you're losing out on quality of software you deliver, today. I predict we'll all have at least two harness+model subscriptions as a matter of course in 6-12 months since every model's jagged frontier is different at the margins, and the margins are very fractal.

by baq

3/14/2026 at 3:55:52 AM

I find Gemini to be real bad. Are you just using it for price reasons, or?

by petesergeant

3/14/2026 at 3:45:03 AM

I find gemini does that normally, personally. Noticeably worse in my usage than either Claude or Codex.

by girvo

3/14/2026 at 6:12:12 AM

How many big refactorings are you doing? And why?

by Bolwin

3/14/2026 at 6:49:30 AM

How is that relevant? we are talking about models, now what you do with them.

by kimi

3/14/2026 at 3:22:01 AM

Codex high reasoning has been a legitimately excellent tool for generating feedback on every plan Claude opus thinking has created for me.

by johnebgd

3/14/2026 at 4:20:57 PM

Using Codex more for now, and there is definitely some compaction magic. I’m keeping the same conversation going and going for days, some at almost 1B tokens (per the codex cli counters), with seemingly no coherency loss

by radicality

3/14/2026 at 2:33:40 AM

Hmm I’ve felt the dumb zone on codex

by iknowstuff

3/14/2026 at 3:19:51 AM

From what I've seen, it means whatever he's doing is very statistically significant.

by nomel

3/14/2026 at 3:38:26 AM

Thanks for the video.

His fix for "the dumb zone" is the RPI Framework:

● RESEARCH. Don't code yet. Let the agent scan the files first. Docs lie. Code doesn't.

● PLAN. The agent writes a detailed step-by-step plan. You review and approve the plan, not just the output. Dex calls this avoiding "outsourcing your thinking." The plan is where intent gets compressed before execution starts.

● IMPLEMENT. Execute in a fresh context window. The meta-principle he calls Frequent Intentional Compaction: don't let the chat run long. Ask the agent to summarize state, open a new chat with that summary, keep the model in the smart zone.

by kaizenb

3/14/2026 at 10:15:05 AM

More recently I've been doing the implement phase without resetting the whole context when context is still < 60% full and must say I find it to be a better workflow in many cases (depends a bit on the size of the plan I suppose.)

It's faster because it has already read most relevant files, still has the caveats / discussion from the research phase in its context window, etc.

With the context clear the plan may be good / thorough but I've had one too many times that key choices from the research phase didn't persist because halfway through implementation Opus runs into an issue and says "You know what? I know a simpler solution." and continues down a path I explicitly voted down.

by Huppie

3/14/2026 at 3:44:19 AM

That's fascinating: that is identical to the workflow I've landed on myself.

by girvo

3/14/2026 at 4:09:53 AM

It's also identical to what Claude Code does if you put it in plan mode (bound to <tab> key), at least in my experience.

by hedora

3/14/2026 at 7:55:24 PM

better to instruct it to write a plan .md file that is appropriately named so that it can be easily referenced/updated in multiple sessions. I've found that effective.

by insane_dreamer

3/16/2026 at 9:13:14 PM

Dunno if you know this but the plan in plan mode is a markdown file! Ask it for the file and it will give it to you.

by cruffle_duffle

3/17/2026 at 11:14:05 PM

yes, but if you start a fresh session to continue working on your project, it's a lot easier if you already know which PLAN file you need for your project. Plus you can commit it.

by insane_dreamer

3/14/2026 at 4:24:58 AM

My annoyance with plan mode is where it sticks the .md file, kind of hides it away which makes it annoying to clear context and start up a new phase from the PLAN file. But that might just be a skill issue on my end

by girvo

3/14/2026 at 4:34:56 AM

Even worse, it just randomly blows away the plan file without asking for permission.

No idea what they were thinking when they designed this feature. The plan file names are randomly generated, so it could just keep making new ones forever for free (it would take a LONG time for the disk space to matter), but instead, for long plans, I have to back the plan file up if it gets stuck. Otherwise, I say "You should take approach X to fix this bug", it drops into plan mode, says "This is a completely unrelated plan", then deletes all record of what it was doing before getting stuck.

by hedora

3/14/2026 at 5:35:08 AM

It’s not just me then! Hah good to know. It’s why I’ve started ignoring plan modes in most agent harnesses, and managing it myself through prompting and keeping it in the code base (but not committed)

by girvo

3/14/2026 at 12:19:14 PM

My experience also. The claude code document feature is a real missed opportunity. As you can see in this discussion, we all have to do it manually if we want it to work.

by toddmerrill

3/14/2026 at 6:57:49 AM

After creating the plan in Plan mode (+Thinking) I ask Claude to move the plan .md file to /docs/plans folder inside the repo.

Open a new chat with Opus, thinking mode is off. Because no need when we have detailed plan.

Now the plan file is always reachable, so when the context limit is narrowing, mostly around 50%, I ask Claude to update the plan with the progress, and move to a new chat @pointing the plan file and it continue executing without any issue.

by kaizenb

3/14/2026 at 4:28:30 AM

It’s the style spec-kit uses: https://github.com/github/spec-kit

Working on my first project with it… so far so good.

by cortesoft

3/14/2026 at 7:49:03 AM

> RESEARCH. Don't code yet. Let the agent scan the files first. Docs lie. Code doesn't.

I find myself often running validity checks between docs and code and addressing gaps as they appear to ensure the docs don’t actually lie.

by iamacyborg

3/14/2026 at 9:03:52 AM

I have Codex and Gemini critique the plan and generate their plans. Then I have Claude review the other plans and add their good ideas. It frequently improves the plan. I then do my careful review.

by silverlake

3/14/2026 at 11:29:17 AM

This is exactly how I've found leads to most consistent high quality results as well. I don't use gemini yet (except for deep research, where it pulls WAY ahead of either of the other 'grounding' methods)

But Codex to plan big features and Claude to review the feature plan (often finds overlooked discrepancies) then review the milestones and plan implementation of them in planning mode, then clear context and code. Works great.

by ArtRichards

3/14/2026 at 3:19:57 PM

Add a REFLECT phase after IMPLEMENT. I’m finding it’s extremely useful to ask agents for implementation notes and for code reviews. These are different things, and when I ask for implementation notes I get very different output than the implementation summary it spits out automatically. I ask the agent to surface all design choices it had to make that we didn’t explicitly discuss in the plan, and then check in the plan + impl notes in order to help preload context for the next thing.

My team has been adopting a separation of plan & implement organically, we just noticed we got better output that way, plus Claude now suggests in plan mode to clear context first before implementing. We are starting to do team reviews on the plan before the implement phase. It’s often helpful to get more eyeballs on the plan and improve it.

by dahart

3/14/2026 at 11:26:02 AM

How is that Plan strategy not "outsourcing your thinking" because that's exactly what it sounds like. AI does the heavy lifting and you are the editor.

by greenchair

3/14/2026 at 11:55:21 AM

Is a VP of engineering “outsourcing their thinking” by having an org that can plan and write software?

by brookst

3/14/2026 at 12:42:52 PM

Yes.

by Filligree

3/14/2026 at 3:21:18 PM

Interesting take. Does that mean SWE's are outsourcing their thinking by relying on management to run the company, designers to do UX, support folks to handle customers?

Or is thinking about source code line by line the only valid form of thinking in the world?

by brookst

3/14/2026 at 6:13:06 PM

I mean yes? That's like, the whole idea behind having a team. The art guy doesn't want to think about code, the coder doesn't want to think about finances, the accountant doesn't want to worry about customer support. It would be kind of a structural failure if you weren't outsourcing at least some of your thinking.

by qualifck

3/15/2026 at 1:15:17 AM

I’m with you, perhaps I just misread some kind of condescension into the “outsourcing your thinking” comment.

We all have limited context windows, the world’s always worked that way, just seemed odd to (mis)read someone saying there’s something wrong with focusing on when you add the greatest value and trusting others to do the same.

by brookst

3/15/2026 at 4:47:35 PM

It is condescending when antis say AI users do it. It isn’t when a director or team leader does it.

But it’s the same process, which should tell you what’s really going on here. It’s about status, not functionality, and you don’t gain status without controlling other humans.

by Filligree

3/14/2026 at 2:30:15 PM

Delegation is generally all about outsourcing, so hard agree

by Eldt

3/14/2026 at 12:30:04 PM

Offtopic: I find it remarkable the shortened YT url has a tracking cost of 57% extra length. We live in stupid times.

by alecco

3/14/2026 at 2:52:02 PM

I care about the privacy implications, but not the length. Out of curiosity, why do you care about the URL length at all? What is the cost to you?

by dahart

3/14/2026 at 3:34:33 PM

For the same reason people use link shorteners at all. It’s much more pleasant to look at and makes people more likely to press it compared to a paragraph-long URL full of tracking garbage.

by tarbyqualia

3/14/2026 at 7:07:06 PM

Please. The URL above is pretty short, this is not the kind of URL link shorteners were made for, in fact it’s already shortened, as @alecco pointed out.

Pleasant? I could not care less about the pleasantness of the video code, but a shortened URL in this case would not be more pleasant, and it would be functionally worse, and barely shorter; all you’d be able to trim is the “?si=“. I’m baffled by this thread.

by dahart

3/14/2026 at 3:25:21 PM

My point is Google engineers go to the trouble of setting up a URL shortener service on one hand, but on the other hand it seems ad the business anti-privacy executives can override anything. This points out it's a dysfunctional company.

by alecco

3/14/2026 at 6:59:17 PM

You’d rather have the video code and the tracking code baked into the same code just to save a couple of characters? Why? That would result in a longer code than the video code alone, you would save very few characters. It would not be nicer to look at or functionally any different, and it would obscure the fact that it’s being tracked and prevent people from being able to edit the URL to remove the tracking. I appreciate the fact that I can see that the URL has a tracking ID and that I can edit the URL and remove the tracking ID. I do not want a shorter URL if I lose that ability. What you’re complaining about and wishing for would be MUCH worse than what it currently is.

by dahart

3/15/2026 at 10:21:16 PM

That's what both Pinterest and Tiktok do with their mobile short links.

by essentia0

3/14/2026 at 7:15:23 PM

I didn't say that.

by alecco

3/14/2026 at 7:24:39 PM

Then your point eludes me. You complained about the length. If you don’t want it shorter, then what do you want?

To me, the fact that the tracking code is visible and separate from the video code is evidence of the complete opposite of your conclusion - it’s evidence the ad business does not get to override either engineering nor what’s left of privacy control. Ad execs would surely prefer that the tracking code is not visible nor manually removeable.

by dahart

3/14/2026 at 8:50:38 PM

I didn't complain about length per se. I pointed out Google's contradiction. As my previous comment clarified. Jesus.

by alecco

3/15/2026 at 1:55:26 PM

It’s not a contradiction, as my previous comments clarified. You’re rationalizing, making assumptions, and jumping to a conclusion.

by dahart

3/14/2026 at 4:07:47 PM

The point is whatever group controls the money controls the power.

Also, only the domain is shorter

by inemesitaffia

3/14/2026 at 5:12:15 PM

Actually, it's not just the domain:

https://youtu.be/X

https://www.youtube.com/watch?v=X

by alecco

3/14/2026 at 1:22:33 AM

Yes. I've recently become a convert.

For me, it's less about being able to look back -800k tokens. It's about being able to flow a conversation for a lot longer without forcing compaction. Generally, I really only need the most recent ~50k tokens, but having the old context sitting around is helpful.

by SkyPuncher

3/14/2026 at 1:42:55 AM

Also, when you hit compaction at 200k tokens, that was probably when things were just getting good. The plan was in its final stage. The context had the hard-fought nuances discovered in the final moment. Or the agent just discovered some tiny important details after a crazy 100k token deep dive or flailing death cycle.

Now you have to compact and you don’t know what will survive. And the built-in UI doesn’t give you good tools like deleting old messages to free up space.

I’ll appreciate the 1M token breathing room.

by hombre_fatal

3/14/2026 at 2:05:47 AM

I've found compactation kills the whole thing. Important debug steps completely missing and the AI loops back round thinking it's found a solution when we've already done that step.

by roygbiv2

3/14/2026 at 3:31:20 AM

I find it useful to make Claude track the debugging session with a markdown file. It’s like a persistent memory for a long session over many context windows.

Or make a subagent do the debugging and let the main agent orchestrate it over many subagent sessions.

by s900mhz

3/14/2026 at 4:21:24 AM

Yeah I use a markdown to put progress in. It gets kinda long and convoluted a manual intervention is required every so often. Works though.

by roygbiv2

3/14/2026 at 2:34:03 AM

For me, Claude was like that until about 2m ago. Now it rarely gets dumb after compaction like it did before.

by garciasn

3/14/2026 at 2:59:23 AM

oh, ive found that something about compaction has been dropping everything that might be useful. exact opposite experience

by 8note

3/14/2026 at 3:04:09 AM

[dead]

by myrak

3/14/2026 at 1:06:14 AM

When running long autonomous tasks it is quite frequent to fill the context, even several times. You are out of the loop so it just happens if Claude goes a bit in circles, or it needs to iterate over CI reds, or the task was too complex. I'm hoping a long context > small context + 2 compacts.

by ogig

3/14/2026 at 1:16:36 AM

Yep I have an autonomous task where it has been running for 8 hours now and counting. It compacts context all the time. I’m pretty skeptical of the quality in long sessions like this so I have to run a follow on session to critically examine everything that was done. Long context will be great for this.

by SequoiaHope

3/14/2026 at 9:09:58 AM

Are those long unsupervised sessions useful? In the sense, do they produce useful code or do you throw most of it away?

by lukan

3/14/2026 at 12:00:02 PM

I get very useful code from long sessions. It’s all about having a framework of clear documentation, a clear multi-step plan including validation against docs and critical code reviews, acceptance criteria, and closed-loop debugging (it can launch/restsart the app, control it, and monitor logs)

I am heavily involved in developing those, and then routinely let opus run overnight and have either flawless or nearly flawless product in the morning.

by brookst

3/14/2026 at 1:18:11 AM

I haven't figured out how to make use of tasks running that long yet, or maybe I just don't have a good use case for it yet. Or maybe I'm too cheap to pay for that many API calls.

by MikeNotThePope

3/14/2026 at 1:33:34 AM

My change cuts across multiple systems with many tests/static analysis/AI code reviews happening in CI. The agent keeps pushing new versions and waits for results until all of them come up clean, taking several iterations.

by ashdksnndck

3/14/2026 at 1:40:52 AM

I mean if you don't have your company paying for it I wouldn't bother... We are talking sessions of 500-1000 dollars in cost.

by tudelo

3/14/2026 at 11:45:54 AM

Right. At Opus 4.6 rates, once you're at 700k context, each tool call costs ~$1 just for cache reads alone. 100 tool calls = $100+ before you even count outputs. 'Standard pricing' is doing a lot of work here lol

by takwatanabe

3/14/2026 at 12:03:20 PM

Cache reads don’t count as input tokens you pay for lol.

https://www.claudecodecamp.com/p/how-prompt-caching-actually...

by brookst

3/14/2026 at 1:15:58 AM

All of those things are smells imo, you should be very weary of any code output from a task that causes that much thrashing to occur. In most cases it’s better to rewind or reset and adapt your prompt to avoid the looping (which usually means a more narrowly defined scope)

by boredtofears

3/14/2026 at 1:31:08 AM

A person has a supervision budget. They can supervise one agent in a hands-on way or many mostly-hands-off agents. Even though theres some thrashing assistants still get farther as a team than a single micromanaged agent. At least that’s my experience.

by grafmax

3/14/2026 at 2:44:23 AM

Just curious, what kind of work are you doing where agentic workflows are consistently able to make notable progress semi-autonomously in parallel? Hearing people are doing this, supposedly productively/successfully, kind of blows my mind given my near-daily in-depth LLM usage on complex codebases spanning the full stack from backend to frontend. It's rare for me to have a conversation where the LLM (usually Opus 4.6 these days) lasts 30 minutes without losing the plot. And when it does last that long, I usually become the bottleneck in terms of having to think about design/product/engineering decisions; having more agents wouldn't be helpful even if they all functioned perfectly.

by not_kurt_godel

3/14/2026 at 2:59:46 AM

I've passed that bottleneck with a review task that produces engineering recommendations along six axis (encapsulation, decoupling, simplification, dedoupling, security, reduce documentation drift) and a ideation tasks that gives per component a new feature idea, an idea to improve an existing feature, an idea to expand a feature to be more useful. These two generate constant bulk work that I move into new chat where it's grouped by changeset and sent to sub agent for protecting the context window.

What I'm doing mostly these days is maintaining a goal.md (project direction) and spec.md (coding and process standards, global across projects). And new macro tasks development, I've one under work that is meant to automatically build png mockup and self review.

by avereveard

3/14/2026 at 3:17:20 AM

What are you using to orchestrate/apply changes? Claude CLI?

by not_kurt_godel

3/14/2026 at 4:52:57 AM

I prefer in IDE tools because I can review changes and pull in context faster.

At home I use roo code, at work kiro. Tbh as long as it has task delegation I'm happy with it.

by avereveard

3/14/2026 at 11:17:26 PM

I work on 1M LOC 15 yr old repo. Like you it's across the full stack. Bugs in certain pieces of complex business logic would have catastrophic consequences for my employer. Basically I peel poorly-specific work items off my queue into its own worktree and session at high reasoning/effort and provide a well-specified prompt.

These things eat into my supervision budget:

* LLM loses the plot and I have to nudge (like you) * Thinking hard to better specify prompts (like you) * Reviewing all changes (I do not vibe code except for spikes or other low-risk areas) * Manual thing I have to do (for things I have not yet automated with a agent-authored scripts) * Meetings * etc

So, yes, my supervision budget is a bottleneck. I can only run 5-8 agents at a time because I have only so much time in the day.

Compare that vs a single agent at high reasoning/effort: I am sitting waiting for it to think. Waiting for it to find the code area I'm talking about takes time. Compiling, running tests, fixing compile errors. A million other things.

Any time I find myself sitting and waiting, this is a signal to me to switch to a different session.

by grafmax

3/14/2026 at 1:31:12 AM

weary (tired) -> wary (cautious)

by chrisweekly

3/14/2026 at 1:48:29 AM

Wary, not weary. Wary: cautious. Weary: tired.

by saaaaaam

3/14/2026 at 7:21:28 AM

This is really common, I think because there’s also “leery” - cautious, distrustful, suspicious.

by dentalnanobot

3/14/2026 at 1:14:00 AM

It's kind of like having a 16 gallon gas tank in your car versus a 4 gallon tank. You don't need the bigger one the majority of the time, but the range anxiety that comes with the smaller one and annoyance when you DO need it is very real.

by dimitri-vs

3/14/2026 at 1:32:36 AM

It seems possible, say a year or two from now that context is more like a smart human with a “small”, vs “medium” vs “large” working memory. The small fellow would be able to play some popular songs on the piano , the medium one plays in an orchestra professionally and the x-large is like Wagner composing Der Ring marathon opera. This is my current, admittedly not well informed mental model anyway. Well, at least we know we’ve got a little more time before the singularity :)

by steve-atx-7600

3/14/2026 at 2:37:24 AM

It’s more like the size of the desk the AI has to put sheets of paper on as a reference while it builds a Lego set. More desk area/context size = able to see more reference material = can do more steps in one go. I’ve lately been building checklists and having the LLM complete and check off a few tasks at a time, compacting in-between. With a large enough context I could just point it at a PLAN.md and tell it to go to work.

by twodave

3/14/2026 at 1:30:37 AM

Except after 4 gallons it might as well be pure oil, mucking everything up.

by scwoodal

3/14/2026 at 2:03:25 AM

Since I'm yet to seriously dive into vibe coding or AI-assisted coding, does the IDE experience offer tracking a tally of the context size? (So you know when you're getting close or entering the "dumb zone")?

by ricksunny

3/14/2026 at 2:56:27 AM

The 2 I know, Cursor and Claude Code, will give you a percentage used for the context window. So if you know the size of the window, you can deduce the number of tokens used.

by MikeNotThePope

3/14/2026 at 12:05:22 PM

Claude code also gives you a granular breakdown of what’s using context window (system prompt, tools, conversation history, etc). /context

by brookst

3/14/2026 at 6:32:32 AM

In Claude code I believe it's /context and it'll give you a graphical representation of what's taking context space

by jfim

3/14/2026 at 2:10:37 AM

> Since I'm yet to seriously dive into vibe coding or AI-assisted coding

Unless you’re using a text editor as an IDE you probably have already

by nujabe

3/14/2026 at 3:00:27 AM

Cline gives you such a thing. you dont really know where the dumb zone by numbers though, only by feel.

by 8note

3/14/2026 at 2:08:09 AM

Most tools do, yes.

by stevula

3/14/2026 at 2:09:04 AM

OpenCode does this. Not sure about other tools

by quux

3/14/2026 at 3:53:34 AM

Looking at this URL, typo or YouTube flip the si tracking parameter?

  youtu.be/rmvDxxNubIg?is=adMmmKdVxraYO2yQ

by Barbing

3/14/2026 at 9:52:02 AM

I just cut & pasted the share URL provided by YouTube. Strip out the query param if you like.

by MikeNotThePope

3/15/2026 at 3:25:41 AM

Ooh it’s always ?si=

So this…

  ?is=

…that’s new.

Think you got A/B tested. Flipping the parameter breaks a lot of RegEx. Interesting!

by Barbing

3/14/2026 at 8:06:33 AM

I never use these giant context windows. It is pointless. Agents are great at super focused work that is easy to re-do. Not sure what is the use case for giant context windows.

by dev_l1x_be

3/14/2026 at 10:18:31 AM

Maxing out context is only useful if all the information is directly relevant and tightly scoped to the task. The model's performance tends to degrade with too much loosely related data, leading to more hallucinations and slower results. Targeted chunking and making sure context stays focused almost always yields better outcomes unless you're attempting something atypical, like analyzing an entire monorepo in one shot.

by hrmtst93837

3/14/2026 at 12:12:43 PM

I've used it many times for long-running investigations. When I'm deep in the weeds with a ton of disassembly listings and memory dumps and such, I don't really want to interrupt all of that with a compaction or handoff cycle and risk losing important info. It seems to remain very capable with large contexts at least in that scenario.

by wat10000

3/14/2026 at 1:59:49 AM

After running a context window up high, probably near 70% on opus 4.6 High and watching it take 20% bites out of my 5hr quota per prompt I've been experimenting with dumping context after completing a task. Seems to be working ok. I wonder if I was running into the long context premium. Would that apply to Pro subs or is just relevant to api pricing?

by maskull

3/14/2026 at 1:16:53 PM

I haven't hit the "dumb zone" any more since two months. I think this talk is outdated.

I'm using CC (Opus) thinking and Codex with xhigh on always.

And the models have gotten really good when you let them do stuff where goals are verifiable by the model. I had Codex fix a Rust B-rep CSG classification pipeline successfully over the course of a week, unsupervised. It had a custom STEP viewer that would take screenshots and feed them back into the model so it could verify the progress resp. the triangle soup (non progress) itself.

Codex did all the planning and verification, CC wrote the code.

This would have not been possible six months ago at all from my experience.

Maybe with a lot of handholding; but I doubt it (I tried).

I mean both the problem for starters (requires a lot of spatial reasoning and connected math) and the autonomous implementation. Context compression was never an issue in the entire session, for either model.

by virtualritz

3/14/2026 at 1:47:53 AM

That video is bizarre. Such a heavy breather.

by saaaaaam

3/14/2026 at 4:24:02 AM

What a weird and inconsequential thing to focus on...

He's just fucking closely miced with compression + speaking fast and anxious/excited speaking to an audience

by coldtea

3/15/2026 at 4:25:45 PM

Maybe. But that’s what I focused on, for better or worse. I couldn’t concentrate on what he was saying because of it. Maybe bad mic placement, but the end results was like some sort of old school phone sex pest.

by saaaaaam

3/14/2026 at 4:26:16 AM

Most of that is just nervousness

by indigodaddy

3/14/2026 at 3:22:27 AM

Yes. I’ve used it for data analysis

by bushbaba

3/14/2026 at 2:33:41 AM

I mean, try using copilot on any substantial back-end codebase and watch it eat 90+% just building a plan/checklist. Of course copilot is constrained to 120k I believe? So having 10x that will blow open up some doors that have been closed for me in my work so far.

That said, 120k is pleeenty if you’re just building front-end components and have your API spec on hand already.

by twodave

3/15/2026 at 7:58:16 PM

[dead]

by alexey-pelykh

3/14/2026 at 2:20:52 PM

If it's not coding, even with 200k context it starts to write gibberish, even with the correct information in the context.

I tried to ask questions about path of exile 2. And even with web research on it gave completely wrong information... Not only outdated. Wrong

I think context decay is a bigger problem then we feel like.

by Bombthecat

3/14/2026 at 3:20:30 PM

Fwiw put a copy of the game folder in a directory and tell claude to extract game files and dissasemble the game in preparation for questions about the game.

As an example of doing this in a session with jagged alliance 3 (an rpg) https://pastes.io/jagged-all-69136

Claude extracting game archives and dissasembling leads to far more reliable results than random internet posts.

by AnotherGoodName

3/14/2026 at 4:53:42 PM

You’re having Claude design builds for you by disassembling the game? Am I understanding that right? I guess I’m thinking too small.

by jnovek

3/14/2026 at 6:17:12 PM

Yes exactly. Claude can just go in, extract the compressed game archives, decompile and read the game logic directly for how everything works. ie. You might be curious how certain stats translate into damage. Just do the above and ask Claude "in detail explain from the decompiled code in this folder for game X how certain stats affect damage and suggest builds to maximise damage taking into account character level <10.".

I've found doing this for games to be far more reliable than trying to find internet posts explaining it. I haven't played POE but if it's anything like any other RPG system Claude will do a great job at this.

by AnotherGoodName

3/15/2026 at 1:41:53 AM

Just as another one i did today. I asked claude to decompile Newstower and make a static site that could take in game saves, edit them and download them again.

https://thedailycheat.com/

Seriously coding agents make messing with games super easy.

by AnotherGoodName

3/14/2026 at 5:41:10 PM

This will not work for an online game like PoE 2

Or even one with DRM?

Right?

Or?

by heraldgeezer

3/14/2026 at 6:14:07 PM

DRM just stops you launching/connecting to servers if you modified the binary. It does nothing to stop the binary being pulled apart by a bot with no intention of running it.

The place it may fail is obfuscation and server side logic. But generally client side logic, especially in a game with a scripted language backing it, is super easy for claude ot pick apart.

by AnotherGoodName

3/14/2026 at 3:04:46 PM

Context decay is noticeable within 3 messages, nearly every time. Maybe not substantial, but definitely noticeable.

It’s lead to me starting new chats with bigger and bigger starting ‘summary, prompts to catch the model up while refreshing it. Surely there’s a way to automate that technique.

by Lord-Jobo

3/14/2026 at 3:58:24 PM

Yeah absolutely, at this point I also start new chats after 3-4 prompts. Especially with thinking models that produce so many tokens.

Usually things go smoothly but sometimes I have situations like: “please add feature X, needs to have ABCD.” -> does ABC correct but D wrong -> “here is how to fix D” -> fixes D but breaks AB -> “remember I also want AB this way, you broke it” -> fixes AB but removes C and so on

by AStrangeMorrow

3/14/2026 at 3:10:46 PM

[dead]

by nvardakas

3/14/2026 at 4:41:33 PM

> I build with Claude Code daily and the context decay is real by the end of a long session it starts forgetting decisions we made earlier

I generate task.md files before working on anything, some are short, others are super long and with many steps. The models don't deviate anymore. One trick is to make a post tool use hook to show the first open gate "- [ ]" line from task.md on each tool call. This keeps the agent straight for 100s of gates.

After each gate is executed we don't just check it, we also append a few words of feedback. This makes the task.md become a workbook, covering intent, plan, execution and even judgements. I see it like a programming language now. I can gate any task and the agent will do it, however many steps. It can even generate new gates, or replan itself midway.

You can enforce strict testing policies by just leaning into gate programability power - after each work gate have a test gate, and have judges review testing quality and propose more tests.

The task.md file is like a script or pipeline. It is also like a first class function, it can even ingest other task.md files for regular reflexion. A gate can create or modify gates, or tasks. A task can create or modify gates or tasks.

by visarga

3/16/2026 at 8:47:47 AM

[dead]

by nvardakas

3/14/2026 at 3:01:30 PM

It could also be a skill problem. It would be more helpful if when people made llm sucks claims they shared their prompt.

The people I work with who complain about this type of thing horribly communicate their ask to the llm and expect it to read their minds.

by eric_cc

3/14/2026 at 3:39:59 PM

I don't really understand what you mean by this. The claim is that the same prompt with the same question produces worse results when it's queried in a model that has more than 200k tokens in its context. That doesn't have to do much with the "skillfulness" of using a model.

by namr2000

3/14/2026 at 4:10:59 PM

Prompt quality does matter, but at some point context side does matter.

I’ve had thing like a system that has a collection of procedural systems. I would say “replace the following set of defaults that are passed all around for system X (list of files) and in the managed (file) by a config” and it would do that but I’d suddenly see it be like “wait mu and projection distance are also present in system Y and Z. Let me replace that by a config too with the same values”. When system Y and Z uses a different set of optimized values, and that was clearly outside of the scope.

Never had that kind of mistakes happen when dealing with small contexts, but with larger contexts (multiple files, long “thinking” sequences) it does happen sometimes.

Definitely some times when I though “oh well my bad, I should have clarified NOT to also change that other part”, all the while thinking that no human would have thought to change both

by AStrangeMorrow

3/14/2026 at 4:42:58 PM

None of what has been described is a "skill issue". The problem is when an identical prompt produces poor results once the context window exceeds 200k tokens or so.

by trollbridge

3/14/2026 at 3:56:58 PM

Totally agree the LLM sucks posts should be accompanied with the prompt.

by alwillis

3/14/2026 at 4:07:58 PM

I agree, but at the same time it feels like victim blaming.

by copperx

3/14/2026 at 6:12:11 PM

I don't know. Is pointing out that someone holding a drill by the chuck won't get the results they expect that bad?

by akersten

3/15/2026 at 9:17:51 PM

But what if the drill is non deterministic?

by copperx

3/14/2026 at 9:26:52 PM

Nah, it's a variant of the XY Problem: https://xyproblem.info

by theshrike79

3/14/2026 at 5:20:15 PM

Adding web search doesn't necessarily lead to better information at any context.

In my experience the model will assume the web results are the answer even if the search engine returns irrelevant garbage.

For example you ask it a question about New Jersey law and the web results are about New York or about "many states" it'll assume the New York info or "many states" info is about New Jersey.

by staticman2

3/14/2026 at 3:01:24 PM

I think ChatGPT has a huge advantage here. They have been collecting realistic multi-turn conversational data at a much larger scale. And generally their models appear to be more coherent with larger contexts for general purpose stuff.

by blueblisters

3/14/2026 at 4:31:03 PM

Number one thing you always need to accomplish are feedback loops for Claude so it's able to shotgun program itself to a solution.

by holoduke

3/14/2026 at 3:18:04 PM

The question that comes to mind for me after reading your comment is how can a question about a game require that much context?

by gorjusborg

3/14/2026 at 3:52:50 PM

Path of exile is complex, just check the skill tree , skills and gems:)

It could almost be used as a benchmark good models are in math, memory, updated information etc

by Bombthecat

3/14/2026 at 3:06:05 PM

I feel like few weeks ago i suddenly had a week where even after 3 messages it forgot what we did. Seems fixed now.

by wouldbecouldbe

3/14/2026 at 2:38:09 PM

We need an MCP for path of building

by turbostyler

3/14/2026 at 4:21:10 PM

Agreed, there's no getting around the "break it into smaller contexts" problem that lies between us and generally useful AI.

It'll remain a human job for quite a while too. Separability is not a property of vector spaces, so modern AIs are not going to be good at it. Maybe we can manage something similar with simplical complexes instead. Ideally you'd consult the large model once and say:

> show me the small contexts to use here, give me prompts re: their interfaces with their neighbors, and show me which distillations are best suited to those tasks

...and then a network of local models could handle it from there. But the providers have no incentive to go in that direction, so progress will likely be slow.

by __MatrixMan__

3/14/2026 at 2:33:56 PM

That’s not context decay, that’s training data ambiguity. So much misinformation, nerfs, buffs, changes that an LLM can not keep up given the training time required. Do it for a game that has been stable and it knows its stuff.

by reactordev

3/14/2026 at 2:44:16 PM

It didnt gave outdated, on some cases it did, and with two tries telling it to search for updated information it got it right ( shouldn't need to do that though) but it also gave wrong information about sockets ( support skills) , which never existed or never were able to be socketed together in the first place. ( Ok maybe in 0.1, but that's what web search is for ... ) If it even can't handle easy versioned information from a game. How should it handle anything related to time, dates, news, science etc?

by Bombthecat

3/14/2026 at 5:02:50 PM

Like any human would, 75% certain with 99% confidence. That’s what you fail to realize. They aren’t “god mode machine”. They are “human-mode” machines and humans make mistakes in thinking just like you do. Some might say asking a powerful LLM for gaming tips is a waste of compute power. Others might say it gives you the knowledge of a new meta emerging. Either way, you both are going to get trained.

by reactordev

3/14/2026 at 2:57:55 PM

Please don’t pop the AI bubble, bro. Stop asking questions, bro. Believe the hype, bro.

by serial_dev

3/14/2026 at 4:51:51 PM

What were you asking about PoE 2? So far my _general_ experience with asking LLMs about ARPGs has been meh. Except for Diablo 2 but I think that’s just because Diablo 2 has been heavily discussed for ~25 years.

by jnovek

3/14/2026 at 1:12:02 AM

I've been using the 1M window at work through our enterprise plan as I'm beginning to adopt AI in my development workflow (via Cline). It seems to have been holding up pretty well until about 700k+. Sometimes it would continue to do okay past that, sometimes it started getting a bit dumb around there.

(Note that I'm using it in more of a hands-on pair-programming mode, and not in a fully-automated vibecoding mode.)

by a_e_k

3/14/2026 at 1:23:19 AM

So a picture is worth 1,666 words?

by chatmasta

3/14/2026 at 1:04:56 AM

The quality with the 1M window has been very poor for me, specifically for coding tasks. It constantly forgets stuff that has happened in the existing conversation. n=1, ymmv

by islewis

3/14/2026 at 3:27:25 AM

Yes, especially with shifts in focus of a long conversation. But given the high error rates of Opus 4.6 the last few weeks it is possibly due to other factors. Conversational and code prodding has been essential.

by robwwilliams

3/14/2026 at 1:07:55 AM

Well, the question is what is contributing to the usage. Because as the context grows, the amount of input tokens are increasing. A model call with 800K token as input is 8 times more expensive than a model call with 100K tokens as input. Especially if we resume a conversation and caching does not hit, it would be very expensive with API pricing.

by hagen8

3/14/2026 at 4:52:19 PM

This might burn through usage faster too though.

by j45

3/14/2026 at 2:07:30 PM

yeah it totally does not remain coherent past 200k, would have been too nice.

by jFriedensreich

3/14/2026 at 4:30:59 PM

I bet it depends how homogenous the context is. I bet it works ok near 1M in some cases, but as far as I can tell, those cases are rare.

by __MatrixMan__

3/14/2026 at 3:24:13 PM

[dead]

by alexcali

3/14/2026 at 4:23:04 AM

It’s interesting because my career went from doing higher level language (Python) to lower language (C++ and C). Opus and the like is amazing at Python, honestly sometimes better than me but it does do some really stupid architectural decisions occasionally. But when it comes to embedded stuff, it’s still like a junior engineer. Unsure if that will ever change but I wonder if it’s just the quality and availability of training data. This is why I find it hard to believe LLMs will replace hardware engineers anytime soon (I was a MechE for a decade).

by syntaxing

3/14/2026 at 9:28:53 AM

As someone who did Python professionally from a software engineering perspective, I've actually found Python to be pretty crappy really: unaware of _good_ idioms living outside tutorials and likely 90% of Python code out there that was simply hacked together quickly.

I have not tested, but I would expect more niche ecosystems like Rust or Haskell or Erlang to have better overall training set (developer who care about good engineering focus on them), and potentially produce the best output.

For C and C++, I'd expect similar situation with Python: while not as approachable, it is also being pushed on beginning software engineers, and the training data would naturally have plenty of bad code.

by necovek

3/14/2026 at 11:35:14 AM

I think its pretty good at Elixir, so that tracks.

by jeremyjh

3/14/2026 at 2:47:21 PM

Came here to say this.

by borski

3/14/2026 at 3:19:10 PM

Can you recommend some books that teach these idioms? I know not everything is in books but I suspect a bit of it is

by mettamage

3/14/2026 at 6:33:14 AM

I've found it's ok at Rust. I think a lot of existing Rust code is high quality and also the stricter Rust compiler enforces that the output of the LLM is somewhat reasonable.

by n_u

3/14/2026 at 8:32:55 AM

Yes, it's nice to have a strict compiler, so the agent has to keep fixing its bugs until it actually compiles. Rust and TypeScript are great for this.

by lemagedurage

3/14/2026 at 4:00:03 PM

A big downside with rust is the compile times. Being in a tight AI loop just wasn't part of the design of any existing programming languages.

As languages designed for (and probably written by) AI come out over the next decade, it will be really interesting to see what dragon tradeoffs they make.

by apitman

3/14/2026 at 4:37:04 PM

"cargo check" is fast and it's enough for the AI to know the code is correct.

I would argue that because Rust is so strict having the agent compile and run tests on every iterations is actually less needed then in other languages.

I program mostly in python but I keep my projects strictly typed with basedpyright and it greatly reduced the amount of errors the agent makes because it can get immediate feedback it has done something stupid.

Of course you still need to review the code because it doesn't solve logic bugs.

by veber-alex

3/14/2026 at 6:14:06 PM

cargo check is faster; it's not fast

by apitman

3/14/2026 at 4:33:10 PM

>Being in a tight AI loop just wasn't part of the design of any existing programming languages.

I would dare to say that any Lisp (Common Lisp, Clojure, Racket, whatever) is perfect for a tight AI loop thanks to REPL-driven development. It's an interesting space to explore and I know that the Clojure community at least are trying to figure out something there.

by masijo

3/14/2026 at 11:59:37 AM

Quite sure it's not about the language but the domain.

by raincole

3/14/2026 at 2:25:49 PM

Agreed. When I've written very low level code where there are "odd" constraints ("this function must never take a lock, no system calls can be made" etc) the LLM would accidentally violate them. It seems sort of obvious why - the vast majority of code it is trained on does not have those constraints.

by staticassertion

3/14/2026 at 9:17:44 AM

I think the combinatorial space is just too much. When I did web dev it was mostly transforming HTML/JSON from well-defined type A to well-defined type B. Everything is in text. There's nothing to reason about besides what is in the prompt itself. But constructing and maintaining a mental model of a chip and all of its instructions and all of the empirical data from profiling is just too much for SOTA to handle reliably.

by ipnon

3/14/2026 at 4:33:07 AM

I've had a similar experience as a graphics programmer that works in C++ every day

Writing quick python scripts works a lot better than niche domain specific code

by ex-aws-dude

3/14/2026 at 5:14:29 AM

Unfortunately, I’ve found it’s really good at Wayland and OpenGL. It even knows how to use Clutter and Meta frameworks from the Gnome Mutter stack. Makes me wonder why I learned this all in the first place.

by nullpoint420

3/14/2026 at 6:24:35 AM

To being able to determine it's really good.

by Trufa

3/14/2026 at 10:13:54 AM

LLMsdo great with Rust though

by trenchgun

3/14/2026 at 1:21:56 PM

It is really good at writing C++ for Arduino, can one-shot most programs.

by ricardobeat

3/14/2026 at 2:54:37 PM

I'd say the chance of me one shotting C++ is veeeery low. Same for bash scripts etc. This is where the LLM really shines for me.

by NanoWar

3/14/2026 at 10:11:57 PM

How was the career transition from MechE? Looking to do the same myself.

by Skidaddle

3/14/2026 at 1:35:28 PM

nor web engineers (backend) that are not doing standard crud work.

I have seen these shine on frontend work

by dzonga

3/14/2026 at 6:13:49 AM

[dead]

by imposter

3/13/2026 at 7:54:42 PM

Is there a writeup anywhere on what this means for effective context? I think that many of us have found that even when the context window was 100k tokens the actual usable window was smaller than that. As you got closer to 100k performance degraded substantially. I'm assuming that is still true but what does the curve look like?

by convenwis

3/14/2026 at 2:36:44 AM

> As you got closer to 100k performance degraded substantially

In practice, I haven't found this to be the case at all with Claude Code using Opus 4.6. So maybe it's another one of those things that used to be true, and now we all expect it to be true.

And of course when we expect something, we'll find it, so any mistakes at 150k context use get attributed to the context, while the same mistake at 50k gets attributed to the model.

by esperent

3/14/2026 at 8:27:54 AM

My personal experience is that Opus 4.6 degrades after a while but the degradation is more subtle and less catastrophic than in the past. I still aggressively clear sessions to keep it sharp though.

by peacebeard

3/14/2026 at 5:11:54 AM

Personally, even though performance up to 200k has improved a lot with 4.5 and 4.6, I still try to avoid getting up there — like I said in another comment, when I see context getting up to even 100k, I start making sure I have enough written to disk to type /new, pipe it the diff so far, and just say “keep going.” I feel like the dropoff starts around maybe 150k, but I could be completely wrong. I thought it was funny that the graph in the post starts at 256k, which convenient avoids showing the dropoff I'm talking about (if it's real).

by dcre

3/14/2026 at 12:31:45 AM

I mentioned this at work but context still rots at the same rate. 90k tokens consumed has just as bad results in 100k context window or 1M.

Personally, I’m on a 6M+ line codebase and had no problems with the old window. I’m not sending it blindly into the codebase though like I do for small projects. Good prompts are necessary at scale.

by tyleo

3/13/2026 at 7:57:48 PM

The benchmark charts provided are the writeup. Everything else is just anecdata.

by minimaxir

3/14/2026 at 1:03:42 AM

Isn't transformer attention quadratic in complexity in terms of context size? In order to achieve 1M token context I think these models have to be employing a lot of shortcuts.

I'm not an expert but maybe this explains context rot.

by FartyMcFarter

3/14/2026 at 1:56:46 AM

Nope, there’s no tricks unless there’s been major architectural shifts I missed. The rot doesn’t come from inference tricks to try to bring down quadratic complexity of the KV cache. Task performance problems are generally a training problem - the longer and larger the data set, the fewer examples you have to train on it. So how do you train the model to behave well - that’s where the tricks are. I believe most of it relies on synthetically generated data if I’m not mistaken, which explains the rot.

by vlovich123

3/14/2026 at 10:32:34 AM

A quick Google search reveals terms such as "sparse attention" that are used to avoid quadratic runtime.

I don't know if Anthropic has revealed such details since AI research is getting more and more secretive, but the architectural tricks definitely exist.

by FartyMcFarter

3/15/2026 at 4:22:44 PM

Then you need to do a little bit deeper research. No one just applies sparse attention at inference time for a model not trained for it. They do this at training time because otherwise the task performance degrades too much.

by vlovich123

3/14/2026 at 12:46:48 AM

The weirdest thing about Claude pricing is their 5X pricing plan is 5 times the cost of the previous plan.

Normally buying the bigger plan gives some sort of discount.

At Claude, it's just "5 times more usage 5 times more cost, there you go".

by wewewedxfgdf

3/14/2026 at 1:16:48 AM

Those sorts of volume discounts are what you do when you're trying to incentivize more consumption. Anthropic already has more demand then they're logistically able to serve, at the moment (look at their uptime chart, it's barely even 1 9 of reliability). For them, 1 user consuming 5 units of compute is less attractive than 5 users consuming 1 unit.

They would probably implement _diminishing_-value pricing if pure pricing efficiency was their only concern.

by apetresc

3/14/2026 at 1:01:16 AM

It is not the plan they want you to buy. It is a pricing strategy to get you to buy the 20x plan.

by auggierose

3/14/2026 at 1:11:12 AM

5x Max is the plan I use because the Pro plan limits out so quickly. I don't use Claude full-time, but I do need Claude Code, and I do prefer to use Opus for everything because it's focused and less chatty.

by radley

3/14/2026 at 1:17:53 AM

Sure, I get it. For me a 2x Max would be ideal and usually enough. Now, guess why they are not offering that?

by auggierose

3/14/2026 at 6:12:31 PM

Where do you live? I'm in the midwest, US, and theoretical savings between 2x and 5x amounts to a single full bag of groceries. Literally.

How can this possibly be a concern?

by laksjhdlka

3/16/2026 at 8:28:16 AM

You are clearly overpaying for your groceries. I can cook for a week from the difference.

If there wasn't a significant difference, they would offer 2x Max, the demand is there. But they don't. Clearly, their strategy works on fanboys like you.

by auggierose

3/14/2026 at 3:29:13 AM

Same here. I'd love a 2x Max plan! More than enough usage for my needs.

by prtmnth

3/14/2026 at 12:58:21 PM

Just make two plans and switch, when one of them is exhausted.

by geggo98

3/14/2026 at 1:04:54 PM

Yes, I hear that is what people do. Annoying though.

by auggierose

3/14/2026 at 1:02:02 AM

I think they are both subsidized so either is a great deal.

by operatingthetan

3/14/2026 at 3:45:09 PM

Yeah the free lunch on tokens is almost over. Get them while they’re still cheap

by cush

3/14/2026 at 6:28:50 AM

5 times the already subsidised rate is still a discount.

by merrvk

3/14/2026 at 3:43:55 AM

We’ll make it up on volume.

by tclancy

3/14/2026 at 1:01:06 AM

5 for 5

by Zambyte

3/13/2026 at 7:51:44 PM

Claude Code 2.1.75 now no longer delineates between base Opus and 1M Opus: it's the same model. Oddly, I have Pro where the change supposedly only for Max+ but am still seeing this to be case.

EDIT: Don't think Pro has access to it, a typical prompt just hit the context limit.

The removal of extra pricing beyond 200k tokens may be Anthropic's salvo in the agent wars against GPT 5.4's 1M window and extra pricing for that.

by minimaxir

3/14/2026 at 12:43:13 AM

No change for Pro, just checked it, the 1M context is still extra usage.

by auggierose

3/14/2026 at 2:00:18 AM

I have Max 20x and they're still separate on 2.1.75.

by zaptrem

3/15/2026 at 12:42:18 PM

Mine took restarting. Is it still separate for you? Also might not work yet if you have CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC=1 on

by hackyon1

3/14/2026 at 3:34:58 AM

Opus 4.6 is nuts. Everything I throw at it works. Frontend, backend, algorithms—it does not matter.

I start with a PRD, ask for a step-by-step plan, and just execute on each step at a time. Sometimes ideas are dumb, but checking and guiding step by step helps it ship working things in hours.

It was also the first AI I felt, "Damn, this thing is smarter than me."

The other crazy thing is that with today's tech, these things can be made to work at 1k tokens/sec with multiple agents working at the same time, each at that speed.

by Frannky

3/14/2026 at 4:07:50 AM

I wish I had this kind of experience. I threw a tedious but straightforward task at Claude Code using Opus 4.6 late last week: find the places in a React code base where we were using useState and useEffect to calculate a value that was purely dependent on the inputs to useEffect, and replace them with useMemo. I told it to be careful to only replace cases where the change did not introduce any behavior changes, and I put it in plan mode first.

It gave me an impressive plan of attack, including a reasonable way to determine which code it could safely modify. I told it to start with just a few files and let me review; its changes looked good. So I told it to proceed with the rest of the code.

It made hundreds of changes, as expected (big code base). And most of them were correct! Except the places where it decided to do things like put its "const x = useMemo(...)" call after some piece of code that used the value of "x", meaning I now had a bunch of undefined variable references. There were some other missteps too.

I tried to convince it to fix the places where it had messed up, but it quickly started wanting to make larger structural changes (extracting code into helper functions, etc.) rather than just moving the offending code a few lines higher in the source file. Eventually I gave up trying to steer it and, with the help of another dev on my team, fixed up all the broken code by hand.

It probably still saved time compared to making all the changes myself. But it was way more frustrating.

by koreth1

3/14/2026 at 4:24:32 AM

One tip I have is that once you have the diff you want to fix, start a new session and have it work on the diff fresh. They’ve improved this, but it’s still the case that the farther you get into context window, the dumber and less focused the model gets. I learned this from the Claude Code team themselves, who have long advised starting over rather than trying to steer a conversation that has started down a wrong path.

I have heard from people who regularly push a session through multiple compactions. I don’t think this is a good idea. I virtually never do this — when I see context getting up to even 100k, I start making sure I have enough written to disk to type /new, pipe it the diff so far, and just say “keep going.” I learned recently that even essentials like the CLAUDE.md part of the prompt get diluted through compactions. You can write a hook to re-insert it but it's not done by default.

This fresh context thing is a big reason subagents might work where a single agent fails. It’s not just about parallelism: each subagent starts with a fresh context, and the parent agent only sees the result of whatever the subagent does — its own context also remains clean.

by dcre

3/14/2026 at 4:49:27 AM

Yeah, I start most of my sessions now with “read the diff between this branch and main”. Seems like it grounds and focuses it.

by kjohanson

3/14/2026 at 5:27:14 AM

Slight tangent: you want to read the diff between your branch and the merge-base with origin/main. Otherwise you get lots of spurious spam in your diff, if main moved since you branched off.

by eru

3/15/2026 at 12:14:00 AM

In jj this is jj diff -f ‘fork_point(trunk() | @)’. I have an alias for it.

by dcre

3/15/2026 at 6:41:44 AM

What's jj? In Git I also have an alias for diffing against the merge-base. (It's also what GitHub gives you by default in the webview.)

by eru

3/14/2026 at 6:50:40 AM

One thing that seems important is to have the agent write down their plan and any useful memory in markdown files, so that further invocations can just read from it

by nextaccountic

3/14/2026 at 6:29:23 AM

IMO it seems to start "forgetting" or "overlooking" claude.md well before the context window is full.

by Glyptodon

3/14/2026 at 5:25:16 AM

subagents are huge, could execute on a massive plan that should easily fill up a 200k context window and be done atnaround 60k for the orchestration agent.

as a cheapass, being able to pass off the simple work to cheaper $ per token agents is also just great. I've got a handful of tasks I can happily delegate work to a haiku agent and anything requiring a bit of reasoning goes to sonnet.

Feel like opus is almost a cheatcode when i do get stuck, i just bust out a full opus workflow instead and it just destroys everything i was struggling with usually. like playing on easy mode.

as cool as this stuff is, kinda still wish i was just grandfathered into the plan with no weekly limit and only the 5 hour window limits, id just be happily hammering opus blissfully.

by sidrag22

3/14/2026 at 1:26:40 PM

>"This fresh context thing is a big reason subagents might work where a single agent fails. It’s not just about parallelism: each subagent starts with a fresh context, and the parent agent only sees the result of whatever the subagent does — its own context also remains clean."

This is the true power of agent teams: https://code.claude.com/docs/en/agent-teams

You maintain very low context usage in the main thread; just orchestration and planning details, while each individual team member remains responsible for their own. Allows you to churn through millions of output tokens in a fraction of the time.

by ramesh31

3/14/2026 at 7:50:05 AM

Same here. I don't understand how people leave it running on an "autopilot" for long periods of time. I still use it interactively as an assistant, going back and forth and stepping in when it makes mistakes or questionable architectural decisions. Maybe that workflow makes more sense if you're not a developer and don't have a good way to judge code quality in the first place.

There's probably a parallel with the CMSes and frameworks of the 2000s (e.g. WordPress or Ruby on Rails). They massively improved productivity, but as a junior developer you could get pretty stuck if something broke or you needed to implement an unconventional feature. I guess it must feel a bit similar for non-developers using tools like Claude Code today.

by olalonde

3/14/2026 at 1:30:02 PM

>Same here. I don't understand how people leave it running on an "autopilot" for long periods of time.

Things have changed. The models have reached a level of coherence that they can be left to make the right decisions autonomously. Opus 4.6 is in a class of its own now.

by ramesh31

3/14/2026 at 1:42:00 PM

A non-technical client of mine has built an entire app with a very large feature set with Opus. I declined to work on it to clean it up, I was afraid it would have been impossible and too much risk. I think we are at a level where it can build and auto-correct its mistakes, but the code is still slop and kind of dangerous to put in production. If you care about the most basic security.

by devld

3/14/2026 at 4:13:41 AM

Branch first so you can just undo. I think this would have worked with sub agents and /loop maybe? Write all items to change to a todo.md. Have it split up the work with haiku sub agents doing 5-10 changes at a time, marking the todos done, and /loop until all are done. You’ll succeed I suspect. If the main claude instance compacts its context - stop and start from where you left off.

by conception

3/14/2026 at 4:24:34 AM

It actually did automatically break the work up into chunks and launched a bunch of parallel workers to each handle a smaller amount of work. It wasn't doing everything in a single instance.

The problem wasn't that it lost track of which changes it needed to make, so I don't think checking items off a todo list would have helped. I believe it did actually change all the places in the code it should have. It just made the wrong changes sometimes.

But also, the claim I was responding to was, "I start with a PRD, ask for a step-by-step plan, and just execute on each step at a time." If I have to tell it how to organize its work and how to keep track of its progress and how to execute all the smaller chunks of work, then I may get good results, but the tool isn't as magical (for me, anyway) as it seems to be for some other people.

by koreth1

3/14/2026 at 4:41:09 AM

The next line in the comment you’re responding to is

> Sometimes ideas are dumb, but checking and guiding step by step helps it ship working things in hours.

which matches my experience exactly. I consider it to be about as magical as the parent comment is claiming, but I wouldn’t call it totally automatic.

by monkpit

3/14/2026 at 5:33:07 AM

If you use eslint and tell it how to run lint in CLAUDE.md it will run lint itself and find and fix most issues like this.

Definitely not ideal, but sure helps.

by a13n

3/14/2026 at 7:47:38 AM

Undefined variable references? Did you not instruct it to run typescript after changes?

by jdkoeck

3/14/2026 at 11:13:24 AM

Start over, create a new plan with the lessons learned.

You need to converge on the requirements.

by stpedgwdgfhgdd

3/14/2026 at 4:34:28 AM

You’re using it wrong. As soon as it starts going off the rails once you’ve repeated yourself, you drop the whole session and start over.

by dyauspitr

3/14/2026 at 5:47:53 AM

One of the more subtle points that seems to be crucial is that it works a lot better when it can use the context as part of its own work rather than being polluted by unrelated details. Even better than restarting when it's off the rails is to avoid it as much as possible by proactively starting a new conversation as soon as anything in the history of the existing one stops being relevant. I've found it more effective to manually tell it most what's currently in the context in a fresh session skip the irrelevant bits even if they're fairly small than relying on it to figure out that it's no longer relevant (or give it instructions indicating that, which feels like a crapshoot whether it's actually going to prune or just bloat things further with that instruction just being added into the mix).

To echo what the parent comment said, it's almost frustrating how effective it can be at certain tasks that I wouldn't ever have the patience for. At my job recently I needed to prototype calling some Python code via WASM using the Rust wasmtime engine, and setting up the code structure to have the bytes for the WASM component, the arguments I wanted to pass to the function, and the WIT describing the interface for the function, it was able to fill in all of the boilerplate needed so that the function calls worked properly within a minute or two on the first try; reading through all the documentation and figuring out how exactly which half dozen assorted things I had to import and hook up together in the correct order would have probably taken me an hour at minimum.

I don't have any particular insight on whether or not these tools will become even more powerful over time, and I still have fairly strong concerns about how AI tools will affect society (both in terms of how they're used and the amount of in energy used to produce them in the first place), but given how much the tech industry tends to prioritize productivity over social concerns, I have to assume that my future employment is going to be heavily impacted by my willingness to adopt and use these tools. I can't deny at this point that having it as an option would make me more productive than if I refuse to use it, regardless of my personal opinions on it.

by saghm

3/14/2026 at 4:20:22 AM

What kinds of things are you building? This is not my experience at all.

Just today I asked Claude using opus 4.6 to build out a test harness for a new dynamic database diff tool. Everything seemed to be fine but it built a test suite for an existing diff tool. It set everything up in the new directory, but it was actually testing code and logic from a preexisting directory despite the plan being correct before I told it to execute.

I started over and wrote out a few skeleton functions myself then asked it write tests for those to test for some new functionality. Then my plan was to the ask it to add that functionality using the tests as guardrails.

Well the tests didn’t actually call any of the functions under test. They just directly implemented the logic I asked for in the tests.

After $50 and 2 hours I finally got something working only to realize that instead of creating a new pg database to test against, it found a dev database I had lying around and started adding tables to it.

When I managed to fix that, it decided that it needed to rebuild multiple docker components before each test and test them down after each one.

After about 4 hours and $75, I managed to get something working that was probably more code than I would have written in 4 hours, but I think it was probably worse than what I would have come up with on my own. And I really have no idea if it works because the day was over and I didn’t have the energy left to review it all.

We’ve recently been tasked at work with spending more money on Claude (not being more productive the metric is literally spending more money) and everyone is struggling to do anything like what the posts on HN say they are doing. So far no one in my org in a very large tech company has managed to do anything very impressive with Claude other than bringing down prod 2 days ago.

Yes I’m using planning mode and clearing context and being specific with requirements and starting new sessions, and every other piece of advice I’ve read.

I’ve had much more luck using opus 4.6 in vs studio to make more targeted changes, explain things, debug etc… Claude seems too hard to wrangle and it isn’t good enough for you to be operating that far removed from the code.

by sarchertech

3/14/2026 at 5:16:09 AM

Similar experience. I use these AI tools on a daily basis. I have tons of examples like yours. In one recent instance I explicitly told it in the prompt to not use memcpy, and it used memcpy anyway, and generated a 30-line diff after thinking for 20 minutes. In that amount of time I created a 10-line diff that didn't use memcpy.

I think it's the big investors' extremely powerful incentives manifesting in the form of internet comments. The pace of improvement peaked at GPT-4. There is value in autocomplete-as-a-service, and the "harnesses" like Codex take it a lot farther. But the people who are blown away by these new releases either don't spend a lot of time writing code, or are being paid to be blown away. This is not a hockey stick curve. It's a log curve.

Bigger context windows are a welcome addition. And stuff like JSON inputs is nice too. But these things aren't gonna like, take your SWE job, if you're any good. It's just like, a nice substitute for the Google -> Stack Overflow -> Copy/Paste workflow.

by jhatemyjob

3/14/2026 at 3:08:38 PM

Most devs aren't very good. That's the reality, it's what we've all known for a long time. AI is trained on their code, and so these "subpar" devs are blown away when they see the AI generate boring, subpar code.

The second you throw a novel constraint into the mix things fall apart. But most devs don't even know about novel constraints let alone work with them. So they don't see these limitations.

Ask an LLM to not allocate? To not acquire locks? To ensure reentrancy safety? It'll fail - it isn't trained on how to do that. Ask it to "rank" software by some metric? It ends up just spitting out "community consensus" because domain expertise won't be highly represented in its training set.

I love having an LLM to automate the boring work, to do the "subpar" stuff, but they have routinely failed at doing anything I consider to be within my core competency. Just yesterday I used Opus 4.6 to test it out. I checked out an old version of a codebase that was built in a way that is totally inappropriate for security. I asked it to evaluate the system. It did far better than older models but it still completely failed in this task, radically underestimating the severity of its findings, and giving false justifications. Why? For the very obvious reason that it can't be trained to do that work.

by staticassertion

3/15/2026 at 6:18:51 PM

The people glazing these tools can't design systems. I have this founder friend who I've known for decades, he knows how to code but he isn't really interested in it; he's more interested in the business side and mostly sees programming as a way to make money. Before ChatGPT he would raise money and hire engineers ASAP. When not a founder he would try to get into management roles etc etc. About a year ago he told me he doesn't really write code anymore, and he showed me part of his codebase for this new company he's building. To my horror I saw a 500-line bash script that he claimed he did not understand and just used prompts to edit it.

It didn't need to be a bash script. It could have been written in any scripting language. I presume it started off as a bash script because that's probably what it started out as when he was exploring the idea. And since it was already bash I guess he decided to just keep going with it. But it was just one of those things where I was like, these autocomplete services would never stop and tell you "maybe this 500-line script should be rewritten in python", it will just continue to affirm you, and pile onto the tech debt.

I used to freak out and think my days were numbered when people claimed they stopped writing code. But now I realize that they don't like writing code, don't care about getting better at it, don't know what good code looks like, and would hire an engineer if they could. With that framing, whenever I see someone say "Opus 4.6 is nuts. Everything I throw at it works. Frontend, backend, algorithms—it does not matter." I know for a fact that "everything" in that persons mind is very limited in scope.

Also, I just realized that there was an em-dash in that comment. So there's that. Wasn't even written by a person.

by jhatemyjob

3/15/2026 at 8:00:47 PM

> I know for a fact that "everything" in that persons mind is very limited in scope.

I agree, and I think it's quite telling what people are impressed by. Someone elsewhere said that Opus 4.6 is a better programmer than they are and... I mean, I kinda believe it, but I think it says way more about them than it does about Opus.

by staticassertion

3/15/2026 at 8:38:24 PM

Yep that's from the same comment I quoted. Decent chance it's not even a real person.

by jhatemyjob

3/14/2026 at 12:23:46 PM

> people who are blown away by these new releases either don't spend a lot of time writing code, or are being paid to be blown away

Careful, or you're going to get slapped by the stupid astroturfing rule... but you're correct. Also there's the sunk cost fallacy, post purchase rationalization, choice supportive bias, hell look at r/MyBoyfriendIsAI... some people get very attached to these bots, they're like their work buddies or pets, so you don't even need to pay them, they'll glaze the crap out it themselves.

by Culonavirus

3/14/2026 at 4:30:19 AM

Curious what language and stack. And have people at your company had marginally more success with greenfield projects like prototypes? I guess that’s what you’re describing, though it sounds like it’s a directory in a monorepo maybe?

by dcre

3/14/2026 at 4:35:56 AM

This was in Go, but my org also uses Typescript, and Elixir.

I’ve had plenty of success with greenfield projects myself but using the copilot agent and opus 4.5 and 4.6. I completely vibecoded a small game for my 4 year old in 2 hours. It’s probably 20% of the way to being production ready if I wanted to release it, but it works and he loves it.

And yes people have had success with very simple prototypes and demos at work.

by sarchertech

3/14/2026 at 2:33:04 PM

Try https://github.com/gsd-build/get-shit-done. It's been a game changer for me.

by JoeMerchant

3/14/2026 at 8:08:20 PM

> After about 4 hours and $75

Huh? The max plan is $200/month. How are you spending $75 in 4 hrs?

by godd2

3/15/2026 at 1:58:53 AM

Enterprise plan. We've been instructed that our goal is to spend at least as much as our salary.

by sarchertech

3/15/2026 at 2:22:32 AM

> is to spend at least as much as our salary

Reads as a very distopian "let's see how many people we can replace"

by FrostKiwi

3/14/2026 at 5:19:34 AM

You probably just don't have the hang of it yet. It's very good but it's not a mind reader and if you have something specific you want, it's best to just articulate that exactly as best you can ("I want a test harness for <specific_tool>, which you can find <here>"). You need to explain that you want tests that assert on observable outcomes and state, not internal structure, use real objects not mocks, property based testing for invariants, etc. It's a feedback loop between yourself and the agent that you must develop a bit before you start seeing "magic" results. A typical session for me looks like:

- I ask for something highly general and claude explores a bit and responds.

- We go back and forth a bit on precisely what I'm asking for. Maybe I correct it a few times and maybe it has a few ideas I didn't know about/think of.

- It writes some kind of plan to a markdown file. In a fresh session I tell a new instance to execute the plan.

- After it's done, I skim the broad strokes of the code and point out any code/architectural smells.

- I ask it to review it's own work and then critique that review, etc. We write tests.

Perhaps that sounds like a lot but typically this process takes around 30-45 minutes of intermittent focus and the result will be several thousand lines of pretty good, working code.

by extr

3/14/2026 at 2:32:12 PM

I absolutely have the hang of Claude and I still find that it can make those ridiculous mistakes, like replicating logic into a test rather than testing a function directly, talking to a local pg that was stale/ running, etc. I have a ton of skills and pre-written prompts for testing practices but, over longer contexts, it will forget and do these things, or get confused, etc.

You can minimize these problems with TLC but ultimately it just will keep fucking up.

by staticassertion

3/15/2026 at 6:33:30 AM

Don't know what to tell you. Sounds like you're holding it wrong. Based on the current state of things I would try to get better at holding it the right way.

by extr

3/15/2026 at 9:50:11 AM

I can't tell if you're joking?

by staticassertion

3/14/2026 at 4:48:27 PM

My favorite is when you need to rebuild/restart outside of claude and it will "fix the bug" and argue with you about whether or not you actually rebuilt and restarted whatever it is you're working on. It would rather call you a liar than realize it didn't do anything.

by withinboredom

3/14/2026 at 8:42:29 PM

this is a pretty annoying problem -- i just intentionally solve it by asking claude to always use the right build command after each batch of modifications, etc

by kyyol

3/14/2026 at 5:04:49 PM

"That's an old run, rebuild and the new version will work" lol

by staticassertion

3/14/2026 at 10:34:43 AM

With the back and forth refining I find it very useful to tell Claude to 'ask questions when uncertain' and/or to 'suggest a few options on how to solve this and let me choose / discuss'

This has made my planning / research phase so much better.

by Huppie

3/14/2026 at 5:36:59 AM

Yes pretty much my workflow. I also keep all my task.md files around as part of the repo, and they get filled up with work details as the agent closes the gates. At the end of each one I update the project memory file, this ensures I can always resume any task in a few tokens (memory file + task file == full info to work on it).

by visarga

3/14/2026 at 6:15:59 AM

Pretty good workflow. But you need to change the order of the tests and have it write the tests first. (TDD)

by __mharrison__

3/14/2026 at 6:25:48 AM

I mean I’ve been using AI close to 4 years now and I’ve been using agents off and on for over a year now. What you’re describing is exactly what I’m doing.

I’m not seeing anyone at work either out of hundreds of devs who is regularly cranking out several thousand lines of pretty good working code in 30-45 minutes.

What’s an example of something you built today like this?

by sarchertech

3/15/2026 at 6:59:10 AM

Fair, that's optimistic, and it depends what you're doing. Looking at a personal project I had a PR from this week at +3000 -500 that I feel quite good about, took about 2 nights of about an hour each session to shape it into what I needed (a control plane for a polymarket trading engine). Though if I'm being fair, this was an outlier, only possible because I very carefully built the core of the engine to support this in advance - most of the 3K LoC was "boilerplate" in the sense I'm just manipulating existing data structures and not building entirely new abstractions. There are definitely some very hard-fought +175 -25 changes in this repo as well.

Definitely for my day job it's more like a few hundred LoC per task, and they take longer. That said, at work there are structural factors preventing larger changes, code review, needing to get design/product/coworker input for sweeping additions, etc. I fully believe it would be possible to go faster and maintain quality.

by extr

3/15/2026 at 12:28:07 PM

Those numbers are much more believable, but now we’re well into maybe a 2-3x speed up. I can easily write 500 LOC in an hour if I know exactly what I’m building (ignoring that LOC is a terrible metric).

But now I have to spend more time understanding what it wrote, so best case scenario we’re talking maybe a 50% speed up to a part of my job that I spent maybe 10-20% on.

Making very big assumptions that this doesn’t add long term maintenance burdens or result in a reduction of skills that makes me worse at reviewing the output, it’s cool technology.

On par with switching to a memory managed language or maybe going from J2EE to Ruby on Rails.

by sarchertech

3/16/2026 at 5:01:12 AM

Thinking in terms of a "speed up multiplier" undersells it completely. The speed up on a task I would have never even attempted is infinite. For my +3000 PR recently on my polymarket engine control plane, I had no idea how these types of things are typically done. It would have taken me many hours to think through an implementation and hours of research online to assemble an understanding on typical best practices. Now with AI I can dispatch many parallel agents to examine virtually all all public resources for this at once.

Basically if it's been done before in a public facing way, you get a passable version of that functionality "for free". That's a huge deal.

by extr

3/14/2026 at 5:10:32 AM

I find that Opus misses a lot of details in the code base when I want it to design a feature or something. It jumps to a basic solution which is actually good but might affect something elsewhere.

GPT 5.4 on codex cli has been much more reliable for me lately. I used to have opus write and codex review, I now to the opposite (I actually have codex write and both review in parallel).

So on the latest models for my use case gpt > opus but these change all the time.

Edit: also the harness is shit. Claude code has been slow, weird and a resource hog. Refuses to read now standardized .agents dirs so I need symlink gymnastics. Hides as much info as it can… Codex cli is working much better lately.

by eknkc

3/14/2026 at 5:41:55 AM

Codex CLI is so much more pleasant to use than CC. I cancelled my CC subscription after the OpenCode thing, but somewhat ironically have recently found myself naturally trying the native Codex CLI client first more often over OpenCode.

Kinda funny how you don't actually need to use coercion if you put in the engineering work to build a product that's competitive on its own technical merits...

by toraway

3/14/2026 at 5:52:49 AM

Im convinced everyone saying this is building the simplest web apps, and doing magic tricks on themselves.

by ai_fry_ur_brain

3/14/2026 at 10:09:34 AM

I've been building a new task manager in C for Linux.

If you're not using AI you are cooked. You just don't realize it yet.

https://i.imgur.com/YXLZvy3.png

by hparadiz

3/14/2026 at 10:36:12 AM

> If you're not using AI you are cooked. You just don't realize it yet.

Truth. But not just “using”.

Because here’s where this ship has already landed: humans will not write code, humans will not review code.

I see mostly rage against this idea, but it is already here. Resistance is futile. There will be no “hand crafted software” shops. You have at most 3-4 years left if you think this is your job.

by popcorncowboy

3/14/2026 at 11:16:34 AM

I don't really agree.

People should still understand the code because sometimes the AI solution really is wrong and I have to shove my hand in it's guts and force it to use my solution or even explain the reasoning.

People should be studying architecture. Cause now I can orchestrate stuff that used to take teams and I would throwaway as a non-viable idea. Now I can just do it. But no you will still be reviewing code.

by hparadiz

3/14/2026 at 4:42:56 PM

Most people as at March 2026 still agree with you.

by popcorncowboy

3/14/2026 at 8:10:26 PM

People still understand metallurgy and casting even though machines make all the paperclips.

by godd2

3/14/2026 at 2:01:25 PM

Are you using AI to write this? Please stop.

by qweiopqweiop

3/14/2026 at 5:07:42 PM

It has subpar grammar (uncapitalized word "humans" and "hand crafted" is unhyphenated). I think you're hallucinating.

by arcanemachiner

3/14/2026 at 6:17:43 PM

Clearly so. To me it's the LLM writing style at least.

by qweiopqweiop

3/14/2026 at 4:43:46 PM

Said like a bot. Please stop.

by popcorncowboy

3/20/2026 at 1:11:59 AM

you're engaging w a troll

by chrisweekly

3/14/2026 at 5:55:47 AM

What evidence would convince you otherwise?

by raldi

3/14/2026 at 8:02:00 AM

Session dumps would be nice.

by isbvhodnvemrwvn

3/14/2026 at 12:45:46 PM

My experience is that it gets you 80-90% of the way at 20x the speed, but coaxing it into fixing the remaining 10-20% happens at a staggeringly slow speed.

All programming is like this to some extent, but Claude's 80/20 behavior is so much more extreme. It can almost build anything in 15-30 minutes, but after those 15-30 minutes are up, it's only "almost built". Then you need to spend hours, days, maybe even weeks getting past the "almost".

Big part of why everyone seems to be vibe coding apps, but almost nobody seems to be shipping anything.

by marginalia_nu

3/14/2026 at 4:02:07 AM

I had Opus 4.6 running on a backend bug for hours. It got nowhere. Turned out the problem was in AWS X-ray swizzling the fetch method and not handling the same argument types as the original, which led to cryptic errors.

I had Opus 4.6 tell me I was "seeing things wrong" when I tried to have it correct some graphical issues. It got stuck in a loop of re-introducing the same bug every hour or so in an attempt to fix the issue.

I'm not disagreeing with your experience, but in my experience it is largely the same as what I had with Opus 4.5 / Codex / etc.

by interpol_p

3/14/2026 at 5:11:46 AM

Haha, reminds me of an unbelievably aggravating exchange with Codex (GPT 5.4 / High) where it was unflinchingly gaslighting me about undesired behavior still occurring after a change it made that it was adamant simply could not be happening.

It started by insisting I was repeatedly making a typo and still would not budge even after I started copy/pasting the full terminal history of what I was entering and the unabridged output, and eventually pivoted to darkly insinuating I was tampering with my shell environment as if I was trying to mislead it or something.

Ultimately it turned out that it forgot it was supposed to be applying the fixes to the actual server instead of the local dev environment, and had earlier in the conversation switched from editing directly over SSH to pushing/pulling the local repo to the remote due to diffs getting mangled.

by toraway

3/14/2026 at 5:35:55 AM

I am starting to believe it’s not OPUS but developers getting better at using LLMs across the board. And not realizing they are just getting much better at using these tools.

I also thought it was OPUS 4.5 (also tested a lot with 4.6) and then in February switched to only using auto mode in the coding IDEs. They do not use OPUS (most of the times), and I’m ending up with a similar result after a very rough learning curve.

Now switching back to OPUS I notice that I get more out of it, but it’s no longer a huge difference. In a lot of cases OPUS is actually in the way after learning to prompt more effectively with cheaper models.

The big difference now is that I’m just paying 60-90$ month for 40-50hrs of weekly usage… while I was inching towards 1000$ with OPUS. I chose these auto modes because they don’t dig into usage based pricing or throttling which is a pretty sweet deal.

by fbrncci

3/14/2026 at 9:03:41 AM

Opus is not an acronym.

by danielbln

3/14/2026 at 11:47:05 AM

I know, but its certainly a new paradigm.

by fbrncci

3/14/2026 at 5:44:51 PM

O.P.U.S OutProgram U Soon

by devld

3/14/2026 at 7:15:44 AM

I had similar thoughts regarding "we are simply getting better at using them", but the man I tried Gemini again and reconsidered.

by copperx

3/14/2026 at 5:50:59 AM

> PRD

Is it Baader-Meinhof or is everyone on HN suddenly using obscure acronyms?

by olalonde

3/14/2026 at 5:56:42 AM

It stands for Product Requirements Document, it is something commonly used in project planning and management.

by shujito

3/14/2026 at 6:08:26 AM

Maybe so, but personally it seemed to be referred to as a "specification" or "spec" for a long time, and then suddenly around maybe 5 years ago I started to hear people use "PRD". I'm not sure what caused the change.

by epicureanideal

3/14/2026 at 8:02:36 AM

Yep, software specs or requirements[0]. Thanks to LLMs it's easy to look up acronyms, but still, it feels like there's an uptick of them on HN[1]...

[0] https://en.wikipedia.org/wiki/Software_requirements_specific...

[1] https://news.ycombinator.com/item?id=47323316 who the hell knows that version of "RSI"?

by olalonde

3/14/2026 at 12:57:59 PM

Seems commonly used in Big Tech - first time I heard it was in my current job. Now it's seared into my brain since it's used so much. Among many other acronyms which I won't bore you with.

by nvarsj

3/14/2026 at 4:02:42 AM

I had been able to get it into the classic AI loop once.

It was about a problem with calculation around filling a topographical water basin with sedimentation where calculation is discrete (e.g. turn based) and that edge case where both water and sediments would overflow the basin; To make the matter simple, fact was A, B, C, and it oscillated between explanation 1 which refuted C, explanation 2 which refuted A and explanation 3 that refuted B.

I'll give it to opus training stability that my 3 tries using it all consistently got into this loop, so I decided to directly order it to do a brute force solution that avoided (but didn't solve) this problem.

I did feel like with a human, there's no way that those 3 loop would happen by the second time. Or at least the majority of us. But there is just no way to get through to opus 4.6

by Aperocky

3/14/2026 at 7:07:46 AM

> It was also the first AI I felt, "Damn, this thing is smarter than me."

1000% agree. It's also easy to talk to it about something you're not sure it said and derive a better, more elegant solution with simple questioning.

Gemini 3.1 also gives me these vibes.

by schainks

3/14/2026 at 3:36:29 AM

> [...] with multiple agents working at the same time, each at that speed.

Horizontal parallelising of tasks doesn't really require any modern tech.

But I agree that Opus 4.6 with 1M context window is really good at lots of routine programming tasks.

by eru

3/14/2026 at 3:56:49 AM

Opus helped me brick my RPi CM4 today. It glibly apologized for telling to use an e instead of a 6 in a boot loader sequence.

Spent an hour or so unraveling the mess. My feeling are growing more and more conflicted about these tools. They are here to stay obviously.

I’m honestly uncertain about the junior engineers I’m working with who are more productive than they might be otherwise, but are gaining zero (or very little) experience. It’s like the future is a world where the entire programming sphere is dominated by the clueless non technical management that we’ve all had to deal with in small proportion a time or two.

by travisgriggs

3/14/2026 at 4:34:20 AM

> I’m honestly uncertain about the junior engineers I’m working with who are more productive than they might be otherwise, but are gaining zero (or very little) experience.

Well, (economic) progress means being able to do more with less. A Fordian-style conveyor belt factory can churn out cars with relatively unskilled labour.

Economising on human capital is economising on a scarce input.

We had these kinds of shifts before. Compare also how planes used to have a pilot, copilot and flight engineer. We don't have that anymore, but it used to be a place for people to learn. But pilot education has adapted.

Or check how spreadsheet software has removed a lot of the worst rote work in finance. That change happened perhaps in the 1980s. Finance has adapted.

> Opus helped me brick my RPi CM4 today. It glibly apologized for telling to use an e instead of a 6 in a boot loader sequence.

Yes, these things do best when they have a (simulated) environment they can make mistakes in and that can give them clear and fast feedback.

by eru

3/14/2026 at 5:19:12 AM

> Yes, these things do best when they have a (simulated) environment they can make mistakes in and that can give them clear and fast feedback.

This always felt like a reason to throw it at coding. With its rigid syntax you'll know quickly and cheaply if what was written passes an absolute minimaal level of quality.

by hvb2

3/14/2026 at 7:55:21 AM

Well, rigid syntax, type checkers, automated tests, etc. They all help.

by eru

3/16/2026 at 6:04:56 AM

Has this been shown to be true? Or just anecdotal?

by travisgriggs

3/16/2026 at 9:11:03 AM

Anecdotal from my own experience. But someone might have done a study by now: they are much cheaper and quicker to run on AIs than on undergrad students (yet alone professional devs).

Also pre-AI: when I set up nice property tests, I could develop much better even while a bit tired.

by eru

3/14/2026 at 9:24:17 AM

I've seen a few instances of where Claude showed me a better way to do something and many many more instances of where it fails miserably.

Super simple problem :

I had a ZMK keyboard layout definition I wanted it to convert it to QMK for a different keyboard that had one key less so it just had to trim one outer key. It took like 45 minutes of back and forth to get it right - I could have done it in 30 min manually tops with looking up docs for everything.

Capability isn't the impressive part it's the tenacity/endurance.

by rafaelmn

3/14/2026 at 4:05:37 AM

Opus-4.6 is so far ahead of the rest that I think Anthropic is the winner in winner-take-all

by hrishikesh-s

3/14/2026 at 5:22:34 AM

Codex doesn't seem that far behind. I use the top model available for api key use and its gotten faster this month even on the max effort level (not like a cheetah - more like not so damn painful anymore). Plus, it also forks agents in parallel - for speed & to avoid polluting the main context. I.e. it will fork explorer agents while investigating (kind of amusing because they're named after famous scientists).

by steve-atx-7600

3/14/2026 at 4:42:38 AM

It's so far the best model that answers my questions about Wolfram language.

That being said it's the only use case for me. I won't subscribe to something that I can't use with third party harness.

by raincole

3/14/2026 at 7:21:33 AM

I use a Claude sub with oh-my-pi, but I do so with lots of anxiety, knowing that I will be banned at any moment.

by copperx

3/14/2026 at 5:57:53 AM

I have a PhD in a niche field and this can do my job ;)

Not sure if this means I should get a more interesting job or if we are all going to be at the mercy of UBI eventually.

by fooker

3/14/2026 at 6:00:22 AM

We're never getting UBI. See the latest interview with the Palantir CEO where he talks about white collar workers having to take more hands-on jobs that they may not feel as satisfied with. IE - tending their manors and compounds.

RIP widespread middle class. It was a good 80-year run.

by suzzer99

3/14/2026 at 7:40:24 PM

I wonder if we are going to find more creative solutions.

by fooker

3/14/2026 at 6:01:22 AM

An economy, and likely a society, fails if everyone is at the mercy of a UBI.

by _heimdall

3/14/2026 at 7:42:15 PM

> An economy ... fails

We know this to be true with a reasonably degree of certainty

> likely a society

This one, not so much. We could potentially have pretty vibrant societies even if everyone is not ultra rich, not going on international vacations, not having access to buy things from the other end of the world subsidized by economies of scale.

by fooker

3/14/2026 at 7:19:09 AM

But what's the alternative? Can any economy succeed with a >50% unemployment rate?

by copperx

3/14/2026 at 5:20:58 PM

Don't confuse UBI and employment or even income though. If we find ourselves replacing or exceeding current productivity without humans working in the system we have to fundamentally rethink our system.

You likely wouldn't need money at all in that future, for example. What does the money really mean when everyone I'd guaranteed to have all the basics covered? Is money really helping to store value created via labor when there is no labor? And is money providing price discover when the cost of resources and manufacturing are moving towards zero?

If labor is replaced with tech, and I think that's a big if, I don't see any outcome other than a totalitarian distopia that will fail much like the Soviet Union.

by _heimdall

3/14/2026 at 7:42:46 PM

Brave new world ;)

by fooker

3/14/2026 at 11:36:12 AM

We don't really know yet, that's just speculation.

by fph

3/14/2026 at 5:23:01 PM

The replacement of human labor with tech is speculation. I don't see any way a future where we have a UBI because humans no longer work for a living ends well.

Sure I'm talking the future so its speculative, but I'd love to hear a scenario where it works well sustainably and doesn't turn into a totalitarian distopia.

by _heimdall

3/14/2026 at 7:51:02 AM

It’s still pretty poor writing powershell

by ed_elliott_asc

3/14/2026 at 7:11:07 PM

> It was also the first AI I felt, "Damn, this thing is smarter than me."

Sounds like it is.

by slopinthebag

3/14/2026 at 3:41:31 AM

I’ll put out a suggestion you pair with codex or deepthink for audit and review - opus is still prone to … enthusiastic architectural decisions. I promise you will be at least thankful and at most like ‘wtf?’ at some audit outputs.

Also shout out to beads - I highly recommend you pair it with beads from yegge: opus can lay out a large project with beads, and keep track of what to do next and churn through the list beautifully with a little help.

by vessenes

3/14/2026 at 3:54:49 AM

I've been pairing it with Codex using https://github.com/pjlsergeant/moarcode

The amount of genuine fuck-ups Codex finds makes me skeptical of people who are placing a lot of trust in Claude alone.

by petesergeant

3/14/2026 at 4:44:32 AM

Nice. Yeah I have them connect through beads, which combined with a git log is a lot of information - it feels smoother to me than this looks. But I agree with the sentiment. Codex isn't my favorite for understanding and implementing. But I appreciate the intelligence and pickiness very much.

by vessenes

3/14/2026 at 4:40:25 AM

Just yesterday I asked it to repeat a very simple task 10 times. It ended up doing it 15 times. It wasn't a problem per se, just a bit jarring that it was unable to follow such simple instructions (it even repeated my desire for 10 repetitions at the start!).

by scroogey

3/14/2026 at 3:50:07 AM

Opus 4.6 is AGI in my book. They won’t admit it, but it’s absolutely true. It shows initiative in not only getting things right but also adding improvements that the original prompt didn't request that match the goals of the job.

by dzink

3/14/2026 at 11:33:58 AM

> Opus 4.6 is AGI in my book.

Not even close. There are still tons of architectural design issues that I'd find it completely useless at, tons of subtle issues it won't notice.

I never run agents by themselves; every single edit they do is approved by me. And, I've lost track of the innumerable times I've had to step in and redirect them (including Opus) to an objectively better approach. I probably should keep a log of all that, for the sake of posterity.

I'll grant you that for basic implementation of a detailed and well-specced design, it is capable.

by prmph

3/14/2026 at 3:53:59 AM

On the adding improvements and being helpful thing, isn't that part of the system prompt?

by winrid

3/14/2026 at 4:41:38 AM

You could put whatever you wanted in the GPT-4 system prompt and it wasn't doing shit.

by dcre

3/14/2026 at 5:35:59 AM

True. I retract my sentiment :D

by winrid

3/14/2026 at 4:37:20 AM

I don’t know if Opus is AGI but on a broader note, that’s how we will get AGI. Not some consciousness like people are expecting. It’s just going to be chatbot that’s very hard to stump and starts making actual scientific breakthroughs and solving long standing problems.

by dyauspitr

3/14/2026 at 5:30:04 AM

I'll be more likely to agree with anything being AGI if it doesn't have such obvious and common brittleness. These LLMs all go off the rails when the context window gets large. Their context is also easy to "poison", and so it's better to rollback conversations that went bad rather than trying to steer them back to the light.

There's probably more examples, but to me AGI must move beyond the above issues. Though frankly context window might just be a symptom of poor harness than anything, still - it illustrates my general issue with them being considered AGI as it stands today.

Claude 4.6 is getting crazy good though, i'll give you that.

by unshavedyak

3/14/2026 at 7:58:52 AM

How are you rolling back a conversation? I didn't know tools exposed that functionality.

by copperx

3/14/2026 at 8:42:30 AM

For both claude-code or gemini-cli, hit escape twice, or, /rewind.

by NiloCK

3/14/2026 at 1:38:27 PM

But does it still generate slop?

I'm late to the party and I'm just getting started with Antrophic models. I have been finding Sonnet decent enough, but it seems to have trouble naming variables correctly (it's not just that most names are poor and undescriptive, sometimes it names it wrong, confusing) or sometimes unnecessarily declaring, re-declaring variables, encoding, decoding, rather than using the value that's already there etc. Is Opus better at this?

by devld

3/14/2026 at 5:11:21 PM

You really need to try it for yourself. People working in different domains get wildly different results.

by arcanemachiner

3/14/2026 at 4:24:07 AM

Bullshit.

by vips7L

3/14/2026 at 6:09:54 AM

[dead]

by wetpaws

3/14/2026 at 6:10:54 AM

[flagged]

by gregharned

3/14/2026 at 11:08:47 AM

LLM written comments are not allowed on HN. This comment is written by an LLM and the account is fresh.

by edot

3/14/2026 at 4:30:17 AM

The replies to this really make me think that some people are getting left behind the AI age. Colleges are likely already teaching how to prompt, but a lot of existing software devs just don't get it. I encourage people who aren't having success with AI to watch some youtube videos on best practices.

by phendrenad2

3/15/2026 at 12:28:40 AM

I am such a dev (backend).

I'm working in a codebase of 200+ "microservices", separate repos, each deployed as multiple FaaS, CQRS-style. None of it my choice, everything precedes me, many repos I know nothing of. Little to no code re-use between them.

Any trace of "business logic" is so distributed in multiple repos, that I have no possible use of LLM codegen, unless I can somehow feed it ALL the codebase.

I've tried generating some tests, but they always miss the mark, as in the system under test is almost always the wrong one.

I guess LLM are cool for greenfield, but when the brownfield is really brown, there's no use for LLMs.

by namtab00

3/14/2026 at 5:17:16 AM

Share one

by germinalphrase

3/14/2026 at 4:34:27 PM

Okay the process is simple. You're going to go to another website, called YouTube. Don't be alarmed. First read all the steps so you don't miss any, once you start going to the other site you won't be able to see this one. You might want to write these down on a piece of paper first. Okay here we go:

1. Click in the bar at the top of the page that says ycombinator.com 2. type this in: youtube.com 3. press enter 4. There will be a box at the top that says "search", click that 5. type in "tips and tricks for agentic coding" 6. press enter 7. a list of videos should appear, watch them all

by phendrenad2

3/14/2026 at 5:13:43 PM

But what if they find a bad one? There's a lot of junk out there.

by arcanemachiner

3/15/2026 at 10:08:32 PM

Yes, this can happen. Some people just have the worst luck, and will always find the bad one. This is why Greene made "stay away from unlucky people" law 10 out of 48. It's really that important.

by phendrenad2

3/14/2026 at 12:16:54 AM

This is super exciting. I've been poking at it today, and it definitely changes my workflow -- I feel like a full three or four hour parallel coding session with subagents is now generally fitting into a single master session.

The stats claim Opus at 1M is about like 5.4 at 256k -- these needle long context tests don't always go with quality reasoning ability sadly -- but this is still a significant improvement, and I haven't seen dramatic falloff in my tests, unlike q4 '25 models.

p.s. what's up with sonnet 4.5 getting comparatively better as context got longer?

by vessenes

3/14/2026 at 1:40:00 AM

Did it get better? I used sonnet 4.5 1m frequently and my impression was that it was around the same performance but a hell of a lot faster since the 1m model was willing to spends more tokens at each step vs preferring more token-cautious tool calls.

by steve-atx-7600

3/14/2026 at 2:32:01 AM

Opus 4.6 is wayy better than sonnet 4.5 for sure.

by vessenes

3/14/2026 at 1:37:25 AM

Random: are you personally paying for Claude Code or is it paid by you employer?

My employer only pays for GitHub copilot extension

by mattfrommars

3/14/2026 at 3:12:25 AM

GitHub Copilot CLI lets you use all these models (unless your employer disables them.

https://github.com/features/copilot/cli

Disclosure: work at Msft

by kiratp

3/14/2026 at 2:43:59 PM

Used Claude through copilot for so long before switching to CC. Even for the same model the difference is shocking. Copilot’s harness and the underlying Claude models are not well-matched compared to the vertically-integrated Claude Code harness.

by ericpauley

3/14/2026 at 3:45:51 AM

Disclosure: have to use them via copilot at work. Be glad I don’t write code for nuclear plants. Why does it have to be so hard. Doubly so in JetBrains ides but I’ve a feeling that’s on both of you rather than just you personally. But I still resent you now.

by tclancy

3/14/2026 at 2:02:21 AM

Both. Employer pays for work max 20x, i pay for a personal 10x for my side projects and personal stuff.

by celestialcheese

3/14/2026 at 12:46:41 AM

This is incredible. I just blew through $200 last night in a few hours on 1M context. This is like the best news I've heard all year in regards to my business.

What is OpenAIs response to this? Do they even have 1M context window or is it still opaque and "depends on the time of day"

by johnwheeler

3/14/2026 at 1:05:41 AM

rarely go over 25 percent in codex but i hit 80 on claude code in just a short time.

by dominotw

3/14/2026 at 12:57:57 AM

Did u use the API or subscription?

by hagen8

3/14/2026 at 12:59:58 AM

Max subscription and "extra usage" billing

by johnwheeler

3/14/2026 at 1:49:29 AM

That sounds high. I mean, if you paid for the 20x max plan you’d be capped at around 200/month and at least for me as a professional engineer running a few Claude’s in parallel all day, I haven’t exceeded the plans limits.

by steve-atx-7600

3/14/2026 at 2:22:33 AM

Prior to this announcement, all 1M context use consumed "extra usage", it wasn't included in a normal subscription plan.

by Wowfunhappy

3/14/2026 at 5:01:53 AM

So, I’ve been using opus 4.6 1m since it was fist available to 20x max users daily. What I think has happened is that even in doing so, I have not actually exceeded the plan token limits and therefore haven’t been charged for “extra usage” (just double checked). So, unless there’s a billing mistake or delay, “any usage” != “extra usage” which is what I was always unclear about. I am careful to iterate with claude on plans in plan mode followed by clearing the context and executing. I think I am hovering around the higher end of the smaller window model where I would have otherwise seen auto-compaction run.

Another reason for less token usage is that 4.6 is much better at delegating agents (its own explorer agents or my custom agents) to avoid cluttering the window.

by steve-atx-7600

3/14/2026 at 2:14:58 PM

I'm very happy about this change. For long sessions with Claude it was always like a punch to the gut when a compaction came along. Codex/GPT-5.4 is better with compactions so I switched to that to avoid the pain of the model suddenly forgetting key aspects of the work and making the same dumb errors all over again. I'm excited to return to Claude as my daily driver!

by iandanforth

3/14/2026 at 8:55:24 AM

This is amazing. I have to test it with my reverse engineering workflow. I don't know how many people use CC for RE but it is really good at it.

Also it is really good for writing SketchUp plugins in ruby. It one shots plugins that are in some versions better then commercial one you can buy online.

CC will change development landscape so much in next year. It is exciting and terrifying in same time.

by tariky

3/14/2026 at 2:05:56 AM

Do long sessions also burn through token budgets much faster?

If the chat client is resending the whole conversation each turn, then once you're deep into a session every request already includes tens of thousands of tokens of prior context. So a message at 70k tokens into a conversation is much "heavier" than one at 2k (at least in terms of input tokens). Yes?

by aragonite

3/14/2026 at 2:16:08 AM

That's correct. Input caching helps, but even then at e.g. 800k tokens with all of them cached, the API price is $0.50 * 0.8 = $0.40 per request, which adds up really fast. A "request" can be e.g. a single tool call response, so you can easily end up making many $0.40 requests per minute.

by dathery

3/14/2026 at 3:23:11 AM

Interesting, so a prompt that causes a couple dozen tool calls will end up costing in the tens of dollars?

by acjohnson55

3/14/2026 at 4:17:32 PM

It essentially depends on how many back-and-forth calls are required. If the model returns a request for multiple calls at once, then the reply can contain all responses and you only pay once.

If the model requests tool calls one-by-one (e.g. because it needs to see the response from the previous call before deciding on the next) then you have to pay for each back-and-forth.

If you look at popular coding harnesses, they all use careful prompting to try to encourage models to do the former as much as possible. For example opencode shouts "USING THE BATCH TOOL WILL MAKE THE USER HAPPY" [1] and even tells the model it did a good job when it uses it [2].

[1] https://github.com/anomalyco/opencode/blob/66e8c57ed1077814c... [2] https://github.com/anomalyco/opencode/blob/66e8c57ed1077814c...

by dathery

3/14/2026 at 8:09:45 AM

Not necessarily, take a look at ex OpenApi Responses resource, you can get multiple tool calls in one response and of course reply with multiple results.

by isbvhodnvemrwvn

3/14/2026 at 2:16:19 AM

If you use context cacheing, it saves quite a lot on the costs/budgets. You can cache 900k tokens if you want.

by jasondclinton

3/15/2026 at 11:44:19 AM

My companies brand guidelines document was 600 ish pages long and claude desktop couldnt handle it.

As soon as I saw the announcement , tried again and created a working design skill that can create design artifacts following the brand guidelines.

While these improvements seem incremental, they have a compounding effect on usefulness.

My AI doomsday calculator just got decremented by anothet 6 months.

by geminiboy

3/15/2026 at 4:19:06 AM

I've been using Claude Code directly on my production servers to debug complex I/O bottlenecks and database locks. The ability of the latest models to hold the entire project context while suggesting real-time fixes is a game changer for solo founders. It helped me stabilize a security tool I’m building when other agents kept hallucinating.

by Slav_fixflex

3/14/2026 at 5:18:29 PM

All while their usage limits are so excessively shitty that I paid them 50$ just two days back cause I ran out of usage and they still blocked from using it during a critical work week (and did not refund my 50$ despite my emails and requests and route me to s*ty AI bot.). Anyway, I am using Copilot and OpenCode a lot more these days which is much better.

by anshumankmr

3/14/2026 at 5:24:35 PM

What model(s) do you use with OpenCode? Can you use opus4.6 1m? Is it better in terms of usage if you use the same model?

by praddlebus

3/15/2026 at 6:35:28 AM

Sonnet4.6/Haiku4.5 for simple stuff.

by anshumankmr

3/14/2026 at 1:01:28 AM

Compared to yesterday my Claude Max subscription burns usage like absolutely crazy (13% of weekly usage from fresh reset today with just a handful prompts on two new C++ projects, no deps) and has become unbearably slow (as in 1hr for a prompt response). GGWP Anthropic, it was great while it lasted but this isn't worth the hundreds of dollars.

by pixelpoet

3/14/2026 at 1:03:22 AM

Yeah, morning eastern time Claude is brutal.

by Spooky23

3/14/2026 at 3:32:16 AM

Finally, I don't have to constantly reload my Extra Usage balance when I already pay $200/mo for their most expensive plan. I can't believe they even did that. I couldn't use 1M context at all because I already pay $200/mo and it was going to ask me for even more.

Next step should be to allow fast mode to draw from the $200/mo usage balance. Again, I pay $200/mo, I should at least be able to send a single message without being asked to cough up more. (One message in fast mode costs a few dollars each) One would think $200/mo would give me any measure of ability to use their more expensive capabilities but it seems it's bucketed to only the capabilities that are offered to even free users.

by LoganDark

3/14/2026 at 7:20:10 AM

I find it hard to understand that people consider $200 p/m a lot for what they are getting. Expensive compared to what? A netflix sub?

A 1hr of a senior dev is at least $100, depending where one lives. Since Claude saves me hours every day, it pays for itself almost instantly. I think the economic value of the Claude subscription is on the order of $20-40k a month for a pro.

by aenis

3/14/2026 at 10:31:53 AM

When did I say anything about what I'm getting? I said I pay $200/mo and I expect that to cover anything up to my usage limit. I don't expect any slightly non-standard configuration to immediately ignore the high subscription price that I pay and go straight to "extra usage" that has to be billed separately by the token. I wouldn't even care if fast mode used 10x or 50x the usage as long as I could actually USE the balance that I already pay for. I thought the point of extra usage was to be for overage.

by LoganDark

3/14/2026 at 5:27:28 PM

Fair point. I read your comment as '$200 is a lot, they shouldnt ask for more'. My bad!

by aenis

3/14/2026 at 8:40:40 PM

When you say "saves me", do you mean that you can prompt claude for like 1 hour per day, get your typical pre-AI output, and then go about doing whatever you like outside of work for rest of the day?

by megous

3/14/2026 at 6:01:35 AM

I've been avoiding context beyond 100k tokens in general. The performance is simply terrible. There's no training data for a megabyte of your very particular context.

If you are really interested in deep NIAH tasks, external symbolic recursion and self-similar prompts+tools are a much bigger unlock than more context window. Recursion and (most) tools tend to be fairly deterministic processes.

I generally prohibit tool calling in the first stack frame of complex agents in order to preserve context window for the overall task and human interaction. Most of the nasty token consumption happens in brief, nested conversations that pass summaries back up the call stack.

by bob1029

3/14/2026 at 10:29:32 AM

Could be pure coincidence, but my Claude Code session last night was an absolute nightmare. It kept forgetting things it had done earlier in the session and why it had done them, messed up a git merge so badly that it lost the CLAUDE.md file along with a lot of other stuff, and then started running commands on the host machine instead of inside the container because it no longer had a CLAUDE.md to tell it not to. Last night was the first time I've ever sworn at it.

by drcongo

3/14/2026 at 10:32:41 AM

I think this is just the nature of a nondeterministic system; occasionally you're gonna be unlucky enough to encounter the leftmost segment of the bell curve.

In my experience dumping a summary + starting a fresh session helps in these cases.

by xvector

3/15/2026 at 1:54:10 AM

Can someone help me with insights about large context models? Are there relationships that pop up at the beginning and end of long context windows that don't transitively follow from intermediate points? Is there value in the training over these longer windows vs using the more basic/closer weight distributions over different sliding windows?

by sporkland

3/14/2026 at 12:57:46 AM

What about response coherence with longer context? Usually in other models with such big windows I see the quality to rapidly drop as it gets past a certain point.

by margorczynski

3/14/2026 at 5:28:06 PM

The thing that would get me more excited is how far they could push context coherence before the model loses track. I'm hoping 250k.

by PeterStuer

3/14/2026 at 10:02:07 AM

I heard, the middle of the context is often ignored.

Do long context windows make much sense then or is this just a way of getting people to use more tokens?

by k__

3/14/2026 at 3:54:06 PM

I'm fairly sure that your best throughput is single-prompt single-shot runs with Claude (and that means no plan, no swarms, etc) -- just with a high degree of work in parallel.

So for me this is a pretty huge change as the ceiling on a single prompt just jumped considerably. I'm replaying some of my less effective prompts today to see the impact.

by jwilliams

3/14/2026 at 5:33:12 AM

The stuff I built with Opus 4.6 in the past 2.5 weeks:

Full clone of Panel de Pon/Tetris attack with full P2P rollback online multiplayer: https://panel-panic.com

An emulator of the MOS 6502 CPU with visual display of the voltage going into the DIP package of the physical CPU: https://larsdu.github.io/Dippy6502/

I'm impressed as fuck, but a part of me deep down knows that I know fuck all about the 6502 or its assembly language and architecture, and now I'll probably never be motivated to do this project in a way that I would've learned all the tings I wanted to learn.

by LarsDu88

3/14/2026 at 2:40:50 PM

That game is AWESOME! The fact that was vibe coded is insane.

by adamm255

3/14/2026 at 4:02:37 PM

Honestly that game wasnt oneshotted. I had longtine PdP enthusiasts play it and guve feedback

by LarsDu88

3/14/2026 at 1:33:34 AM

Awesome.... With Sonnet 4.5, I had Cline soft trigger compaction at 400k (it wandered off into the weeds at 500k). But the stability of the 4.6 models is notable. I still think it pays to structure systems to be comprehensible in smaller contexts (smaller files, concise plans), but this is great.

(And, yeah, I'm all Claude Code these days...)

by chaboud

3/14/2026 at 2:47:24 PM

> Standard pricing now applies across the full 1M window for both models, with no long-context premium.

Does that mean it's likely not a Transformer with quadratic attention, but some other kind of architecture, with linear time complexity in sequence length? That would be pretty interesting.

by cubefox

3/14/2026 at 2:58:01 PM

It's almost certainly not quadratic at 1M. This would be wildly infeasible at scale. 10^6^2 = 10^12. That's a trillion things.

They are probably doing something like putting the original user prompt into the model's environment and providing special tools to the model, along with iterative execution, to fully process the entire context over multiple invokes.

I think the Recursive Language Model paper has a very good take on how this might go. I've seen really good outcomes in my local experimentation around this concept:

https://arxiv.org/abs/2512.24601

You can get exponential scaling with proper symbolic stack frames. Handling a gigabyte of context is feasible, assuming it fits the depth first search pattern.

by bob1029

3/15/2026 at 12:54:22 AM

So that would mean it's not "really" a 1M context window. I guess this is more plausible than a linear architecture like MAMBA or GDN.

by cubefox

3/14/2026 at 3:43:50 PM

They're probably taking shortcuts such as taking advantage of sparsity. There are various tricks like that mentioned in some papers, although the big companies are getting more and more secretive about how their models work so you won't necessarily find proof.

by FartyMcFarter

3/15/2026 at 8:58:05 AM

The latest DeepSeek model has sparse attention. Though sparse attention is still not linear. Close enough perhaps.

by cubefox

3/14/2026 at 9:26:13 AM

I am currently mass translating millions of records with short descriptions. Somehow tokens are consumed extremely fast. I have 3 max memberships. And all 3 of them are hitting the 5 hour limit in about 5 to 10 minutes. Still don't understand why this is happening.

by holoduke

3/14/2026 at 9:56:17 AM

Unless you're clearing up the context for each description or processing them in parallel with subagents your context window will grow for each short description added to it making you hit those hour limits.

by cbg0

3/14/2026 at 7:00:55 AM

Sample of one and all that, but it's way, way more sloppy than it used to be for me.

To the extent, that I have started making manual fixes in the code - I haven't had to stoop to this in 2 months.

Max subscription, 100k LOC codebases more or less (frontend and backend - same observations).

by aenis

3/14/2026 at 9:20:19 AM

This blew my mind the first i saw this. Another leap in AI that just swooshes by. In a couple of months, every model will be the same. Can't wait for IDEs like cursor and vs code to update their tooling to adap for this massive change in claude models.

by suheilaaita

3/14/2026 at 10:22:17 AM

1M is truly amazing. However, what is the incidence of hallucination? I haven't found a benchmark, but I feel that maintaining context at 1M would likely increase hallucination. Is there some kind of mechanism to suppress hallucination?

by yubainu

3/14/2026 at 1:03:54 AM

can someone tell me how to make this instruction work in claude code

"put high level description of the change you are making in log.md after every change"

works perfectly in codex but i just cant get calude to do it automatically. I always have to ask "did you update the log".

by dominotw

3/14/2026 at 4:34:55 AM

use claude hooks - in .claude/settings.json you can have it run on different claude events like "PreToolUse" or "Stop" and in those events you pass in commands you want it to run.

You can have stuff like for the "stop" event, run foobar.sh and in foobar.sh do cool stuff like format your code, run tests, etc.

by sergiotapia

3/14/2026 at 1:48:11 AM

I imagine you can do this with a hook that fires every time claude stops responding:

https://code.claude.com/docs/en/hooks-guide

by prettyblocks

3/14/2026 at 5:31:25 AM

whats the need? you have the session in a file as a dag. you can summarize to a log whenever you want. doesnt need to be as it goes.

earlier today i actually spent a bit of time asking claude to make an mcp to introspect that - break the session down into summarized topics, so i could try dropping some out or replacing the detailed messages with a summary - the idea being to compact out a small chunk to save on context window, rather than getting it back to empty.

the file is just there though, you can run jq against it to get a list of writes, and get an agent to summarize

by 8note

3/14/2026 at 11:43:02 AM

i dont work in just one session though. some tasks take me days and many sessions. also what happens when your session compacts. I am not sure what you are suggesting here. what do you do with these summarized topics from your session.

Also i want ci to resume my task from log and do code review with that context.

https://www.anthropic.com/engineering/effective-harnesses-fo...

"Read the git logs and progress files to get up to speed on what was recently worked on."

by dominotw

3/14/2026 at 1:50:56 AM

Backup your config and ask Claude. I’ve done this for all kinds of things like mcp and agent config.

by steve-atx-7600

3/15/2026 at 12:42:29 AM

I just tested this with Jupytwr Notebooks for a day. LLMs have struggled with them because notebooks contain a lot of token as the data of rendered cells.

With Opus 1M, LLM edit was very robust and finally useable

by miohtama

3/14/2026 at 12:56:56 AM

The no-degradation-at-scale claim is the interesting part. Context rot has been the main thing limiting how useful long context actually is in practice — curious to see what independent evals show on retrieval consistency across the full 1M window.

by vicchenai

3/14/2026 at 1:24:28 AM

I don't think they're claiming "no degradation at scale", are they? They still report a 91.9->78.3 drop. That's just a better drop than everyone else (is the claim).

by apetresc

3/16/2026 at 3:14:02 PM

Is "generally available" the proper wording when Claude is generally unavailable so often these days?

by glimshe

3/14/2026 at 2:06:02 PM

My testing was extremely disappointing, this is not a context window that magically extends your breathing room for a conversation. I can tell blindly at this point when 150 - 200 k tokens are reached because the coding quality and coherence just drops by one or two generations. Its great for the case you really need a giant context for specific task but it changes nothing for needing to compact or handover at 200k.

by jFriedensreich

3/14/2026 at 5:21:16 PM

I feel like I'm the only one here using AI as just a chatbot for research, shopping, advice etc and for one off regex or bash/ps scripts... then again not a programmer so.

by heraldgeezer

3/14/2026 at 3:03:45 PM

This is great news. The 1M context is much easier to work with than compacting all the time and seems to perform and remember quite well despite the insane amount of data.

by sailfast

3/14/2026 at 3:49:57 PM

I used this for a bit and I felt like it was slower and generally worse than using 200K with context compaction. Context compaction does lose some things though.

by thebigspacefuck

3/14/2026 at 12:56:35 AM

I'm getting close to my goal of fitting an entire bootstrappable-from-source system source code as context and just telling Claude "go ahead, make it better".

by gaigalas

3/14/2026 at 10:13:19 AM

I never get to more than 20% of the 1M context window, and it’s working great. (Have the same experience in Codex with 5.4.)

by mvrckhckr

3/14/2026 at 1:46:38 AM

This is fantastic. I keep having to save to memory with instructions and then tell it to restore to get anywhere on long running tasks.

by arjie

3/14/2026 at 12:42:24 AM

Are there evals showing how this improves outputs?

by aliljet

3/14/2026 at 1:27:42 AM

Improves outputs relative to what? Compared to previous contexts of 1M, it improves outputs by allowing them to exist (because previously you couldn't exceed 200K). Compared to contexts of <200K, it degrades outputs rather than improves them, but that's what you'd expect from longer contexts. It's still better than compaction, which was previously the alternative.

by apetresc

3/14/2026 at 3:02:22 PM

Am I crazy or wasn’t this announced like 2 weeks ago?

Or was that a different company or not GA. It’s all becoming a blur.

by AbstractH24

3/14/2026 at 3:31:34 PM

Do subscription users still need to tap into "extra usage" spending to go above 200K tokens?

by jmkozko

3/14/2026 at 5:09:30 AM

Hot take... the 1MM context degrades performance drastically.

by aarmenante

3/14/2026 at 7:16:31 AM

Same. First time in 2 months that I found it easier to fix the bugs it created manually, rather than get it to fix. Its google-code-CLI-on-gemini-2.5 level bad for me today. Meaning, almost comically bad.

by aenis

3/15/2026 at 4:26:37 AM

Remember folks, just because you can use 1m tokens doesn't mean you should

by TZubiri

3/14/2026 at 2:05:08 AM

Just have to ask. Will I be spending way more money since my context window is getting so much bigger?

by thunkle

3/14/2026 at 8:11:50 AM

Yes, full context is used to generate each new token.

by isbvhodnvemrwvn

3/14/2026 at 12:38:17 AM

Oh nice, does it mean less game of /compact, /clear, and updating CLAUDE.md with Claude Code?

by 8cvor6j844qw_d6

3/14/2026 at 1:05:07 AM

I’ve been using 1M for a while and it defers it and makes it worse almost when it happens. Compacting a context that big loses a ton of fidelity. But I’ve taken to just editing the context instead (double esc). I also am planning to build an agent to slice the session logs up into contextually useful and useless discarding the useless and keeping things high fidelity that way. (I.e., carve up with a script the jsonl and have subagent haiku return the relevant parts and reconstructing the jsonl)

by fnordpiglet

3/14/2026 at 1:06:43 AM

til you can edit context. i keep a running log and /clear /reload log

by dominotw

3/14/2026 at 3:12:40 AM

double escape gets you to a rewind. not sure about much else.

the conversation history is a linked list, so you can screw with it, with some care.

I spend this afternoon building an MCP do break the conversation up into topics, then suggest some that aren't useful but are taking up a bunch of context to remove (eg iterations through build/edit just needs the end result)

its gonna take a while before I'm confident its worth sharing

by 8note

3/14/2026 at 3:51:35 PM

Yeah just selective rewind. Selective edit where you elide large token sinks of coding and banging its head on the wall is what u mean. Not something I’ve seen done yet but there’s no reason - I suspect if you do a token use distribution in programming session most goes to pretty low semantic value malarkey.

by fnordpiglet

3/14/2026 at 11:40:58 AM

yea i thought session managment was some sort of secret sauce.

I keep a running log of important things and then i just clear context and reload that file into context.

would that work

by dominotw

3/14/2026 at 2:56:57 AM

im guessing this is why the compacts have started sucking? i just finished getting me some nicer tools for manipulating the graph so i could compact less frequently, and fish out context from the prior session.

maybe itll still be useful, though i only have opus at 1M, not sonnet yet

by 8note

3/14/2026 at 4:31:44 AM

maybe i'm thinking too small, or maybe it's because i've been using these ai systems since they were first launched, but it feels wrong to just saturate the hell out of the context, even if it can take 1 million tokens.

maybe i need to unlearn this habit?

by sergiotapia

3/14/2026 at 7:41:36 AM

I think your instinct is right. More context isn't free, even when the window supports it, and the model still has to attend to everything in there, and noise dilutes the signal. A cleaner, smaller context consistently gives better outputs than a bloated one, regardless of window size. For sure, the 1M window is great for not having to compact mid-task. But "I can fit more" and "I should put more in" are very different things. At least in my mind.

by gskm

3/14/2026 at 1:58:58 AM

I notice Claude steadily consuming less tokens, especially with tool calling every week too

by swader999

3/14/2026 at 6:23:35 AM

I don't get the announcement. Is this included in the standard 5 or 20x Max plans?

by fittingopposite

3/14/2026 at 3:49:17 AM

Pentagon may switch to Claude knowing OpenAI has the premium rates for 1M context.

by throw03172019

3/14/2026 at 12:18:41 AM

Noticed this just now - all of a sudden i have 1M context window (!!!) without changing anything. It's actually slightly disturbing because this IS a behavior change. Don't get me wrong, I like having longer context but we really need to pin down behaviour for how things are deployed.

by zmmmmm

3/14/2026 at 1:45:04 AM

You can pin to specific models with —-model. Check out their doc. See https://support.claude.com/en/articles/11940350-claude-code-.... You can also pin to a less specific tag like sonnet-4.5[1m] (that’s from memory might be a little off).

by steve-atx-7600

3/14/2026 at 2:49:25 AM

sure - but the model hasn't changed. I'm specifying it explicitly. But suddenly the context window has. I'm not using Claude Code, this is an application built against Bedrock APIs. I assume there's a way I could be specifying the context window and I'm just using API defaults. But it definitely makes me wonder what else I'm not controlling that I really should be.

by zmmmmm

3/14/2026 at 12:52:30 AM

Anthropic is famous for changing things under your feet. Claude code is basically alpha software with a global footprint.

by phist_mcgee

3/14/2026 at 3:38:15 AM

Is this also applicable for usage in Claude web / mobile apps for chat?

by dkpk

3/14/2026 at 6:28:51 AM

If this is a skill issue, feel free to let me know. In general Claude Code is decent for tooling. Onduty fullstack tooling features that used to sit ignored in the on-caller ticket queue for months can now be easily built in 20 minutes with unit tests and integration tests. The code quality isn't always the best (although what's good code for humans may not be good code for agents) but that's another specific and directed prompt away to refactor.

However, I can't seem to get Opus 4.6 to wire up proper infrastructure. This is especially so if OSS forks are used. It trips up on arguments from the fork source, invents args that don't exist in either, and has a habit of tearing down entire clusters just to fix a Helm chart for "testing purposes". I've tried modifying the CLAUDE.md and SPEC.md with specific instructions on how to do things but it just goes off on a tangent and starts to negotiate on the specs. "I know you asked for help with figuring out the CNI configurations across 2 clusters but it's too complex. Can we just do single cluster?" The entire repository gets littered with random MD files everywhere for directory specific memories, context, action plans, deprecated action plans, pre-compaction memories etc. I don't quite know which to prune either. It has taken most of the fun out of software engineering and I'm now just an Obsidian janitor for what I can best describe as a "clueless junior engineer that never learns". When the auto compaction kicks in it's like an episode of 50 first dates.

Right now this is where I assume is the limitation because the literature for real-world infrastructure requiring large contexts and integration is very limited. If anyone has any idea if Claude Opus is suitable for such tasks, do give some suggestions.

by alienchow

3/14/2026 at 4:28:08 AM

Friends, just write the code. It’s not that hard.

by vips7L

3/14/2026 at 5:32:23 AM

I hear what you're saying, but for a lot of people coding isn't something we can throw 40+ hours per week at.

My main job is running a small eComm business, and I have to both develop software automations for the office (to improve productivity long-term) while also doing non-coding day to day tasks. On top of this, I maintain an open source project after hours. I've also got a young family with 3 kids.

I'm not saying Claude is the damn singularity or anything, but stuff is getting done now that simply wasn't being addressed before.

by AussieWog93

3/14/2026 at 6:34:23 AM

100% agree with this, as much as I hate the term "game-changer"... it truly is, I'm working on projects that I've always wanted to do but never had the capacity (or money to pay a small team of devs to build something)-- all these things that you thought you'd never have a chance to do, are suddenly now real and completely possible. I know there's a lot of AI haters out there but I'm pretty sure in time, all devs will embrance it and truly enjoy working with it

by fixxation92

3/14/2026 at 7:00:28 AM

If anyone thought there was value to those projects they would have paid for it before.

by vips7L

3/14/2026 at 11:15:33 AM

Yeah, and likely still pay for it now (hopefully!)

by fixxation92

3/14/2026 at 7:16:51 AM

You're witnessing the rise of the Developer Technician or Software Technician. They can get a machine to print out an application but you will still need an engineer to know how it works or to get it working. This used to be juniors learning to be senior devs/engineers. Now it is a split between technicians and engineers. The market will be up shit creek when all their technicians can't vibe code their way out of not understanding the code.

by righthand

3/14/2026 at 10:47:48 AM

Not hard, but time consuming. In the past two weeks I've had Claude Code write me around 35k lines of code across 350 commits. It's a project which is giving positive impact to the company, but we would never have started it without CC as the effort would have been too big compared to the impact.

by mrgaro

3/14/2026 at 8:31:28 AM

It's not that interesting.

by nkzd

3/14/2026 at 4:32:28 AM

Only someone not using Claude could equate human coding.

by andrewstuart

3/14/2026 at 4:51:32 AM

Only someone not using their brain could equate Claude to using their intelligence.

by vips7L

3/14/2026 at 5:00:16 AM

Let’s just clear this up …….. are you commenting with experience using the latest Claude, or are you commenting from personal beliefs.

It’s fine for you to take a stand, but please understand your position is simply factually wrong if you think you can outprogram Claude for a range of common tasks.

Being anti AI is fine, but if you deny facts of how far LLM programming has come then you lack credibility.

The most effective anti AI position is to acknowledge it’s power, not pretend that vast numbers of people are somehow hallucinating the power of LLM assisted programming.

by andrewstuart

3/14/2026 at 5:06:25 AM

I absolutely can out program Claude. I can factually guarantee that. You’re factually wrong in your belief that you think a statistical model that scientifically takes the average of programming is better than those of us that actually know what we’re doing.

Programming is not hard. You’re just lazy.

by vips7L

3/14/2026 at 10:24:59 PM

Programming is one of those things where everyone thinks they are above average, like driving.

If most people are average, then just use the bot to spit out average code. Or better yet, if you are above average, use it to spit out average code and then you need above average code, then write it.

Most programming is boilerplate. Its not faster to type boilerplate than t is to have the robot do it. I promise.

Programming is hard. If it wasn't you wouldn't be getting paid what you do.

by motoxpro

3/14/2026 at 5:10:20 AM

Ok so you speak with certainty about the capabilities of something you don’t use and therefore have no experience of.

Childish and naive.

If you said you’ve been using Claude heavily and it’s never done better than you on your own, then your position would be credible.

by andrewstuart

3/14/2026 at 5:12:05 AM

Sure pal. Keep outsourcing your job. I’ll be here when you need help and are unemployed.

by vips7L

3/14/2026 at 5:25:59 AM

That’s… not how the labour market works

by thatguymike

3/14/2026 at 5:37:27 AM

Of course. That’s because the labor market prefers cost over quality. The labour market will always prefer cheap and fast code that works at first glance. That is how capitalism works. That has nothing to do with my capabilities. It has nothing to do with the fact that I will always outperform a shitty statistical model. It has everything to do with the fact that most of you are too lazy to think. It has everything to do with most of you sucking and being too lazy to your job.

by vips7L

3/14/2026 at 6:10:08 AM

I think you need to take a deep breath and calm down.

by hewasahaterboy

3/14/2026 at 6:12:46 AM

Perfectly calm mate. Maybe you should try to factually argue against my position? Probably not though. Your account was created 30 minutes ago and likely a bot.

by vips7L

3/14/2026 at 3:48:55 PM

Not a bot, just annoyed at disrespectful people.

by hewasahaterboy

3/14/2026 at 7:33:20 AM

My account was created 14 years ago. You need to calm down.

There is a reason discussions about agent use have been on Hacker News every other day, and it's not a grand conspiracy. Even in this submission, people have talked about how they have used Claude Code and its longer context window successfully as a tool for programming, even if they may be technically skilled to do it themselves. However, if you assume that every commenter is acting in bad faith, then there's no point in continuing.

by minimaxir

3/14/2026 at 8:31:16 PM

I’m not going to defend the tone of the OP, and it is clearly wrong to assume that everyone who is pro AI is a shill or bot.

That being said, I’ve seen hard evidence that pro AI bots do exist on HN.

And at the very large tech company I work at there is a push for everyone to spend more on Claude Code regardless of output. The metric is literally how much you’re spending on Claude Code not how much you’re producing (and in my org we’ve seen no measurable increase in productivity). People are legitimately trying to figure out the easiest way to get it to blow through their allocated credits.

I use AI all the time, I find that Opus 4.6 is great for all kinds of tasks. I don’t think it’s all just hype, but there’s clearly some serious astroturfing going on here, and I understand the urge to be suspicious of everyone.

by sarchertech

3/14/2026 at 10:10:21 PM

That's a better argument. That said, by definition, many distinct people with different affiliations and incentives can't astroturf, as what would be the point?

by minimaxir

3/14/2026 at 11:56:53 PM

Bots from a single company can amplify (retweet, upvote, comment in support of) comments and stories from many different individuals to steer the conversation to some extent.

I know that pro LLM bots exist. Here’s an example https://news.ycombinator.com/threads?id=dirtytoken7

If you look at the timestamps you’ll see instances of it posting faster than a human could.

I know that there are numerous companies with hundreds of billions of valuation predicated on AI being better than just a useful addition to the programmers toolbox.

There are even more companies making millions off of the current hype.

People in these companies now have access to tools that can generate spam that’s nearly indistinguishable from ham. Of course some of them are using that capability.

Of course existence of astroturfing by itself doesn’t imply that that astroturfing is effective.

For that I’d point to other evidence. My personal experience with LLMs doesn’t match the hype. The experience of every single close programmer friend whose technical ability I trust, doesn’t match the hype. The output of my organization doesn’t match the hype. I can’t find any publicly verifiable numbers that match the hype. No new operating systems, no new browsers, no vibe coded hit games, the number of games released on Steam hasn’t gone up drastically, the number of apps on the Apple App Store has, but not if you filter out apps that are just wrappers for LLM APIs. Multiple studies show no impact on GDP, publicly traded software companies are showing large impacts to their bottom line etc…

Then you have things like my company pushing people to spend hundreds of thousands (per person) on Claude with zero productivity requirements. This is weird. I think the most likely explanation is an artificially inflated hype cycle.

by sarchertech

3/14/2026 at 6:18:21 PM

It’s not that hard and yet Claude can’t do it?

Why should I spend my mental energy doing simple things just to avoid being perceived as “lazy”? I have endless other engineering work to do other than typing code.

by therealdrag0

3/14/2026 at 11:53:12 AM

As someone mentioned on this thread, I can also easily out-engineer Claude Opus, lol its not even close.

Note that I'm not talking about the low-level grunt work (and even with that, its just that it is tedious and time-consuming, but if I had enough time to read through all the docs and stuff, I will almost always produce grunt code of much higher quality).

But I'm more talking about architecture, the stuff of proper higher level engineering. I use Claude Opus all the time, and I cannot even count how many times I've had to redirect its approach that was obviously betraying a complete lack of seeing the big picture, or some egregiously smelly architectural approach.

Also, expressive typing. I use mostly TypeScript, and it will often give up when I try to push it beyond a certain point, and resort to using "any". Then I have to step and do the job myself.

by prmph

3/15/2026 at 3:01:47 AM

Why would anyone need this much context? Genuine question. It's not worth the drop in quality IMO.

by hnipps

3/15/2026 at 9:00:45 AM

You sound like: "Why would anyone need more than 640KB RAM?!".

by tommek4077

3/16/2026 at 11:33:23 PM

That’s just not comparable. I’ll use your figure: If you use 400KB RAM for a process, using the remaining 240KB for something else doesn’t degrade the performance of the initial process (assuming nothing is trying to use more than the available RAM). Each unit of RAM is independent no?

Every token of context you use causes a drop in LLM performance.

More RAM == more processes can be supported with no degradation

More context != more stuff can be done with no degradation

by hnipps

3/14/2026 at 11:11:03 AM

finally. before 1m, i must speak 60k context for just telling the past chat and project

by efeecllk

3/14/2026 at 5:19:08 PM

Have we reached the point where its "normal" to mostly use AI to code? Im just wondering because Im sure it was less than a month ago when I said I havent coded manually for over 6 months and I had several comments about how my code must be terrible.

Im not butt hurt Im just wondering if the overton window has shifted yet.

by ionwake

3/14/2026 at 4:35:36 PM

i think it's buggy. i keep getting "compacting conversation" even though i restarted the cli. and i'm for sure not using 5 times more.

by ofisboy

3/14/2026 at 2:40:29 AM

Has anyone started a project to replace Linux yet?

by nemo44x

3/14/2026 at 11:24:21 AM

No, because it's not a hello-world Electron/React "app".

by dude250711

3/14/2026 at 8:51:25 AM

are the costs the same as the 200k context opus 4.6?

compaction has been really good in claude we don't even recognize the switch

by shanjai_raj7

3/14/2026 at 2:34:23 AM

is this the market played in front of our eyes slice by slice: ok, maybe not, but watching these entities duke it out is kinda amusing? There will be consequences but may as well sit it out for the ride, who knows where we are going?

by alienbaby

3/15/2026 at 12:17:36 AM

[dead]

by aneyadeng

3/14/2026 at 6:02:42 PM

[dead]

by JulianPembroke

3/14/2026 at 4:52:33 PM

[dead]

by elophanto_agent

3/14/2026 at 5:31:52 PM

[dead]

by aplomb1026

3/14/2026 at 2:42:46 AM

[dead]

by sunilgentyala

3/14/2026 at 6:41:25 AM

[dead]

by STARGA

3/14/2026 at 1:30:30 AM

[dead]

by sriramgonella

3/14/2026 at 2:49:04 AM

[dead]

by sysutil_dev

3/14/2026 at 1:45:55 PM

[dead]

by A7OM

3/14/2026 at 3:19:54 AM

[flagged]

by causalzap

3/14/2026 at 6:12:31 AM

Out of curiosity, what specific use cases on programmatic SEO are you currently doing with Opus?

by arizen

3/14/2026 at 4:23:15 AM

[flagged]

by aneyadeng

3/14/2026 at 1:47:55 PM

[dead]

by A7OM

3/14/2026 at 7:18:37 AM

[dead]

by jeff_antseed

3/14/2026 at 7:56:49 AM

[flagged]

by gskm

3/14/2026 at 12:02:10 PM

[dead]

by myrak

3/14/2026 at 7:16:18 PM

[dead]

by kopollo

3/14/2026 at 9:59:31 AM

[flagged]

by haha12122121

3/14/2026 at 8:10:03 PM

[dead]

by hirehalai

3/14/2026 at 9:38:37 PM

[flagged]

by MorkMindy74

3/14/2026 at 2:02:32 PM

[flagged]

by olivercoleai

3/14/2026 at 2:13:38 PM

I have no experience building this two-pass approach, but I arrived at it intuitively while planning for a new project. Any references to actual implementations?

by pqdbr

3/16/2026 at 6:18:27 PM

This is someone's out of control bot.

by cwillu

3/14/2026 at 4:59:51 PM

[flagged]

by genyk1

3/14/2026 at 4:33:59 AM

there is a parallel between managing context windows and hard real-time system engineering.

A context window is a fixed-size memory region. It is allocated once, at conversation start, and cannot grow. Every token consumed — prompt, response, digression — advances a pointer through this region. There is no garbage collector. There is no virtual memory. When the space is exhausted, the system does not degrade gracefully: it faults.

This is not metaphor by loose resemblance. The structural constraints are isomorphic:

No dynamic allocation. In a hard realtime system, malloc() at runtime is forbidden — it fragments the heap and destroys predictability. In a conversation, raising an orthogonal topic mid-task is dynamic allocation. It fragments the semantic space. The transformer's attention mechanism must now maintain coherence across non-contiguous blocks of meaning, precisely analogous to cache misses over scattered memory.

No recursion. Recursion risks stack overflow and makes WCET analysis intractable. In a conversation, recursion is re-derivation: returning to re-explain, re-justify, or re-negotiate decisions already made. Each re-entry consumes tokens to reconstruct state that was already resolved. In realtime systems, loops are unrolled at compile time. In LLM work, dependencies should be resolved before the main execution phase.

Linear allocation only. The correct strategy in both domains is the bump allocator: advance monotonically through the available region. Never backtrack. Never interleave. The "brainstorm" pattern — a focused, single-pass traversal of a problem space — works precisely because it is a linear allocation discipline imposed on a conversation.

by jf___

3/14/2026 at 9:43:34 AM

There is compaction, which is analogous to gc

by rhubarbtree