12/30/2025 at 9:11:43 PM
This is pretty recent - the survey they ran (99 respondents) was August 18 to September 23, 2025, and the field observations (watching developers for 45 minutes, then a 30-minute interview, 13 participants) were August 1 to October 3.

The models were mostly GPT-5 and Claude Sonnet 4. The study was too early to catch the 5.x Codex or Claude 4.5 models (bar one mention of Sonnet 4.5).
This is notable because a lot of academic papers take 6-12 months to come out, by which time the LLM space has often moved on by an entire model generation.
by simonw
12/31/2025 at 7:51:51 AM
> academic papers take 6-12 months to come out, by which time the LLM space has often moved on by an entire model generation.

This is a recurring argument which I don't understand. Doesn't it simply mean that whatever conclusions they drew were valid at the time? The research process is about approximating a better description of a phenomenon in order to understand it, not about providing a definitive answer. Being "an entire model generation" behind would matter if fundamental problems had been solved in the meantime, e.g. no more hallucinations, but if the changes are incremental then most likely the conclusions remain correct. Which fundamental change (I don't think labeling newer models as "better" is sufficient) do you believe invalidates their conclusions in this specific context?
by utopiah
12/31/2025 at 10:21:04 AM
2025 has been a wild year for agentic coding models. Cutting-edge models in January 2025 don't hold a candle to cutting-edge models in December 2025.

Just the jump from Sonnet 3.5 to 3.7 to 4.5, and then Opus 4.5, has been pretty massive in terms of holistic reasoning, deep knowledge, and procedural and architectural adherence.
GPT-5 Pro convinced me to pay $200/mo for an OpenAI subscription. The regular 5.2 models, and 5.2 Codex, are leagues better than GPT-4 when it comes to solving problems procedurally, using tools, and discussing scientific, mathematical, philosophical, and engineering problems in depth.
Models have increasingly long context windows, especially some of Google's. OpenAI has released very good image models, and great editing-focused image models in general have come out. Steadily improving multimodal inference is unlocking many cool near-term possibilities.
Additionally, we have seen some incredible open-source and open-weight models released this year, some fully commercially usable without restriction. And more and more small TTS/STT projects are in active development, with a few notable releases this year.
Honestly, the landscape at the end of the year is impressive. There has been great work all over the place, almost too much to keep up with. I'm very interested in the Genie models and a few others.
To give you an idea:
At the beginning of the year, I was mildly successful getting coding models to make changes in some of my codebases, but the more esoteric problems were out of reach. Progress in general was deliberate and required a lot of manual intervention.
By comparison, in the last week I've prototyped six applications at levels that would each have taken me days to weeks on my own, often developing several at the same time, monitoring agentic workflows and intervening only when necessary. I rely on long preproduction phases with architectural discussions and development of documentation, requirements, SDDs... plus detailed code review and refactoring processes to ensure adherence to constraints. I'm morphing from a very busy solo developer into a very busy product manager.
by soulofmischief
12/31/2025 at 12:04:29 PM
> Just the jump from Sonnet 3.5 to 3.7 to 4.5, and then Opus 4.5, has been pretty massive in terms of holistic reasoning, deep knowledge, and procedural and architectural adherence.

I don't really agree. Aside from how it handled frontend code, changes in Sonnet did not truly impact my overall productivity (from Sonnet 3.7 to 4 to 4.5; I did not try 3.5). Opus 4.5/Codex 5.2 are when the changes truly happened for me (and I'm still a bit distrustful of Codex 5.2, but I use it basically to help me during PRs).
by orwin
12/31/2025 at 6:18:12 PM
That's fine. Maybe you're holding it wrong, or maybe your work is too esoteric/niche/complex for newer models to be bigger productivity boosters. Some of mine certainly is, I get that. But for other stuff, these newer models are incredible productivity boosters.

I also chat with these models for long hours about deep, complicated STEM subjects and am very impressed with the level of holistic knowledge and wisdom compared to models a year ago. And the abstract math story has gotten sooooo much better.
by soulofmischief
12/31/2025 at 12:02:45 PM
> By comparison, in the last week I've prototyped six applications at levels that would each have taken me days to weeks on my own [...]

I don't doubt that the models have got better, but you can go back two or three years and find people saying the exact same stuff about the latest models back then.
by foldr
12/31/2025 at 2:32:30 PM
I don't think that's true of three years ago - that's taking us back into GPT-3 territory.

And two years ago we were mostly still stuck with GPT-4, which had an 8,000-token input context limit; very challenging to get real coding work done with that.
Easy enough to prove though: find some examples of people saying that 2-3 years ago and I shall concede the point!
by simonw
12/31/2025 at 3:27:31 PM
GPT-4 was released in March 2023, so it pretty clearly comes under the heading of “two or three years” ago. It’s only three months shy of its third birthday.

I see that 2023 LinkedIn has (deservedly) gone down your memory hole, but it is very easy to find innumerable examples of people saying this kind of thing:
https://www.reddit.com/r/ChatGPTCoding/comments/11zu7l7/i_bu...
by foldr
12/31/2025 at 7:21:01 PM
Good link, I shall concede the point!
by simonw
12/31/2025 at 6:15:13 PM
Crazy how progress works! It just keeps getting better, and people have rightfully noticed.
by soulofmischief
12/31/2025 at 2:35:25 PM
The problem is with how people interpret these results.

A paper comes out that says "we did a study of developers and found that AI assistance had no impact on their productivity" (using the state-of-the-art models available in September 2024), and a lot of people will point to that as incontestable evidence that "AI doesn't work".
by simonw
12/31/2025 at 3:34:35 AM
I’m glad someone else noticed the time frames - it turns out the lead author here has published 28 distinct preprints in the past 60 days, almost all of which are marked as officially published already or soon to be.

Certainly some scientists are just absurdly efficient, and all 28 involved teams, but that’s still a lot.
Personally speaking, this gives me second thoughts about their dedication to truly accurately measuring something as notoriously tricky as corporate SWE performance. Any number of cut corners in a novel & empirical study like this would be hard to notice from the final product, especially for casual readers… TBH, the clickbait title doesn’t help either!
I don’t have a specific critique on why 4 months is definitely too short to do it right tho. Just vibe-reviewing, I guess ;)
by bbor
12/31/2025 at 5:13:58 AM
Are they a PI with a lab? In this field, does the PI get first or last author?
by aaronblohowiak
12/31/2025 at 5:17:03 AM
For what it’s worth, I know this is likely intended to read as "the new generation of models will somehow be better than any paper will be able to gauge," but that hasn’t been my experience.

Results are getting worse and less accurate; hell, I even had Claude drop some Chinese into a response out of the blue one day.
by ActionHank
12/31/2025 at 9:33:38 AM
I absolutely cannot corroborate this; Opus 4.5 has been nothing but stellar.
by danielbln
12/31/2025 at 9:39:28 AM
Same here. While getting a command line for ffmpeg, instead of giving me the option "soft-knee" it used "soft-膝" (where 膝 is the Chinese for knee). It was easy to spot and figure out, but still... pretty rubbish ¯\_(ツ)_/¯
by mannycalavera42
12/30/2025 at 9:13:03 PM
> academic papers take 6-12 months to come out

It takes about 6 months to figure out how to get LaTeX to position figures where you want them (the usual incantations are sketched below), and then another 6 months to fight with reviewers.
by dheera
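A minimal sketch of those placement incantations, assuming a standard TeX distribution (the example-image placeholder ships with the mwe package); the [H] specifier, which pins a figure exactly where it appears, comes from the float package:

    \documentclass{article}
    \usepackage{graphicx}
    \usepackage{float}  % provides the [H] "put it exactly HERE" specifier

    \begin{document}

    % Let LaTeX choose: here, top of page, bottom of page, or a float page, in that order.
    \begin{figure}[htbp]
      \centering
      \includegraphics[width=0.8\linewidth]{example-image}  % placeholder graphic
      \caption{A figure LaTeX is allowed to move.}
    \end{figure}

    % Pin the figure exactly where it appears in the source.
    \begin{figure}[H]
      \centering
      \includegraphics[width=0.8\linewidth]{example-image}
      \caption{A figure that stays put.}
    \end{figure}

    \end{document}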
12/31/2025 at 12:01:28 AM
Couldn't AI help with the LaTeX?

Cutting it down to 6 minutes.
by zeristor
12/31/2025 at 1:06:52 AM
I have found it to be pretty bad at formatting tables.
by jsrozner
12/30/2025 at 11:07:07 PM
I knew in October the game had changed. Thanks for keeping us in the know.
by reactordev
12/31/2025 at 8:05:37 AM
I'm not sure what you mean by “the game has changed.” If you’re referring to Opus 4.5, it’s somewhat better, but it’s far from game-changing.
by mikasisiki
12/31/2025 at 12:28:20 PM
You’re looking in from the outside. I’m on the inside. This next generation of models will show it. It’s about to get wild.

We now have extremely large context windows, we now have memory, we now have recall, and we can now put an agent to a task for 24 hours.
by reactordev
12/30/2025 at 9:16:11 PM
Thanks Simon - always quick on the draw.

Off your intuition, do you think the same study with Codex 5.2 and Opus 4.5 would see even better results?
by joenot443
12/30/2025 at 9:20:20 PM
Depends on the participants. If they're cutting-edge LLM users then yes, I think so. If they continue to use LLMs like they would have back in the first half of 2025, I'm not sure a difference would be noticeable.
by simonw
12/30/2025 at 10:05:44 PM
I'm not remotely cutting edge (just switched from Cursor to Codex CLI, have no fancy tooling infrastructure, am not even vaguely considering git worktrees as a means of working), but Opus 4.5 and 5.2 Codex are both so clearly more competent than previous models that I've started just telling them to do high-level things rather than trying to break things down and give them subtasks.

If people are really set in their ways, maybe they won't try anything beyond what old models can do, and won't notice a difference, but who's had time to get set in their ways with this stuff?
by mkozlows
12/30/2025 at 10:41:00 PM
I mostly agree, but today Opus 4.5 via Claude Code did some pretty dumb stuff in my codebase: N queries where one would do, deep array comparison where a reference equality check would suffice, a very complex web of nested conditionals which a competent developer would never have written, some edge cases where the backend endpoints didn’t properly verify user permissions before overwriting data, etc.

It’s still hit or miss. The product “worked” when I tested it as a black box, but the code had a lot of rot in it already.
Maybe that stuff no longer matters. Maybe it does. Time will tell.
by christophilus
12/30/2025 at 10:47:04 PM
I have had a lot of success lately when working with Opus 4.5 using both the Beads task tracking system and the array of skills under the umbrella of Bad Dave's Robot Army. I don't have a link handy, but you should be able to find it on GitHub. I use the specialized skills for different review tasks (like Architecture Review, Performance Review, Security Review, etc.) on every completed task in addition to my own manual review, and I find that that helps to keep things from getting out of hand.
by remich
12/30/2025 at 11:12:08 PM
As someone who’s responsible for some very clean codebases and some codebases that grew over many years, warts and all, I always wonder if being subjected to large amounts of not-exactly-wonderful code has the same effect on an LLM that it arguably also has on human developers (myself included, occasionally): that they subconsciously lower their normally high bar for quality a bit, as in "well, there are quite a few smells here, let's go with the flow a bit and not overdo the quality".
by ManuelKiessling
12/31/2025 at 12:17:31 AM
I don't think they generally one-shot the tasks, but they do them well enough that you can review the diff, make requests for changes, and have it succeed in a good outcome more quickly than if you were spoon-feeding it little tasks and checking them as you go (as you used to have to do).
by mkozlows
12/31/2025 at 4:43:04 AM
Also not a cutting-edge user, but I do run my own LLMs at home and have been spending a lot of time with Claude CLI over the last few months.

It's fine if you want Claude to design your APIs without any input, but you'll have less control, and when you dig down into the weeds you'll realise it's created a mess.
I like to take both a top-down and bottom-up approach - design the low-level API with Claude fleshing out how it's supposed to work, then design the high-level functionality, and then tell it to stop implementing when it hits a problem reconciling the two and the lower-level API needs revision.
At least for things I'd like to stand the test of time; if it's just a throwaway script or tool, I care much less as long as it gets the job done.
by nineteen999
12/30/2025 at 10:54:51 PM
What's the difference between using LLMs now vs the first half of 2025, among the best users?
by drbojingle
12/30/2025 at 10:59:44 PM
Coding agents and much better models. Claude Code or Codex CLI plus Claude Opus 4.5 or GPT 5.2 Codex.

The latest models and harnesses can crunch on difficult problems for hours at a time and get to working solutions. Nothing could do that back in ~March.
I shared some examples in this comment: https://news.ycombinator.com/item?id=46436885
by simonw
12/30/2025 at 11:27:56 PM
OK, I will bite.

Every single example you gave is in hobby-project territory. Relatively self-contained, maintainable by 3-4 devs max, within 1k-10k lines of code. I've been successfully using coding agents to create such projects for the past year and it's great, I love it.
However, lots of us here work on codebases that are 100x, 1000x the size of the projects you and Karpathy are talking about. Years of domain-specific code. From personal experience, coding agents simply don't work at that scale the same way they do for hobby projects. Over the past year or two, I did not see any significant improvement from any of the newest models.
Building a slightly bigger hobby project is not even close to making these agents work at industrial scale.
by William_BB
12/31/2025 at 5:10:28 AM
I think that in general there is a big difference between JavaScript/TypeScript projects, big or small, and other projects that actually address a specific problem domain. These two are not the same. The same Claude Code agent can create a lot of the parts of a functional web project, but will struggle to provide anything more than a base frame for you to build on if you were to add support for a new SoC in some drone firmware.

The problem is that everyone working on those more serious projects knows that and treats LLMs accordingly, but the people who come from the web space come in with the expectation that they can replicate the success they have in their domain just as easily, when oftentimes you need to have some domain knowledge.
I think the difference simply comes down to the sheer volume of training material, i.e. web projects on GitHub. Most "engineers" are actually just framework consumers, and within those frameworks LLMs work great.
by rjzzleep
12/31/2025 at 12:00:23 AM
Most of the stuff I'm talking about here came out in November. There hasn't been much time for professional teams to build new things with it yet, especially given the holidays!
by simonw
12/31/2025 at 9:03:41 AM
For what it's worth, I'm working with it on a huge professional monorepo, and the difference was also stark.
by qweiopqweiop
12/31/2025 at 5:14:10 AM
For what it’s worth, I have Claude coding away at an Unreal Engine codebase. That’s a pretty large C++ codebase and it’s having no trouble at all. Just a cool several million lines of C++, lovely.
by reactordev
12/31/2025 at 3:30:27 AM
Everything is made of smaller parts. I'd like to think we can at least subdivide a codebase into isolated modules.
by drbojingle
12/31/2025 at 6:52:35 PM
Depends on what kinds of problems you're solving...

I'd put it in line with monolith vs microservices... you're shifting complexity somewhere, whether onto orchestration or into the codebase. In the end, the piper gets paid.
Also, not all problems can be broken down cleanly into smaller parts.
by tracker1
12/31/2025 at 4:32:09 AM
In the real world, not all problems decompose nicely. In fact, I think it may be the case that the problems we actually get paid to solve with code are often of this type.
by devin
1/1/2026 at 3:38:10 PM
Problems like?
by drbojingle
12/31/2025 at 12:30:10 AM
That’s right, but it also hints at a solution: split big codebases into parts that are roughly the size of a big hobby project. You’ll need to write some docs to be effective at it, which also helps agents. CI/CD means continuous integration, continuous documentation now.
by baq
12/31/2025 at 1:49:50 AM
Splitting one big codebase into 100 microservices always seems tempting, except that big codebases are already organized into modules, and that doesn't stop one module's concerns from polluting the other modules' code. What you've got now is 100 different repositories that all depend on each other, get deployed separately, and can only be tested with some awful docker-compose setup. Frankly, given the impedance of hopping back and forth between repos separated by APIs, I'd expect an LLM to do far worse in a microservice ecosystem than in an equivalent monolith.
by bccdee
12/31/2025 at 1:05:35 AM
I wonder if anyone has tried this thing before, like... micro-projects or such... ;)
by majormajor
12/31/2025 at 5:12:22 AM
It's not the size that's the issue; it's the domain. It's tempting to say that adding drivers to Linux is hard because Linux is big, but that's not the issue.
by rjzzleep
12/31/2025 at 2:52:29 AM
I worked at Slack earlier this year. Slack adopted Cursor as an option in December of 2024, if memory serves correctly. I had just had a project cut due to a lot of unfortunate reasons, so I was working on it with one other engineer. It was a rewrite of a massive and old Python codebase that ran Slack's internal service catalog. The only reason I was able to finish rewrites of the backend and frontend, and build an SLO sub-system, is coding agents. Up until December I'd been doing that entire rewrite through sixteen-hour days and pure sweat equity.

Again, that codebase is millions of lines of Python code, and frankly the agents weren't as good then as they are now. I carefully used globbing rules in Cursor to navigate coding and testing standards (roughly like the sketch after this comment). I had a rule that functioned the way people use agents.md now, which was put on every prompt. That honestly got me a lot more mileage than you'd think. A lot of the outcome with these tools comes down to how you use them and how good your developer experience is. If professional software engineers have to think about how to navigate and iterate on different parts of your code, then an LLM will find it doubly difficult.
by oooyay
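For illustration only: in current Cursor, a glob-scoped project rule is a Markdown file under .cursor/rules/ (say, a hypothetical .cursor/rules/python-standards.mdc) with YAML frontmatter; the description, globs, and rule text below are invented, and the exact keys may differ from what Cursor offered when Slack adopted it:

    ---
    description: Coding and testing standards for backend Python services
    globs: "services/**/*.py"
    alwaysApply: false
    ---

    <!-- Hypothetical rule body, attached whenever a matching file is in context -->
    - Follow the existing module layout; do not add new top-level packages.
    - Every endpoint change needs a corresponding permission-check test.
    - Use the shared logging helpers instead of print().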
12/31/2025 at 8:40:59 AM
Cool, but most developers do mundane stuff like gluing APIs together and implementing business logic, which requires oversight and review.

Those crunching hard problems will still review what's produced in search of issues.
by epolanski
12/31/2025 at 10:16:14 AM
What is (in general) mundane about business logic? This can be highly complex, with deep process integration all over your modules.
by generic92034
12/31/2025 at 8:47:58 PM
Which is why it requires detailed oversight.
by epolanski
12/31/2025 at 12:22:11 AM
I was going back and looking at timelines, and was shocked to realize that Claude Code and Cursor's default-to-agentic-mode changes both came out in late February. Essentially the entire history of "mainstream" agentic coding is ten months old.

(This helps me better understand the people who are confused/annoyed/dismissive about it, because I remember how dismissive people were about Node, about Docker, about Postgres, about Linux when those things were new too. So many arguments where people would passionately insist that all those things were irredeemably stupid and only suitable for toy/hobby projects.)
by mkozlows
12/31/2025 at 3:54:45 AM
The entire history of RL-trained "reasoning models", from o1 to DeepSeek-R1, is basically just a year old!
by HarHarVeryFunny
12/31/2025 at 3:52:25 AM
Are there techniques though? Tech pairing? Something we know now that we didn't then? Or just better models?
by drbojingle
12/31/2025 at 4:43:03 AM
Lots of technique stuff. A common observation among LLM nerds is that if the models stopped being improved and froze in time for a year, we could still spend all twelve months discovering new capabilities and use-cases for the models we already have.
by simonw
12/31/2025 at 1:39:57 PM
Any specifics you'd recommend?
by drbojingle