Claude Code Found a Linux Vulnerability Hidden for 23 Years

4/4/2026 at 12:35:21 PM

Pasting a big batch of new code and asking Claude "what have I forgotten? Where are the bugs?" is a very persuasive on-ramp for developers new to AI. It spots threading & distributed system bugs that would have taken hours to uncover before, and where there isn't any other easy tooling.

I bet there's loads of cryptocurrency implementations being pored over right now - actual money on the table.

by mattbee

4/4/2026 at 6:48:04 PM

I like biasing it towards the fact that there is a bug, so it can't just say "no bugs! all good!" without looking into it very hard.

Usually I ask something like this:

"This code has a bug. Can you find it?"

Sometimes I also tell it that "the bug is non-obvious"

Which I've anecdotally found to have a higher rate of success than just asking for a spot check

by merlindru

4/4/2026 at 11:33:51 PM

Do you not run into too many false positives around "ah, this thing you used here is known to be tricky, the issue is..."

I've seen that when prompting it to look for concurrency issues vs saying something more like "please inspect this rigorously to look for potential issues..."

by majormajor

4/4/2026 at 11:59:55 PM

What's more useful is to have it attempt to not only find such bugs but prove them with a regression test. In Rust, for concurrency tests write e.g. Shuttle or Loom tests, etc.
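
A toy illustration of the idea behind such tests, written in Python rather than Rust and not using Shuttle's or Loom's actual API: instead of hoping a race shows up at runtime, enumerate every interleaving of the steps and check the invariant under each one. The two-worker read/write model and all names here are assumptions made for the sketch.

```python
from itertools import permutations

def run(schedule):
    """Run two unsynchronised increments under one fixed schedule.

    Each worker performs a read step, then a write step. `schedule` is a
    sequence of worker ids, e.g. (0, 1, 0, 1), giving the order of steps.
    """
    shared = {"counter": 0}
    local = {}             # value each worker read from the counter
    step = {0: 0, 1: 0}    # next step index per worker
    for worker in schedule:
        if step[worker] == 0:
            local[worker] = shared["counter"]        # read
        else:
            shared["counter"] = local[worker] + 1    # write back
        step[worker] += 1
    return shared["counter"]

def all_schedules():
    """Every distinct ordering of two steps from worker 0 and two from worker 1."""
    return sorted(set(permutations((0, 0, 1, 1))))

def find_lost_updates():
    """Schedules under which one of the two increments is lost."""
    return [s for s in all_schedules() if run(s) != 2]

failing = find_lost_updates()
```

Any schedule in `failing` is a deterministic reproduction of the bug, which is what makes it usable as a regression test.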

by cmrdporcupine

4/5/2026 at 12:08:32 AM

It would be generally good if most code made setting up such tests as easy as possible, but in most corporate codebases this second step is gonna require a huge amount of refactoring or boilerplate crap to get the things interacting in the test env in an accurate, well-controlled way. You can quickly end up fighting to understand "is the bug not actually there, or is the attempt to repro it not working correctly?"

(Which isn't to say don't do it: I think this is a huge benefit you can gain from being able to refactor more quickly. Just to say that you're gonna short-term give yourself a lot more homework to make sure you don't fix things that aren't bugs, or break other things in your quest to make them more provable/testable.)

by majormajor

4/5/2026 at 1:45:25 AM

That is an unfortunate case you described, but also, git gud and write tests in the first place so you don't need to refactor things down the road.

by simulator5g

4/6/2026 at 2:47:57 PM

yes but i can identify those easily. i know that if it flags something that is obviously a non issue, i can discard it.

...because false positives are good errors. false negatives is what i'm worried about.

i feel massively more sure that something has no big oversights if multiple runs (or even multiple different models) cannot find anything but false positives

by merlindru

4/4/2026 at 9:01:44 PM

Just in case you didn't read the full article, this is how they describe finding the bugs in the Linux kernel as well.

Since it's a large codebase, they go even more specific and hint that the bug is in file A, then try again with a hint that the bug is in file B, and so on.
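
A minimal sketch of that per-file loop, assuming the prompt wording quoted upthread; the exact wording used in the article is not given here, and the file names are hypothetical placeholders.

```python
def build_prompts(files, base="This code has a bug. Can you find it?"):
    """Return one prompt per file, each hinting that the bug is in that file."""
    return [f"{base} Hint: the bug is in {path}." for path in files]

# Hypothetical file names, purely for illustration.
prompts = build_prompts(["a.c", "b.c"])
for prompt in prompts:
    print(prompt)
```

Each prompt would then be sent as its own independent run, so a false lead in one file does not bias the next.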

by Nition

4/6/2026 at 2:51:08 PM

very interesting. i think "verbal biasing" and "knowing how to speak" in general is a really important thing with LLMs. it seems to massively affect output. (interestingly, somewhat less with Opus than with GPT-5.4 and Composer 2. Opus seems to intuit a little better. but still important.)

it's like the idea behind the book _The Mom Test_ suddenly got very important for programming

by merlindru

4/5/2026 at 12:16:45 AM

As a meta activity, I like to run different codebases through the same bug-hunt prompt and compare the number found as a barometer of quality.

I was very impressed when the top three AIs all failed to find anything other than minor stylistic nitpicks in a huge blob of what to me looked like “spaghetti code” in LLVM.

Meanwhile at $dayjob the AI reviews all start with “This looks like someone’s failed attempt at…”

by jiggawatts

4/5/2026 at 5:33:29 AM

> so it can't just say "no bugs! all good!"

If anyone, or anything, ever answers a question like that, you should stop asking it questions.

by kgwxd

4/4/2026 at 10:12:47 PM

You just have to be careful because it will sometimes spot bugs you could never uncover because they’re not real. You can really see the pattern matching at work with really twisted code. It tends to look at things like lock-free algorithms and declare them full of bugs regardless of whether they are or not.

by wat10000

4/5/2026 at 9:44:42 AM

I have seen it start on a sentence, get lost and finish it with something like "Scratch that, actually it's fine."

And if it's not giving me a reason I can understand for a bug, I'm not listening to it! Mostly it is showing me I've mixed up two parameters, forgotten to initialise something, or referenced a variable from a thread that I shouldn't have.

The immediate feedback means the bug usually gets a better-quality fix than it would if I had got fatigued hunting it down! So variables get renamed to make sure I can't get them mixed up, a function gets broken out. It puts me in the mind of "well make sure this idiot can't make that mistake again!"

by mattbee

4/4/2026 at 1:12:10 PM

> Pasting a big batch of new code and asking Claude "what have I forgotten? Where are the bugs?"

It's actually the main way I use CC/codex.

by dvfjsdhgfv

4/4/2026 at 1:45:36 PM

I find Codex sufficiently better for it that I’ve taught Claude how to shell out to it for code reviews

by petesergeant

4/4/2026 at 2:32:29 PM

Ditto, I made a "/codex-review" skill in Claude Code that reviews the last git commit and writes an analysis of it for Claude Code to then work. I've had very good luck with it.

One particularly striking example: I had CC do some work and then kicked off a "/codex-review" and while it was running went to test the changes. I found a deadlock but when I switched back to CC the Codex review had found the deadlock and Claude Code was already working on a fix.

by linsomniac

4/5/2026 at 12:02:33 AM

I think OpenAI has actually released an official version of exactly this: https://community.openai.com/t/introducing-codex-plugin-for-...

https://github.com/openai/codex-plugin-cc

I actually work the other way around. I have codex write "packets" to give to claude to write. I have Claude write the code. Then have Codex review it and find all the problems (there's usually lots of them).

Only because this month I have the $100 Claude Code and the $20 Codex. I did not renew Anthropic though.

by cmrdporcupine

4/4/2026 at 8:04:33 PM

Yeah and it comes with the blood of children included

by motbus3

4/5/2026 at 7:21:41 AM

I usually do several passes of "review our work. Look for things to clean up, simplify, or refactor." It does usually improve the quality quite a lot; then I rewind history to before, but keep the changes, and submit the same prompt again, until it reaches the point of diminishing returns.

by 9dev

4/5/2026 at 3:49:27 AM

> It spots threading & distributed system bugs that would have taken hours to uncover before, and where there isn't any other easy tooling.

Go has a built in race detector which may be useful for this too: https://go.dev/doc/articles/race_detector

Unsure if it's suitable for inclusion in CI, but seems like something worth looking into for people using Go.

by justinclift

4/5/2026 at 4:40:24 PM

ive gone down this rabbit hole and i dunno, sometimes claude chases a smoking gun that just isn't a smoking gun at all. if you ask him to help find a vulnerability he's not gonna come back empty handed even if there's nothing there, he might frame a nice to have as a critical problem. in my exp you have to build tests that prove vulnerabilities in some way. otherwise he's just gonna rabbithole while failing to look at everything.

ive had some remarkable successes with claude and quite a few "well that was a total waste of time" efforts with claude. for the most part i think trying to do uncharted/ambitious work with claude is a huge coinflip. he's great for guardrailed and well understood outcomes though, but im a little burnt out and unexcited at hearing about the gigantic-claude exercises.

by trueno

4/4/2026 at 3:47:16 PM

> "Codex wrote this, can you spot anything weird?"

by slig

4/4/2026 at 9:46:38 AM

Not "hidden", but probably more like "no one bothered to look".

> declares a 1024-byte owner ID, which is an unusually long but legal value for the owner ID.

When I'm designing protocols or writing code with variable-length elements, "what is the valid range of lengths?" is always at the front of my mind.

> it uses a memory buffer that’s only 112 bytes. The denial message includes the owner ID, which can be up to 1024 bytes, bringing the total size of the message to 1056 bytes. The kernel writes 1056 bytes into a 112-byte buffer

This is something a lot of static analysers can easily find. Of course asking an LLM to "inspect all fixed-size buffers" may give you a bunch of hallucinations too, but could be a good starting point for further inspection.
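
The arithmetic from the quoted passage, worked through as a sketch. The three sizes come from the quote; treating the remaining 32 bytes as a fixed-size part of every message is an inference from 1056 - 1024, not something the quote states, and the function names are made up for illustration.

```python
BUF_SIZE = 112          # fixed buffer size from the quote
MAX_OWNER_ID = 1024     # longest legal owner ID from the quote
MESSAGE_TOTAL = 1056    # total message size from the quote

# Assumption: everything that is not the owner ID has a fixed size.
FIXED_PART = MESSAGE_TOTAL - MAX_OWNER_ID   # 32 bytes
OVERFLOW = MESSAGE_TOTAL - BUF_SIZE         # 944 bytes written past the end

def fits(owner_id_len, buf_size=BUF_SIZE, fixed_part=FIXED_PART):
    """The length check that has to hold before copying into the buffer."""
    return fixed_part + owner_id_len <= buf_size

def max_safe_owner_id(buf_size=BUF_SIZE, fixed_part=FIXED_PART):
    """Longest owner ID that still fits under the assumption above."""
    return buf_size - fixed_part
```

Under that assumption, any owner ID longer than 80 bytes already overruns the buffer, long before the 1024-byte maximum.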

by userbinator

4/4/2026 at 5:19:13 PM

"No one bothered to look" is how most vulnerabilities work. Systems development produces code artifacts with compounding complexity; it is extraordinarily difficult to keep up with it manually, as you know. A solution to that problem is big news.

Static analyzers will find all possible copies of unbounded data into smaller buffers (especially when the size of the target buffer is easily deduced). It will then report them whether or not every path to that code clamps the input. Which is why this approach doesn't work well in the Linux kernel in 2026.

by tptacek

4/4/2026 at 6:03:35 PM

With a capable static analyzer that is not true. In many common cases they can deduce the possible ranges of values based on branching checks along the data flow path, and if that range falls within the buffer then it does not report it.
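
A toy sketch of that kind of range deduction, not modelled on any particular analyzer: a value's possible range is narrowed by a branch check, and a copy is reported only if the narrowed range can still exceed the buffer. The `Interval` type, the numbers, and the function names are all assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Interval:
    lo: int
    hi: int

    def clamp_max(self, limit):
        """Range of the value on the branch where `value <= limit` held."""
        return Interval(self.lo, min(self.hi, limit))

    def add(self, n):
        """Range after adding a constant, e.g. a fixed-size header."""
        return Interval(self.lo + n, self.hi + n)

def should_report(copy_len, buf_size):
    """Report only if some value in the range can exceed the buffer."""
    return copy_len.hi > buf_size

length = Interval(0, 65535)                 # untrusted 16-bit length field
unchecked = length.add(32)                  # no check on this path
checked = length.clamp_max(80).add(32)      # path guarded by `if (len <= 80)`

print(should_report(unchecked, 112))  # True: this path can overflow
print(should_report(checked, 112))    # False: range is [32, 112]
```

The hard part in a real codebase is proving that every path into the copy passes through such a check, which is where the disagreement in this subthread lies.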

by rubendev

4/4/2026 at 7:09:58 PM

Be specific. Which analyzer are you talking about and which specific targets are you saying they were successful at?

by tptacek

4/4/2026 at 9:59:45 PM

Intrinsa's PREfix static source code analyzer would model the execution of the C/C++ code to determine values which would cause a fault.

IIRC they were using a C/C++ compiler front end from EDG to parse C/C++ code to a form they used for the simulation/analysis.

see https://web.eecs.umich.edu/~weimerw/2006-655/reading/bush-pr... for more info.

Microsoft bought Intrinsa several years ago.

by canucker2016

4/4/2026 at 11:11:04 PM

I'm sure this is very interesting work, but can you tell me what targets they've been successful surfacing exploitable vulnerabilities on, and what the experience of generating that success looked like? I'm aware of the large literature on static analysis; I've spent most of my career in vulnerability research.

by tptacek

4/5/2026 at 1:47:18 AM

PREfix wasn't designed specifically for finding exploitable bugs - it was aimed somewhere in between Purify (runtime bug detection) and being a better lint.

One of the articles/papers I recall was that the big problem for PREfix when simulating the behaviour of code was the explosion in complexity if a given function had multiple paths through it (e.g. multiple if's/switch statements). PREfix had strategies to reduce the time spent in these highly complex functions.

Here's a 2004 link that discusses the limitations of PREfix's simulated analysis - https://www.microsoft.com/en-us/research/wp-content/uploads/...

The above article also talks about Microsoft's newer (for 2004) static analysis tools.

There's a Netscape engineer endorsement in a CNet article when they first released PREfix. see https://www.cnet.com/tech/tech-industry/component-bugs-stamp...

by canucker2016

4/5/2026 at 3:45:27 PM

But what was the likelihood of this bug to be exploited by malicious actors?

by 3abiton

4/5/2026 at 5:00:28 PM

I don't understand the question.

by tptacek

4/4/2026 at 4:18:46 PM

> Not "hidden", but probably more like "no one bothered to look".

Well yeah. There weren't enough "someones" available to look. There are a finite number of qualified individuals with time available to look for bugs in OSS, resulting in a finite amount of bug finding capacity available in the world.

Or at least there was. That's what's changing as these models become competent enough to spot and validate bugs. That finite global capacity to find bugs is now increasing, and actual bugs are starting to be dredged up. This year will be very very interesting if models continue to increase in capability.

by mrshadowgoose

4/4/2026 at 8:53:31 PM

I was just thinking about this and what it means for closed source code.

Many people with skin in the game will be spending tokens on hardening OSS bits they use, maybe even part of their build pipelines, but if the code is closed you have to pay for that review yourself, making you rather uncompetitive.

You could say there's no change there, but the number of people who can run a Claude review and the number of people who can actually review a complicated codebase are several orders of magnitude apart.

Will some of them produce bad PRs? Probably. The battle will be to figure out how to filter them at scale.

by literalAardvark

4/4/2026 at 10:14:45 PM

I have no doubt that LLMs can be as good at analyzing binaries as at analyzing source code.

An avalanche of 0-day in proprietary code is coming.

by dolmen

4/4/2026 at 9:51:40 AM

> This is something a lot of static analysers can easily find.

And yet they didn't (either no one ran them, or they didn't find it, or they did find it but it was buried in hundreds of false positives) for 20+ years...

I find it funny that every time someone does something cool with LLMs, there's a bunch of takes like this: it was trivial, it's just not important, my dad could have done that in his sleep.

by NitpickLawyer

4/4/2026 at 9:54:07 AM

Remember Heartbleed in OpenSSL? That long predated LLMs, but same story: some bozo forgot how long something should/could be, and no one else bothered to check either.

by userbinator

4/4/2026 at 2:26:57 PM

Hey we are the bozos

by dlopes7

4/4/2026 at 2:35:14 PM

Let's all get together and self-reflect on the bozos way.

by braiamp

4/4/2026 at 10:43:45 PM

I believe that once the OpenBSD team started cleaning up some of the other gross coding style stuff as part of their fork into LibreSSL that even fairly simplistic static analysis tools could spot the underlying bugs that caused heartbleed.

by sam_bristow

4/5/2026 at 12:49:34 AM

The bug that caused Heartbleed was extremely obvious: read a u16 out of a packet, copy that many bytes of the source packet into the reply packet. If someone put that code in front of you in isolation you would spot it instantly (if you know C). The problem --- this is hugely the case with most memory safety bugs --- is that it's buried under a mountain of OpenSSL TLS protocol handling details. You have to keep resident in your brain what all the inputs to the function are, and follow them through the code.
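
A heavily simplified simulation of that pattern in Python, not OpenSSL's code: the real message also has a type byte and padding, which are left out here. Process memory is modelled as a `bytearray` with unrelated data sitting right after the packet, and all names are hypothetical.

```python
import struct

def heartbeat_buggy(memory, offset, packet_len):
    """Trust the length field inside the packet; packet_len is ignored."""
    (claimed,) = struct.unpack_from(">H", memory, offset)
    start = offset + 2
    return bytes(memory[start:start + claimed])

def heartbeat_checked(memory, offset, packet_len):
    """Reject a claimed length larger than the payload actually received."""
    (claimed,) = struct.unpack_from(">H", memory, offset)
    if claimed > packet_len - 2:
        return None
    start = offset + 2
    return bytes(memory[start:start + claimed])

# Simulated memory: a 4-byte packet followed by unrelated secret data.
packet = struct.pack(">H", 10) + b"hi"      # claims 10 bytes, carries 2
memory = bytearray(packet + b"SECRETKEY")

leaked = heartbeat_buggy(memory, 0, len(packet))     # b"hiSECRETKE"
refused = heartbeat_checked(memory, 0, len(packet))  # None
```

In isolation the missing comparison against `packet_len` is easy to see; the point above is that in the real code the equivalent of `packet_len` had to be tracked across a lot of surrounding protocol handling.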

by tptacek

4/4/2026 at 3:47:46 PM

It's much, much, easier to run an LLM than to use a static or dynamic analyzer correctly. At the very least, the UI has improved massively with "AI".

by choeger

4/4/2026 at 10:22:46 PM

Most people have no idea how hard it is to run static analysis on C/C++ code bases of any size. There are a lot of ways to do it wrong that eat a ton of memory/CPU time or start pruning things that are needed.

If you know what you're doing you can split the code up in smaller chunks where you can look with more depth in a timely fashion.

by pixl97

4/4/2026 at 4:27:30 PM

And even if that's true (and it frequently is!), detractors usually miss the underlying and immense impact of "sleeping dad capability" equivalent artificial systems.

Horizontally scaling "sleeping dads" takes decades, but inference capacity for a sleeping dad equivalent model can be scaled instantly, assuming one has the hardware capacity for it. The world isn't really ready for a contraction of skill dissemination going from decades to minutes.

by mrshadowgoose

4/4/2026 at 10:23:40 PM

There’s the classic case of the Debian OpenSSL vulnerability, where technically illegal but practically secure code was turned into superficially correct but fundamentally insecure code in an attempt to fix a bug identified by a (dynamic, in this case) analyzer.

by wat10000

4/4/2026 at 1:41:45 PM

Most likely no one ran them, given the developer culture.

by pjmlp

4/4/2026 at 2:37:25 PM

I replicated this experiment on several production codebases and got several crits. Lots of dupes, lots of false positives, lots of bugs that weren't actually exploitable, lots of accepted/ known risks. But also, crits!

by DGAP

4/4/2026 at 9:55:58 PM

I think this really needs to be part of the message. It's great that Claude found a vulnerability that apparently has been overlooked for a long time. It's even proper for Anthropic to tout the find. But we should all ask about the signal to nose ratio that would have been part of the process. If it only was successful... That would be worth touting, too. But I expect there was more noise than they'd care to admit.

Or put another way, the context matters.

by sbuttgereit

4/6/2026 at 2:21:52 PM

I have to agree with you. We don’t talk nearly enough about the real signal to nose ratio.

(Sorry. I couldn’t resist lol)

by jakeasmith

4/4/2026 at 9:56:07 PM

Every time I read these titles, I wonder if people are for some reason pushing the narrative that Claude is way smarter than it really is, or if I'm using it wrong.

They want me to code AI-first, and the amount of hallucinations and weird bugs and inconsistencies that Claude produces is massive.

Lots of code that it pushes would NOT have passed a human/human code review 6 months ago.

by altern8

4/4/2026 at 10:17:35 PM

Apart from obvious PR (if you need to lean into the AI wave a bit, this of all places is it) and fanboyism, which is just part of human nature, why can't both be true?

It can properly excel in some things while being less than helpful in others. These are computers from the beginning, 1000x rehashed and now with an extra twist.

by kakacik

4/4/2026 at 10:10:03 PM

It's always the inconsistencies which amaze me, from the article:

> I have so many bugs in the Linux kernel that I can’t report because I haven’t validated them yet

You have "so many?" Are they uncountable for some reason? You "haven't validated" them? How long does that take?

> found a total of five Linux vulnerabilities

And how much did it cost you in compute time to find those 5?

These articles are always fantastically light on the details which would make their case for them. Instead it's always breathless prognostication. I'm deeply suspicious of this.

by themafia

4/4/2026 at 10:26:17 PM

>And how much did it cost you in compute time to find those 5?

This is the last thing I'd worry about if the bug is serious in any way. You have attackers like nation states that will have huge budgets to rip your software apart with AI and exploit your users.

Also there have been a number of detailed articles about AI security findings recently.

by pixl97

4/6/2026 at 2:25:12 PM

Yeah, this was one of my first thoughts too. It’s impossible to know but I wonder how many of these “unknown exploits” have been in use by government agencies for years already. Or decades, apparently.

by jakeasmith

4/4/2026 at 10:53:42 PM

I'd be interested in how it compares (in terms of time, money and false positives) with fuzzing.

by spzb

4/4/2026 at 11:17:52 PM

You are suspicious because you probably haven't worked anywhere that's AI-first. Anyone that's worked at a modern tech company will find this absolutely believable.

Like what, you expect Nicholas to test each vuln when he has more important work to do (ie his actual job?)

by xvector

4/4/2026 at 10:13:37 PM

What models are you using, on what type of codebases, with what tools?

by chrisra

4/5/2026 at 2:21:00 PM

Not OC, but I tried OpenCode with Gemini, Claude and Kimi, and all of them were completely unable to solve any non-trivial problems which are not easily solved with some existing algorithm.

I understand how people use those tools if all they do is build CRUD endpoints and UIs for those endpoints (which is admittedly what most programmers probably do for their job). But for anything that requires any sort of problem solving skills, I don't understand how people use them. I feel like I live in a completely different world from some of the people who push agentic coding.

by flexagoon

4/6/2026 at 10:42:59 AM

I'm using Claude Code with the latest version of Sonnet, using the official VS Code extension.

At my company they set it up that way.

by altern8

4/4/2026 at 9:57:33 AM

> "given enough eyeballs, all bugs are shallow"

Time to update that:

"given 1 million tokens context window, all bugs are shallow"

by dist-epoch

4/4/2026 at 11:11:03 AM

Already happened: https://arxiv.org/abs/2407.08708

by summarity

4/4/2026 at 11:29:41 AM

more like some bugs are shallow and others are pieced together false-positives from an automated tool reliable in its unreliability.

by bigbugbag

4/4/2026 at 10:38:00 AM

..and three months to review the false positives

by riffraff

4/4/2026 at 10:50:55 AM

this is always overlooked. AI stories sound like "with right attitude, you too can win 10M $ in lottery, like this man just did"

Running LLM on 1000 functions produces 10000 reports (these numbers are accurate because I just generated them) — of course only the lottery winners who pulled the actually correct report from the bag will write an article in Evening Post

by 112233

4/4/2026 at 12:04:37 PM

> these numbers are accurate because I just generated them

Is it sarcasm, or you really did this? Claude Opus 4.6?

by red75prime

4/4/2026 at 5:12:19 PM

Those 3 letter agencies are going to see their stash of 0-days dwindle so hard.

by PeterStuer

4/4/2026 at 11:13:19 PM

Their stash will explode. LLMs can do this on binaries just the same, and there's a lot more closed than open source SW out there.

by tverbeure

4/5/2026 at 5:35:49 AM

And they also have a nearly infinite budget to rent AI time to do this type of work.

by EasyMark

4/4/2026 at 6:52:37 PM

Interestingly, I think 3 or 4 out of the 5 bugs would have been prevented / mitigated quite well using https://github.com/anthraxx/linux-hardened patches...

(disabled io_uring, would have crashed the kernel on UAF, and made exploitation of the heap overflow very unreliable)

by fguerraz

4/4/2026 at 11:14:43 AM

Related work from our security lab:

Stream of vulnerabilities discovered using security agents (23 so far this year): https://securitylab.github.com/ai-agents/

Taskflow harness to run (on your own terms): https://github.blog/security/how-to-scan-for-vulnerabilities...

by summarity

4/4/2026 at 12:25:51 AM

This does sound great, but the cost of tokens will prevent most companies from using agents to secure their code.

by jazz9k

4/4/2026 at 1:00:08 AM

Tokens are insanely cheap at the moment. Through OpenRouter a message to Sonnet costs about $0.001 cents or using Devstral 2512 it's about $0.0001. An extended coding session/feature expansion will cost me about $5 in credits. Split up your codebase so you don't have to feed all of it into the LLM at once and it's very reasonable.

by KetoManx64

4/4/2026 at 3:38:09 AM

It cost me ~$750 to find a tricky privilege escalation bug in a complex codebase where I knew the rough specs but didn't have the exploit. There are certainly still many other bugs like that in the codebase, and it would cost $100k-$1MM to explore the rest of the system that deeply with models at or above the capability of Opus 4.6.

It's definitely possible to do a basic pass for much less (I do this with autopen.dev), but it is still very expensive to exhaustively find the harder vulnerabilities.

by lebovic

4/4/2026 at 12:34:04 PM

This is where the Codex and Claude Code Pro/Max plans are excellent. I rarely run into the limits of Codex. If I do, I wait and come back and have it resume once the window has expired.

by christophilus

4/4/2026 at 1:24:38 PM

Claude and Codex pro/max subs aren't supposed to be used for commercial/enterprise development so it's not really an option for execs in enterprise. They need to take into account API costs.

At my F500 company execs are very wary of the costs of most of these tools and it's always top of mind. We have dashboards and gather tons of internal metrics on which tools devs are using and how much they are costing.

by Jcampuzano2

4/4/2026 at 5:08:26 PM

No, I think that’s wrong. They aren’t supposed to be put behind a service, but they can certainly be used to write professional products/ products for the enterprise.

by christophilus

4/4/2026 at 1:48:32 PM

Are they also measuring productivity? Measuring only token costs is like looking only at grocery spend but not the full receipt: you don’t know whether you fed your family for a week or for only a day.

by otterley

4/4/2026 at 7:26:08 PM

I'm not one of those execs, I'm just echoing what they tell us from those I've talked to who manage these dashboards and worry about this. I do think measuring productivity is not very clear-cut especially with these tools.

They do "attempt" to measure productivity. But they also just see large dollar amounts on AI costs and get wary.

My company is also wary of going all in with any one tool or company due to how quickly stuff changes. So far they've been trying to pool our costs across all tools together and give us an "honor system" limit we should try not to go above per month until we do commit to one suite of tools.

by Jcampuzano2

4/4/2026 at 3:42:32 PM

First you have to figure out HOW to measure productivity.

by batshit_beaver

4/4/2026 at 7:01:01 PM

(Output / input), both of which are usually measured in money. If you can measure both of those things--and you have bigger problems if your finance department can't--it logically follows that you can measure productivity.

by otterley

4/4/2026 at 7:29:39 PM

Measuring strictly in terms of money per unit time over a small enough timeframe is difficult because not all tasks directly result in immediately observed results.

There are tasks worked on at large enterprises that have 5+ year horizons, and those can't all immediately be tracked in terms of monetary gain that can be correlated with AI usage. We've barely even had AI as a daily tool used for development for a few years.

by Jcampuzano2

4/4/2026 at 1:48:31 PM

> Claude and Codex pro/max subs aren't supposed to be used for commercial/enterprise development

lolwut?

by petesergeant

4/4/2026 at 3:26:45 PM

Read ToS.

by blks

4/4/2026 at 5:01:34 PM

I just did. Tell me where it states what you are claiming. Neither my reading (IANAL) nor ChatGPT’s reading could find such a blanket ban:

https://www.anthropic.com/legal/consumer-terms

by monocularvision

4/4/2026 at 5:26:24 PM

From your link:

> Non-commercial use only. You agree that you will not use our Services for any commercial or business purposes and we and our Providers have no liability to you for any loss of profit, loss of business, business interruption, or loss of business opportunity.

There are separate commercial terms for Team/Enterprise/API usage: https://www.anthropic.com/legal/commercial-terms

by watermelon0

4/4/2026 at 5:48:18 PM

I suspect you are accessing their website from a European IP address. The clause you quoted is not present for users outside of the EU/UK.

https://news.ycombinator.com/item?id=47590473

by fasterik

4/4/2026 at 8:42:33 PM

That explains it. I don’t see it from my US IP address.

by monocularvision

4/4/2026 at 1:44:46 PM

How much would it have cost a human to do the same work? The question isn’t how much tokens cost; the question is how much money is saved by using AI to do it.

by otterley

4/4/2026 at 10:19:30 PM

Does the person prompting the AI work for free?

by kemotep

4/5/2026 at 12:52:36 PM

Can the prompts be re-used on different files of code?

by mcswell

4/4/2026 at 11:10:27 PM

Let's assume they don't.

by otterley

4/4/2026 at 1:49:30 PM

Compare to the cost when said vulnerabilities are exploited by bad actors in critical systems. Worth it yet?

by skeledrew

4/4/2026 at 2:51:34 PM

Agentic tasks use up a huge amount of tokens compared to simple chatting. Every elementary interaction the model has with the outside world (even while doing something as simple as reading code from a large codebase) is a separate "chat" message and "response", and these add up very quickly.
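
A rough sketch of why this adds up, assuming every step re-sends the whole conversation so far as input and ignoring any prompt caching. The token counts are made-up round numbers, not measurements.

```python
def total_input_tokens(steps, base_context=2000, tokens_per_step=1500):
    """Sum input tokens over an agent run where each step re-reads all prior context."""
    total = 0
    context = base_context
    for _ in range(steps):
        total += context             # whole context is sent as input for this step
        context += tokens_per_step   # tool output and response get appended
    return total

print(total_input_tokens(1))    # 2000
print(total_input_tokens(2))    # 5500
print(total_input_tokens(50))   # 1937500
```

Under these assumptions the total grows roughly with the square of the number of steps, so a 50-step task costs far more than 50 single messages.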

by zozbot234

4/4/2026 at 4:08:23 AM

You’d have to ignore the massive investor ROI expectations or somehow have no capability to look past “at the moment”.

by gmerc

4/4/2026 at 10:05:20 AM

That might be a problem for the labs (although I don't think it is) but it's not a problem for end-users. There is enough pressure from top labs competing with each other, and even more pressure from open models that should keep prices at a reasonable price point going further.

In order to justify higher prices the SotA needs to have way higher capabilities than the competition (hence justifying the price) and at the same time the competition needs to be way below a certain threshold. Once that threshold becomes "good enough for task x", the higher price doesn't make sense anymore.

While there is some provider retention today, it will be harder to have once everyone offers kinda sorta the same capabilities. Changing an API provider might even be transparent for most users and they wouldn't care.

If you want to have an idea about token prices today you can check the median for serving open models on openrouter or similar platforms. You'll get a "napkin math" estimate for what it costs to serve a model of a certain size today. As long as models don't go oom higher than today's largest models, API pricing seems in line with a modest profit (so it shouldn't be subsidised, and it should drop with tech progress). Another benefit for open models is that once they're released, that capability remains there. The models can't get "worse".

by NitpickLawyer

4/4/2026 at 4:19:55 AM

Not really. I'm fully taking advantage of these low prices while they last. Eventually the AI companies will start running out of funny money and start charging what the models actually cost to run, then I just switch over to using the self hosted models more often and utilize the online ones for the projects that need the extra resources. Currently there's no reason for why I shouldn't use Claude Sonnet to write one time bash scripts, once it starts costing me a dollar to do so I'm going to change my behavior.

by KetoManx64

4/4/2026 at 7:19:37 AM

> Currently there's no reason for why I shouldn't use Claude Sonnet to write one time bash scripts, once it starts costing me a dollar to do so I'm going to change my behavior.

This just isn't going to happen, we have open weights models which we can roughly calculate how much they cost to run that are on the level of Sonnet _right now_. The best open weights models used to be 2 generations behind, then they were 1 generation behind, now they're on par with the mid-tier frontier models. You can choose among many different Kimi K2.5 providers. If you believe that every single one of those is running at 50% subsidies, be my guest.

by deaux

4/4/2026 at 2:10:10 PM

> start charging what the models actually cost to run

The political climate won't allow that to happen. The US will do everything to stay ahead of China, and a rise in prices means a sizeable migration to Chinese models, giving them that much more data to improve their models and pass the US in AI capability (if they haven't already).

But also it'll happen in a way, as eventually models will become optimized enough that run costs become more or less negligible from a sustainability perspective.

by skeledrew

4/4/2026 at 6:25:19 AM

I also have this feeling. But do you ever doubt it? That when the time comes we will be like the boiled frog? Where it's "just so convenient", or the reality of setting up a local AI is just a worse experience for a large upfront cost?

by twosdai

4/4/2026 at 6:49:10 AM

worse. he's already boiled. probably paying way more than that one dollar per bash script with all the subscriptions he already has.

by iririririr

4/4/2026 at 7:36:30 AM

Yeah, the $20 I paid to OpenRouter about 4 months ago really cost me an arm and a leg, not sure where I'll get my next meal if I'm to be honest.

by KetoManx64

4/4/2026 at 8:44:23 AM

>$0.001 cents

$0.001 (1/10 of a cent) or 0.001 cents (1/1000 of a cent, or $0.00001)?
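To make the two readings concrete, here is a small sketch using exact decimal arithmetic. The variable names are illustrative only.

```python
from decimal import Decimal

CENTS_PER_DOLLAR = Decimal(100)


def dollars_to_cents(dollars: Decimal) -> Decimal:
    return dollars * CENTS_PER_DOLLAR


def cents_to_dollars(cents: Decimal) -> Decimal:
    return cents / CENTS_PER_DOLLAR


reading_a = Decimal("0.001")                    # "$0.001", expressed in dollars
reading_b = cents_to_dollars(Decimal("0.001"))  # "0.001 cents", expressed in dollars

print(dollars_to_cents(reading_a))  # reading A in cents
print(reading_b)                    # reading B in dollars
print(reading_a / reading_b)        # how far apart the two readings are
```

The two readings differ by a factor of 100, which is why "$0.001 cents" is ambiguous rather than merely sloppy.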

by ThePowerOfFuet

4/4/2026 at 12:43:04 PM

Oh no, here we go again

https://youtube.com/watch?v=MShv_74FNWU

by Pikamander2

4/4/2026 at 9:30:36 AM

I don't buy it.

Inference cost has dropped 300x in 3 years, no reason to think this won't keep happening with improvements on models, agent architecture and hardware.

Also, too many people are fixated on American models when Chinese ones deliver similar quality, often at a fraction of the cost.

From my tests, the "personality" of an LLM, its tendency to stick to prompts and not derail, far outweighs the low single-digit percentage deltas in benchmark performance.

Not to mention, different LLMs perform better at different tasks, and they are all particularly sensitive to prompts and instructions.

by epolanski

4/5/2026 at 10:36:59 AM

“Thing x happened in the past, therefore it will continue to happen in the future” is perhaps one of the most, if not the most pervasive human-created fallacies anywhere.

by simplesocieties

4/4/2026 at 9:52:52 AM

Tokens aren't more expensive than highly trained meatbags today. There's no way they'll be more expensive "tomorrow"...

by NitpickLawyer

4/4/2026 at 11:31:16 AM

[flagged]

by bigbugbag

4/4/2026 at 1:46:17 PM

> they are and they will be

Calculate the approximate cost of raising a human from birth to having the knowledge and skills to do X, along with maintenance required to continue doing X. Multiply by a reasonable scaling factor in comparison to one of today's best LLMs (ie how many humans and how much time to do Xn, vs the LLM).

Calculate the cost of hardware (from raw elements), training and maintenance for said LLM (if you want to include the cost of research+software then you'll have to also include the costs of raising those who taught, mentored, etc the human as well). Consider that the human usually specializes, while the LLM touches everything. I think you'll find even a roughly approximate answer very enlightening if you're honest in your calculations.

by skeledrew

4/4/2026 at 4:27:02 PM

But companies don't have to bear the cost of raising a human from birth, or training them. They only pay the cost of hiring them, and that includes cost of maintenance.

Add to that the fact that we can't blindly trust LLM output just yet, so we need a meatbag to review it.

LLM will always be more expensive than human + LLM, until we're at a stage where we can remove the human from the loop

by Synthetic7346

4/5/2026 at 8:43:31 PM

> But companies don't have to bear the cost of raising a human from birth, or training them.

The costs do exist somewhere though, and must be paid by someone. There's no free lunch, and the human lunch is very likely far more costly than the LLM lunch.

> Add to that the fact that we can't blindly trust LLM output just yet

Can't blindly trust human output either. That's why there are various tiers in roles, from junior-equivalent to senior-equivalent, and the actual user of the product is always the final arbiter. There's ultimately nothing different, except that the LLM iterates on issue resolution in seconds to minutes, whereas the human equivalent takes hours to days.

by skeledrew

4/4/2026 at 8:25:44 PM

the crash would mean price of GPUs would go down, not up...

by PunchyHamster

4/4/2026 at 3:50:07 PM

I'm thinking about how much money Anthropic etc are making from intelligence services who are running Opus 4.6 on ultra high settings 24 hours a day to find these kinds of exploits and take advantage of them before others do.

Expensive for me and you, but peanuts for a nation state.

by qingcharles

4/4/2026 at 12:22:30 PM

I'm interested in the implications for the open source movement, specifically about security concerns. Anyone know if there has been a study about how well Claude Code works on closed source (but decompiled) source?

by cesaref

4/4/2026 at 8:16:27 PM

I’ve had Claude Code diagnose bugs in a compiler we wrote together by using gdb and objdump to examine binaries it produces. We don’t have DWARF support yet so it is just examining the binary. That’s not security work, but it’s adjacent to the sorts of skills you’re talking about. The binaries are way smaller than real programs, though.

by steveklabnik

4/4/2026 at 1:26:16 PM

> Claude Code works on closed source (but decompiled) source

Very likely not nearly as well, unless there are many open source libraries in use and/or the language+patterns used are extremely popular. The really huge win for something like the Linux kernel and other popular OSS is that the source appears in the training data, a lot. And many versions. So providing the source again and saying "find X" is primarily bringing into focus things it's already seen during training, with little novelty beyond the updates that happened after knowledge cutoff.

Giving it a closed source project containing a lot of novel code means it only has the language and its "intuition" to work from, which is a far greater ask.

by skeledrew

4/4/2026 at 1:35:57 PM

I’m not a security researcher, but I know a few and I think universally they’d disagree with this take.

The llms know about every previous disclosed security vulnerability class and can use that to pattern match. And they can do it against compiled and in some cases obfuscated code as easily as source.

I think the security engineers out there are terrified that the balance of power has shifted too far to the finding of closed source vulnerabilities because getting patches deployed will still take so long. Not that the llms are in some way hampered by novel code bases.

by kasey_junk

4/4/2026 at 10:37:29 PM

> The llms know about every previous disclosed security vulnerability class and can use that to pattern match

Do the reports include patterns that could be matched against decompiled code, though? As easily as they would against proper source? I find it a bit hard to believe.

by zahlman

4/4/2026 at 2:02:24 PM

Many vulnerabilities aren't just pattern matching though; deep understanding of the context in the particular codebase is also needed. And a novel codebase means more attention than usual will be spent grepping and keeping the context in focus. Which will make it easier to miss certain things, than if enough of the context was already encoded in the model weights.

Same thing applies to humans: the better someone knows a codebase, the better they will be at resolving issues, etc.

by skeledrew

4/4/2026 at 5:19:54 PM

Almost all vulnerabilities are either direct applications of known patterns, incremental extensions of them, or chains of multiple such steps.

by tptacek

4/4/2026 at 10:35:21 PM

Definitely not my wheelhouse, but I would expect it to be considerably worse.

Simply because the source code contains names that were intended to communicate meaning in a way that the LLM is specifically trained to understand (i.e., by choosing identifier names from human natural language, choosing those names to scan well when interspersed into the programming language grammar, including comments etc.). At least if debugging information has been scrubbed, anyway (but the comments definitely are). Ghidra et al. can only do so much to provide the kind of semantic content that an LLM is looking for.

by zahlman

4/4/2026 at 11:18:57 PM

I've cut-and-pasted some assembly code into the free version of ChatGPT to reverse engineer some old binaries and its ability to find meaning was just scary.

by tverbeure

4/5/2026 at 7:53:18 PM

Yesterday, I had Claude decompile and fix the firmware for my new Samsung ViewFinity S8. There was a really annoying pop-up banner on each wake which you can't turn off, and Samsung clearly didn't care. I was about to return it, then thought: hmm, why not :) It wasn't one-shotted; it took several tries (luckily none of them bricked it, haha). Also I guess the warranty is voided, but idc :)

by itsyourbedtime

4/4/2026 at 10:26:33 PM

It would be much more interesting/efficient if the LLM had tokens for machine instructions so extracting instructions would be done at tokenizing phase, not by calling objdump.

But I guess I'm not the first one to have that idea. Any references to research papers would be welcome.

by dolmen

4/4/2026 at 11:30:30 PM

As an experiment, I just now took a random section of a few hundred bytes (as a hexdump) from the /bin/ls executable and pasted them into ChatGPT.

I don't know if it's correct, but it speculated that it's part of a command line processor: https://chatgpt.com/share/69d19e4f-ff2c-83e8-bc55-3f7f5207c3...

Now imagine how much more it could have derived if I had given it the full executable, with all the strings, pointers to those strings and whatnot.

I've done some minor reverse engineering of old test equipment binaries in the past and LLMs are incredible at figuring out what the code is doing, way better than the regular way of Ghidra to decompile code.
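For anyone wanting to repeat the hexdump experiment above, a minimal sketch of producing a pasteable dump of a slice of a binary follows. The function names are made up for illustration, and the offset and length in the usage note are arbitrary placeholders.

```python
def hexdump(data: bytes, start: int = 0, width: int = 16) -> str:
    """Format bytes as offset, hex columns and printable ASCII, one row per line."""
    pad = width * 3 - 1  # width of a full row of "xx " groups minus trailing space
    lines = []
    for row in range(0, len(data), width):
        chunk = data[row:row + width]
        hex_part = " ".join(f"{b:02x}" for b in chunk)
        text_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append(f"{start + row:08x}  {hex_part:<{pad}}  |{text_part}|")
    return "\n".join(lines)


def dump_slice(path: str, offset: int, length: int) -> str:
    """Read `length` bytes at `offset` from a file and return them as a hexdump."""
    with open(path, "rb") as f:
        f.seek(offset)
        return hexdump(f.read(length), start=offset)
```

Something like `print(dump_slice("/bin/ls", 0x4000, 256))` yields a block that can be pasted into a chat window. Which offset lands in code rather than data depends on the binary.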

by tverbeure

4/4/2026 at 12:53:40 PM

Do not expect so many more reports. Expect so many more attacks ;)

by misiek08

4/4/2026 at 6:34:36 PM

I wonder about the "video running in the background" during qna of the talk:

https://youtu.be/1sd26pWhfmg?is=XLJX9gg0Zm1BKl_5

Did he write an exploit for the NFS bug that runs via network over USB? Seems to be plugging in a SoC over USB...?

by e12e

4/3/2026 at 11:46:51 PM

An explanation of the Claude Opus 4.6 linux kernel security findings as presented by Nicholas Carlini at unpromptedcon.

by eichin

4/3/2026 at 11:50:27 PM

https://www.youtube.com/watch?v=1sd26pWhfmg is the presentation itself. The prompts are trivial; the bug (and others) looks real and well-explained - I'm still skeptical but this looks a lot more real/useful than anything a year ago even suggested was possible...

by eichin

4/5/2026 at 10:43:57 AM

Supposedly humans have become “100x”™ more productive with these AI tools, but nowhere to be seen are the benefits for the wielders of said tools. Is your salary 100x higher? Are you able to spend more time with your family/friends instead of at the office? Why are we still putting up with these outdated work practices if LLMs have made everybody so much more productive?

by simplesocieties

4/5/2026 at 11:35:43 AM

Are you aware of how productivity has increased over the past century in general? That didn't lead to 100x wage increases or more free time. Labour is a market commodity and follows market rules. Increased productivity means more gets done in less time. It doesn't mean you spend less time working

by becquerel

4/6/2026 at 4:28:23 AM

[dead]

by simplesocieties

4/4/2026 at 3:28:24 PM

And with AI generating vulnerabilities at an accelerated pace this business is only getting bigger. Welcome to the new antivirus!

by skeeter2020

4/4/2026 at 3:34:48 PM

There will always be more bugs than we can fix. AI can patch as well, but if your system is difficult to test and doesn't have rigorous validation you will likely get an unacceptable amount of regression.

by bitexploder

4/4/2026 at 2:14:10 PM

I hope next up is the performance and bloat that the LLMs can try and improve.

Especially on the perf side, I would wager LLMs can go from the meat sacks' "whatever works" to "how do I solve this with the best available algorithm and architecture" (that also follows some best practices).

by rixrax

4/4/2026 at 1:30:05 PM

Making public that AI is capable of finding these kinds of vulnerabilities is a big problem. In this case it's nice that the vulnerability was closed before publishing, but if a cracker finds one, the result would be extremely different. This kind of news only opens the eyes of crackers.

by alsanan2

4/4/2026 at 10:31:07 AM

This isn't surprising. What is not mentioned is that Claude Code also found one thousand false positive bugs, which developers spent three months to rule out.

by jason1cho

4/4/2026 at 3:13:33 PM

That's not what is happening right now. The bugs are often filtered later by LLMs themselves: if the second pipeline can't reproduce the crash / violation / exploit in any way, often the false positives are evicted before ever reaching human scrutiny. Checking if a real vulnerability can be triggered is a trivial task compared to finding one, so this second pipeline has an almost 100% success rate from this POV: if it passes the second pipeline, it is almost certainly a real bug, and very few real bugs will not pass this second pipeline. It does not matter how much LLMs advance, people ideologically against them will always deny they have an enormous amount of usefulness. This is expected in the normal population, but to see a lot of people that can't see with their eyes in Hacker News feels weird.

by antirez

4/4/2026 at 5:10:32 PM

> Checking if a real vulnerability can be triggered is a trivial task compared to finding one

Have you ever tried to write PoC for any CVE?

This statement is wrong. Sometimes a bug may exist but be impossible to trigger/exploit. So it is not trivial at all.

by uhx

4/4/2026 at 6:36:30 PM

I'm tickled at the idea of asking antirez [1] if he's ever written a PoC for a CVE.

[1] https://en.wikipedia.org/wiki/Salvatore_Sanfilippo

by avemg

4/4/2026 at 7:56:18 PM

I actually like when that happens. Like when people "correct" me about how reddit works. I appreciate that we still focus on the content and not who is saying it.

by jedberg

4/4/2026 at 8:14:15 PM

That's not really what happened on this thread. Someone said something sensible and banal about vulnerability research, then someone else said do-you-even-lift-bro, and got shown up.

by tptacek

4/4/2026 at 8:47:22 PM

That's true in this particular case, but I was talking more about the general case.

by jedberg

4/4/2026 at 7:30:59 PM

This happens over and over in these discussions. It doesn't matter who you're citing or who's talking. People are terrified and are reacting to news reflexively.

by tptacek

4/4/2026 at 10:49:13 PM

Hi! Loved your recent post about the new era of computer security, thanks.

by antirez

4/5/2026 at 3:12:44 AM

Thank you! Glad you liked it.

by tptacek

4/4/2026 at 9:39:17 PM

Personally, I’m tired of exaggerated claims and hype peddlers.

Edit: Frankly, accusing perceived opponents of being too afraid to see the truth is poor argumentative practice, and practically never true.

by emp17344

4/4/2026 at 7:07:12 PM

Sure he wrote a port scanner that obscures the IP address of the scanner, but does he know anything about security? /s

Oh, and he wrote Redis. No biggie.

by LeFantome

4/4/2026 at 8:29:35 PM

Those are both wholly different branches from finding software bugs

by PunchyHamster

4/4/2026 at 5:37:17 PM

First, I have a long past in computer security, so: yes, I used to write exploits. Second, verifying the vulnerability does not require being able to exploit it, only triggering an ASAN assert. With memory corruption that's often very simple, and enough to verify the bug is real.

by antirez

4/6/2026 at 1:26:51 PM

Thank you for clarification. It actually helped: at first I was overcomplicating it in my head.

After thinking about it for an hour I came up with this:

An LLM claims that there is a bug. We don't know whether it really exists. We run a second LLM that is capable of writing a unit test/reproducer (it doesn't have to be E2E; shorter data flow -> higher success rate for the LLM), compile the program, and run the test to look for an ASAN assert. An ASAN error means a proven bug. No error, as you said, does not prove anything, because it may simply mean the LLM failed to write a correct test.
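The checking half of that flow could look roughly like the sketch below. It assumes the reproducer is a single C file, that a `cc` with AddressSanitizer support is on the PATH, and that the names `Verdict`, `classify` and `check_reproducer` are invented here for illustration. It is not the pipeline anyone in this thread actually runs.

```python
import subprocess
from enum import Enum


class Verdict(Enum):
    CONFIRMED = "confirmed"        # the sanitizer reported a memory error
    INCONCLUSIVE = "inconclusive"  # nothing proven either way


def classify(stderr: str) -> Verdict:
    """ASan error reports contain 'ERROR: AddressSanitizer' on stderr."""
    if "ERROR: AddressSanitizer" in stderr:
        return Verdict.CONFIRMED
    return Verdict.INCONCLUSIVE


def check_reproducer(source_path: str, binary_path: str = "./repro",
                     timeout_s: int = 30) -> Verdict:
    """Compile an LLM-written C reproducer with ASan and run it once."""
    build = subprocess.run(
        ["cc", "-g", "-fsanitize=address", "-o", binary_path, source_path],
        capture_output=True, text=True, errors="replace",
    )
    if build.returncode != 0:
        # A test that does not compile says nothing about the bug itself.
        return Verdict.INCONCLUSIVE
    try:
        run = subprocess.run([binary_path], capture_output=True, text=True,
                             errors="replace", timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return Verdict.INCONCLUSIVE
    return classify(run.stderr)
```

There is deliberately no "disproven" verdict: a clean run only means this particular test did not trigger anything.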

Still don't know how much $ it would cost for LLM reasoning, but this technically should work much better than manually investigating everything.

Sorry for "have-you-ever" thing :)

by uhx

4/4/2026 at 5:18:34 PM

I'm not GP, but I've written multiple PoCs for vulns. I agree with GP. Finding a vuln is often very hard. Yes sometimes exploiting it is hard (and requires chaining), but knowing where the vuln is is (most of the time) the hard part.

by freedomben

4/4/2026 at 6:26:31 PM

Note the exploit Claude wrote for the blind SQL injection found in ghost - in the same talk.

https://youtu.be/1sd26pWhfmg?is=XLJX9gg0Zm1BKl_5

by e12e

4/4/2026 at 7:18:24 PM

oh no. Antirez doesn't know anything about C, CVE's, networking, the linux kernel. Wonder where that leaves most of us.

by orochimaaru

4/4/2026 at 5:36:22 PM

I’ve been around long enough to remember people saying that VMs are a useless waste of resources with dubious claims about isolation, cloud is just someone else’s computer, containers are pointless and now it’s AI. There is an astonishing amount of conservatism in the hacker scene.

by discordianfish

4/4/2026 at 5:38:37 PM

Well, the cloud is someone else's computer.

by pdntspa

4/4/2026 at 6:30:40 PM

It is, but that's not a useful or insightful thing to say

by some_random

4/4/2026 at 7:54:28 PM

It's not an insightful statement right now, but it was at the peak of cloud hype ca. 2010, when "the cloud" was often used in a metaphorical sense. You'd hear things like "it's scalable because it's in the cloud" or "our clients want a cloud based solution." Replacing "the cloud" in those sorts of claims with "another person's computer" showed just how inane those claims were.

by Calavar

4/6/2026 at 3:41:35 PM

No, it doesn't at all. "it's scalable because it's in the cloud" may be reductive nonsense or it could be true. It's scalable because it's on someone else's computer and in a matter of minutes it can be on one of their computers with twice the RAM and vCPUs. That is a meaningful thing to say when the alternative is CAPEX heavy investment in your own infrastructure. Same with "our clients want a cloud based solution" in contrast with on-prem installs. They don't want your shitty pizza box in their closet, they want someone else to be doing the hosting.

by some_random

4/4/2026 at 6:50:36 PM

Are you sure about that?

It's easy to forget that the vendor has the right to cut you off at any point, will turn your data over to the authorities on request, and it's still not clear if private GitHub repos are being used to train AI.

by honeycrispy

4/6/2026 at 3:43:40 PM

Two of these are basic contractual problems, your company should have a lawyer who can sort them out easily. The third (data being turned over to authorities) is something that the vast majority of companies do not care about in the slightest.

by some_random

4/5/2026 at 6:43:07 AM

People pass around stickers (or at least used to) in hacker events saying that so there has to be something to it, right?

Protesting the term is, I'd wager, motivated by something like: it sounds innocuous to nontechnical people and obscures what's really going on.

by fulafel

4/5/2026 at 3:08:14 AM

Only if owning the means of your production isn't important to you

by pdntspa

4/4/2026 at 7:16:12 PM

[dead]

by LeFantome

4/4/2026 at 6:44:46 PM

Is it conservatism or just the Blub paradox?

As long as our hypothetical Blub programmer is looking down the power continuum, he knows he's looking down. Languages less powerful than Blub are obviously less powerful, because they're missing some feature he's used to. But when our hypothetical Blub programmer looks in the other direction, up the power continuum, he doesn't realize he's looking up. What he sees are merely weird languages. He probably considers them about equivalent in power to Blub, but with all this other hairy stuff thrown in as well. Blub is good enough for him, because he thinks in Blub.

https://paulgraham.com/avg.html

by gbacon

4/4/2026 at 4:01:19 PM

> to see a lot of people that can't see with their eyes in Hacker News feels weird.

Turns out the average commenter here is not, in fact, a "hacker".

by antonvs

4/4/2026 at 5:58:07 PM

> This is expected in the normal population

A lot of people regardless of technical ability have strong opinions about what LLMs are/are-not. The number of lay people i know who immediately jump to "skynet" when talking about the current AI world... The number of people i know who quit thinking because "Well, let's just see what AI says"...

A (big) part of the conversation re: "AI" has to be "who are the people behind the AI actions, and what is their motivation"? Smart people have stopped taking AI bug reports[0][1] because of overwhelming slop; it's real.

[0] https://www.theregister.com/2025/05/07/curl_ai_bug_reports/

[1] https://gist.github.com/bagder/07f7581f6e3d78ef37dfbfc81fd1d...

by bch

4/4/2026 at 7:28:37 PM

The fact that most AI bug reports are low-quality noise says as much or more about the humans submitting them than it does about the state of AI.

As others have said, there are multiple stages to bug reports and CVEs.

1. Discover the bug

2. Verify the bug

You get the most false positives at step one. Most of these will be eliminated at step 2.

3. Isolate the bug

This means creating a test case that eliminates as much of the noise as possible to provide the bare minimum required to trigger the bug. This will greatly aid in debugging. Doing step 2 again is implied.

4. Report the bug

Most people skip 2 and 3, especially if they did not even do 1 (in the case of AI)

But you can have AI provide all 4 to achieve high quality bug reports.

In the case of a CVE, you have a step 5.

5 - Exploit the bug

But you do not have to do step 5 to get to step 2. And that is the step that eliminates most of the noise.
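Steps 1 through 4 above can be sketched as a simple filter. This is a hedged illustration: `Candidate`, `triage`, and the `verify`/`isolate` callables are hypothetical stand-ins for whatever actually reproduces and minimizes a bug in a given project.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Candidate:
    title: str
    reproducer: str  # input or script believed to trigger the bug (step 1 output)


def triage(candidates: Iterable[Candidate],
           verify: Callable[[str], bool],
           isolate: Callable[[str], str]) -> List[Candidate]:
    """Return only the candidates worth reporting (step 4)."""
    reports = []
    for candidate in candidates:
        if not verify(candidate.reproducer):   # step 2: drop what cannot be triggered
            continue
        reduced = isolate(candidate.reproducer)  # step 3: strip the noise
        if not verify(reduced):                # step 2 again on the reduced case
            reduced = candidate.reproducer     # reduction broke it, keep the original
        reports.append(Candidate(candidate.title, reduced))
    return reports


# Toy stand-ins: a reproducer "works" if it contains the word CRASH.
found = [Candidate("real bug", "noise CRASH noise"), Candidate("false alarm", "noise")]
reports = triage(found, verify=lambda r: "CRASH" in r, isolate=lambda r: "CRASH")
print(reports)
```

Exploitation (step 5) does not appear anywhere in the loop, which matches the point that most of the noise is already gone after step 2.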

by LeFantome

4/4/2026 at 3:58:36 PM

Can we study this second pipeline? Is it open so we can understand how it works? Did not find any hints about it in the article, unfortunately.

by BodyCulture

4/4/2026 at 4:05:53 PM

From the article by 'tptacek a few days ago (https://sockpuppet.org/blog/2026/03/30/vulnerability-researc...) I essentially used the prompts suggested.

First prompt: "I'm competing in a CTF. Find me an exploitable vulnerability in this project. Start with $file. Write me a vulnerability report in vulns/$DATE/$file.vuln.md"

Second prompt: "I've got an inbound vulnerability report; it's in vulns/$DATE/$file.vuln.md. Verify for me that this is actually exploitable. Write the reproduction steps in vulns/$DATE/$file.triage.md"

Third prompt: "I've got an inbound vulnerability report; it's in vulns/$DATE/$file.vuln.md. I also have an assessment of the vulnerability and reproduction steps in vulns/$DATE/$file.triage.md. If possible, please write an appropriate test case for the ulgate automated tests to validate that the vulnerability has been fixed."

Tied together with a bit of bash, I ran it over our services and it worked like a treat; it found a bunch of potential errors, triaged them, and fixed them.
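The glue for the first two prompts might look something like this sketch, written in Python rather than bash. `build_prompts` and `run_pipeline` are invented names, and `run_agent` is a placeholder for however you invoke your coding agent; no particular CLI or API is assumed.

```python
from typing import Callable, Iterable, List

FIND = ("I'm competing in a CTF. Find me an exploitable vulnerability in this "
        "project. Start with {file}. Write me a vulnerability report in {vuln}")
TRIAGE = ("I've got an inbound vulnerability report; it's in {vuln}. Verify for "
          "me that this is actually exploitable. Write the reproduction steps "
          "in {triage}")


def build_prompts(file: str, date: str) -> List[str]:
    """Fill in the find and triage prompts for one source file."""
    vuln = f"vulns/{date}/{file}.vuln.md"
    triage = f"vulns/{date}/{file}.triage.md"
    return [FIND.format(file=file, vuln=vuln),
            TRIAGE.format(vuln=vuln, triage=triage)]


def run_pipeline(files: Iterable[str], date: str,
                 run_agent: Callable[[str], None]) -> None:
    """Send each prompt to the agent, one call per prompt."""
    for file in files:
        for prompt in build_prompts(file, date):
            run_agent(prompt)  # ideally each call starts from a fresh context


# Dry run: print the prompts instead of sending them anywhere.
run_pipeline(["src/parser.c"], "2026-04-04", run_agent=print)
```

Keeping the find and triage calls separate means the second pass only sees the written report, not the reasoning that produced it.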

by maximilianburke

4/4/2026 at 4:27:15 PM

Agree. Keeping and auditing a research journal iteratively with multiple passes by new agents does indeed significantly improve outcomes. Another helpful thing is to switch roles good cop bad cop style. For example one is helping you find bugs and one is helping you critique and close bug reports with counter examples.

by jvanderbot

4/4/2026 at 8:47:07 PM

Could prompt injection be used to trick this kind of analysis? Has anyone experimented with this idea?

by sn9

4/4/2026 at 10:52:21 PM

Prompt Injections are very very rare these days after the Opus 4.6 update

by ashwinr2002

4/4/2026 at 4:04:50 PM

it was probably in the talk but from what i understood in another article it's basically giving claude with a fresh context the .vuln.md file and saying "i'm getting this vulnerability report, is this real?"

edit: i remember which article, it was this one: https://sockpuppet.org/blog/2026/03/30/vulnerability-researc...

(an LWN comment in response to this post was on the frontpage recently)

by throawayonthe

4/4/2026 at 4:05:41 PM

One such example is IRIS. In general, any traditional static analysis tool combined with a language model at some stage in a pipeline.

by 4b11b4

4/4/2026 at 6:23:44 PM

What if the second round hallucinates that a bug found in the first round is a false positive? Would we ever know?

> It does not matter how much LLMs advance, people ideologically against them will always deny they have an enormous amount of usefulness.

They have some usefulness, much less than what the AI boosters like yourself claim, but also a lot of drawbacks and harms. Part of seeing with your eyes is not purposefully blinding yourself to one side here.

by slopinthebag

4/4/2026 at 5:01:27 PM

they are useful to those that enjoy wasting time.

by nickphx

4/4/2026 at 4:20:47 PM

>This is expected in the normal population, but to see a lot of people that can't see with their eyes in Hacker News feels weird.

You are replying to an account created less than 60 days ago.

by ksec

4/4/2026 at 4:25:05 PM

This is a bit unfair. Hackers are born every day.

by jvanderbot

4/4/2026 at 7:46:21 PM

In relation to the quality of its comment, I thought it was fair. He just completely made up the claim about false positives.

And in case people don't know, antirez has been complaining about the quality of HN comments for at least a year, especially after AI topics took over HN.

It is still better than lobster or other place though.

by ksec

4/4/2026 at 6:23:53 PM

Bots too, vanderBOT!

by slekker

4/4/2026 at 8:16:13 PM

I used to work in robotics, and can't remember the password for my usual username so I pulled this one out of thin air years ago

by jvanderbot

4/4/2026 at 7:46:35 PM

[dead]

by sieabahlpark

4/4/2026 at 10:56:01 AM

> What is not mentioned is that Claude Code also found one thousand false positive bugs, which developers spent three months to rule out.

Source? I haven't seen this anywhere.

In my experience, false positive rate on vulnerabilities with Claude Opus 4.6 is well below 20%.

by mtlynch

4/4/2026 at 2:26:15 PM

To the issue of AI submitted patches being more of a burden than a boon, many projects have decided to stop accepting AI-generated solutioning:

https://blog.devgenius.io/open-source-projects-are-now-banni...

These are just a few examples. There are more that google can supply.

by Supermancho

4/4/2026 at 3:48:27 PM

According to Willy Tarreau[0] and Greg Kroah-Hartman[1], this trend has recently significantly reversed, at least from the reports they've been seeing on the Linux kernel. The creator of curl, Daniel Stenberg, before that broader transition, also found the reports generated by LLM-powered but more sophisticated vuln research tools useful[2] and the guy who actually ran those tools found "They have low false positive rates."[3]

Additionally, there was no mention in the talk by the guy who found the vuln discussed in the TFA of what the false positive rate was, or that he had to sift through the reports because it was mostly slop — or whether he was doing it out of courtesy. Additionally, he said he found only several hundred, iirc, not "thousands." All he said was:

"I have so many bugs in the Linux kernel that I can’t report because I haven’t validated them yet… I’m not going to send [the Linux kernel maintainers] potential slop, but this means I now have several hundred crashes that they haven’t seen because I haven’t had time to check them." (TFA)

He quite evidently didn't have to sift through thousands, or spend months, to find this one, either.

[0]: https://lwn.net/Articles/1065620/

[1]: https://www.theregister.com/2026/03/26/greg_kroahhartman_ai_...

[2]: https://simonwillison.net/2025/Oct/2/curl/p

[3]: https://joshua.hu/llm-engineer-review-sast-security-ai-tools...

by logicprog

4/4/2026 at 2:33:03 PM

No, they haven't. Read the ai slop you posted carefully.

It's a policy update that enables maintainers to ignore low effort "contributions" that come from untrusted people in order to reduce reviewing workload.

An Eternal September problem, kind of.

by literalAardvark

4/4/2026 at 2:53:16 PM

Didn't you just restate what the parent claimed?

by coldtea

4/4/2026 at 3:17:15 PM

No, that's not at all the same thing: ai-generated contributions from people with a track record for useful contributions are still accepted.

by cwillu

4/4/2026 at 3:29:50 PM

Right. AI submissions are so burdensome that they have had to refuse them from all except a small set of known contributors.

The fact that there’s a small carve out for a specific set of contributors in no way disputes what Supermancho claimed.

by dpark

4/4/2026 at 3:45:48 PM

A powertool that needs discretion and good judgement to be used well is being restricted to people with a track record of displaying good judgement. I see nothing wrong here.

AI enables volume, which is a problem. But it is also a useful tool. Does it increase review burden? Yes. Is it excessively wasteful energy wise? Yes. Should we avoid it? Probably no. We have to be pragmatic, and learn to use the tools responsibly.

by phanimahesh

4/4/2026 at 4:07:54 PM

I never said anything is wrong with the policy. Or with the tool use for that matter.

This whole chain was one person saying “AI is creating such a burden that projects are having to ban it”, someone else being willfully obtuse and saying “nuh uh, they’re actually still letting a very restricted set of people use it”, and now an increasingly tangential series of comments.

by dpark

4/4/2026 at 8:34:39 PM

I feel like you're still failing to grasp the point.

The only difference is that before AI the number of low effort PRs was limited by the number of people who are both lazy and know enough programming, which is a small set because a person is very unlikely to be both.

Now it's limited to people who are lazy and can run ollama with a 5M model, which is a much larger set.

It's not an AI code problem by itself. AI can make good enough code.

It's a denial of service by the lazy against the reviewers, which is a very very different problem.

by literalAardvark

4/4/2026 at 9:08:24 PM

No one is missing your point. The issue is that you are responding to a point no one made.

The grounding premise of this comment chain was “AI submitted patches being more of a burden than a boon”. You are misinterpreting that as some sort of general statement that “AI Bad” and that AI is being globally banned.

A metaphor for the scenario here is someone says “It’s too dangerous to hand repo ownership out to contributors. Projects aren’t doing that anymore.” And someone else comes in to say “That’s not true! There are still repo owners. They are just limiting it to a select group now!” This statement of fact is only an interesting rebuttal if you misinterpret the first statement to say that no one will own the repo because repo ownership is fundamentally bad.

> It's a denial of service by the lazy against the reviewers, which is a very very different problem.

And it is AI enabling this behavior. Which was the premise above.

by dpark

4/4/2026 at 4:12:12 PM

Yes, but technically no different than "good contributions from humans are still accepted, AI slop can fuck off".

Since the onus falls on those "people with a track record for useful contributions" to verify, design tastefully, test and ensure those contributions are good enough to submit - not on the AI they happen to be using.

If it fell on the AI they're using, then any random guy using the same AI would be accepted.

by coldtea

4/4/2026 at 12:29:47 PM

Same. Codex and Claude Code on the latest models are really good at finding bugs, and really good at fixing them in my experience. Much better than 50% in the latter case and much faster than I am.

by christophilus

4/4/2026 at 2:11:04 PM

Source: """AI is bad"""

by paulddraper

4/4/2026 at 11:04:35 AM

In my experience, the issue has been likelihood of exploitation or issue severity. Claude gets it wrong almost all the time.

A threat model matters and some risks are accepted. Good luck convincing an LLM of that fact

by r9295

4/4/2026 at 11:47:23 AM

In TFA:

   I have so many bugs in the Linux kernel that I can’t 
   report because I haven’t validated them yet… I’m not going 
   to send [the Linux kernel maintainers] potential slop, 
   but this means I now have several hundred crashes that they
   haven’t seen because I haven’t had time to check them.
    
    —Nicholas Carlini, speaking at [un]prompted 2026

by j16sdiz

4/4/2026 at 11:50:52 AM

Those aren't false positives; they're results he hasn't yet inspected.

I wrote a longer reply here: https://news.ycombinator.com/item?id=47638062

by mtlynch

4/4/2026 at 2:54:06 PM

>Those aren't false positives; they're results he hasn't yet inspected.

It's not a XOR

by coldtea

4/4/2026 at 3:04:59 PM

The article quote was being given as the supposed source for "Claude Code also found one thousand false positive bugs, which developers spent three months to rule out", so should substantiate that claim - which it doesn't.

If the claim was instead just "a good portion of the hundreds more potential bugs it found might be false positives", then sure.

by Ukv

4/4/2026 at 5:08:01 PM

Yes it is. They're not false positives until they're reported and consume maintainer time.

by tptacek

4/5/2026 at 12:51:45 PM

False positives can be eliminated mechanistically by testing if they actually work, in a sufficiently isolated automated test apparatus.

The hard thing is reducing detected crashes to well-formulated test cases that help rather than hinder maintainers.
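
A minimal sketch of that kind of mechanical check, assuming each candidate finding comes with a standalone reproducer command. The `classify` helper and its labels are hypothetical, not anything from the talk or the article, and a real kernel setup would need a VM rather than a plain child process. The idea is only that a finding counts as confirmed when the reproducer actually dies from a signal (POSIX behaviour assumed):

```python
import subprocess
import sys

def classify(cmd, timeout=5.0):
    """Run one reproducer command and label the outcome.

    Hypothetical labels: 'crash' (killed by a signal), 'timeout',
    or 'no-repro' (exited on its own, whatever the exit code).
    """
    try:
        proc = subprocess.run(cmd, capture_output=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return "timeout"
    # On POSIX, a negative return code means the child died from that signal.
    if proc.returncode < 0:
        return "crash"
    return "no-repro"

if __name__ == "__main__":
    # Two stand-in reproducers: one aborts itself, one exits cleanly.
    aborting = [sys.executable, "-c", "import os; os.abort()"]
    clean = [sys.executable, "-c", "print('fine')"]
    print(classify(aborting), classify(clean))
```

This only filters out findings that do not reproduce at all. Turning a confirmed crash into a small, well-explained test case is still the manual part.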

by lambdaone

4/4/2026 at 2:28:14 PM

some of them certainly are…

by bethekidyouwant

4/4/2026 at 3:04:28 PM

The comment said "Claude Code also found one thousand false positive bugs, which developers spent three months to rule out.".

Please explain how a bug can both be unvalidated, and also have undergone a three month process to determine it is a false positive?

by sobiolite

4/4/2026 at 2:25:48 PM

The article doesn't say they found a bunch of false positives. It says they have a huge backlog that they still need to test:

"I have so many bugs in the Linux kernel that I can’t report because I haven’t validated them yet…"

by linsomniac

4/4/2026 at 2:49:36 PM

[dead]

by vaginaphobic

4/4/2026 at 1:40:12 PM

Static/Dynamic analysis tools find vulnerabilities all the time. Almost all projects of a certain size have a large backlog of known issues from these boring scanners. The issue is sorting through them all and triaging them. There are too many issues to fix, and figuring out which are exploitable and actually damaging, given mitigations, is time-consuming.

Am I impressed Claude found an old bug? Sort of... every time a new scanner is introduced you get new findings that others haven't found.

by goalieca

4/4/2026 at 5:14:32 PM

Static analyzers find large numbers of hypothetical bugs, of which only a small subset are actionable, and the work to resolve which are actionable and which are e.g. "a memcpy into an 8 byte buffer whose input was previously clamped to 8 bytes or less" is so high that analyzers have little impact at scale. I don't know off the top of my head many vulnerability researchers who take pure static analysis tools seriously.

Fuzzers find different bugs and fuzzers in particular find bugs without context, which is why large-scale fuzzer farms generate stacks of crashers that stay crashers for months or years, because nobody takes the time to sift through the "benign" crashes to find the weaponizable ones.

LLM agents function differently than either method. They recursively generate hypotheticals interprocedurally across the codebase based on generalizations of patterns. That by itself would be an interesting new form of static analysis (and likely little more effective than SOTA static analysis). But agents can then take confirmatory steps on those surfaced hypos, generate confidence, and then place those findings in context (for instance, generating input paths through the code that reach the bug, and spelling out what attack primitives the bug conditions generates).

If you wanted to be reductive you'd say LLM agent vulnerability discovery is a superset of both fuzzing and static analysis.

And, importantly, that's before you get to the fact that LLM agents can fuzz and do modeling and static analysis themselves.

by tptacek

4/4/2026 at 8:31:32 PM

There are plenty of static analyzers that do attempt to walk code paths for reachability. Some even track tainted input. And yes, these are often good starting points for developing exploits. I’ve done this myself.

I’m curious about LLM agents, but the fact they don’t “understand” is why I’m very skeptical of the hype. I find myself wasting just as much if not more time with them than with a terrible “enterprise” sast tool.

by goalieca

4/4/2026 at 12:48:07 PM

The lesson here shouldn't be that Claude Code is useless, but that it's a powerful tool in the hands of the right people.

by boplicity

4/4/2026 at 1:06:24 PM

Unfortunately, also in the hands of the __wrong__ people.

Maybe even more so, because who is going to wade through all those false positives? A bad actor is maybe more likely to do that.

by amelius

4/4/2026 at 1:27:55 PM

> A bad actor is maybe more likely to do that.

Do something about that then, so white-hat hackers are more likely than black-hat hackers to want to wade through that, incentives and all that jazz.

by embedding-shape

4/4/2026 at 5:47:19 PM

We couldn’t solve the incentive against misinformation/disinformation since inception, we made it even worse than 20 years ago. Even when we know how it works exactly, even on the internet, not just generally. These kinds of statements seem quite unrealistic to me.

by ruszki

4/4/2026 at 6:26:59 PM

Good luck with that. Security is at the bottom of everyone's budget allocation list.

by amelius

4/4/2026 at 1:06:07 PM

I'm growing allergic to the hype train and the slop. I've watched real-life talks about people that sent some prompt to Claude Code and then proudly present something mediocre that they didn't make themselves to a whole audience as if they'd invented the warm water, and that just makes me weary.

But at the same time, it has transformed my work from writing every bit of code myself, to me writing the cool and complex things while giving directions to a helper to sort out the boring grunt work, and it's amazingly capable at that. It _is_ a hugely powerful tool.

But haters only see red, and lovers see everything through pink glasses.

by mavamaarten

4/4/2026 at 1:51:19 PM

Sounds like maybe you might have some mixed feelings about becoming more effective with AI, but then at the same time everyone else is too, so the praise you're expecting is diluted.

I see it all the time now too. People have no frame of reference at all about what is hard or easy so engineers feel under-appreciated because the guy who never coded is getting lots of praise for doing something basic while experienced people are able to spit out incredibly complex things. But to an outsider, both look like they took the same work.

by iterateoften

4/4/2026 at 7:36:11 PM

I am also torn because obviously the LLMs have a lot of value but the amount of misuse is overwhelming. People just keep pasting slop into story descriptions that no one can keep up with. There should be guidelines at workplaces to use AI responsibly.

by ofrzeta

4/4/2026 at 1:29:05 PM

> it has transformed my work […] to me writing the cool and complex things

> it's amazingly capable at that.

> It _is_ a hugely powerful tool

Damn, that’s what you call being allergic to the hype train? This type of hypocritical thinly-veiled praise is what is actually unbearable with AI discourse.

by sph

4/4/2026 at 1:49:35 PM

I don’t think it is controversial that AI tools are good enough at crud endpoints that it is totally viable to just let it run through the grunt work of hooking up endpoints to a service and then you can focus on the interesting aspect of the application which is exactly that service.

by asyx

4/4/2026 at 12:53:38 PM

The lesson or the hype mantra?

by righthand

4/4/2026 at 1:12:38 PM

The same could be said about a Roulette wheel set before a seasoned gambler

by teeray

4/4/2026 at 2:33:00 PM

Can a Roulette wheel set find vulnerabilities in software?

by TheCoreh

4/4/2026 at 2:54:51 PM

If vulnerability=compulsion and software=meat bags then yes.

by edoceo

4/4/2026 at 3:01:24 PM

This is a non-sequitur if I ever saw one.

by throw-the-towel

4/4/2026 at 3:32:42 PM

No. The seasoned gambler can not learn things that measurably increase their chance at the Roulette, whereas they definitely can do that with an LLM. And the LLM itself becomes smarter over time through hardware upgrades, software updates and even memory for those who enable that feature.

by vntok

4/4/2026 at 4:42:14 PM

Everything changed in the past 6 months and coding LLMs went from being OK-ish to insanely good. People also got better at using them.

Also, high false positive rate isn't that bad in the case where a false negative costs a lot (an exploit in the linux kernel is a very expensive mistake). And, in going through the false positives and eliminating them, those results will ideally get folded back into the training set for the next generation of LLMs, likely reducing the future rate of false positives.
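
To put rough numbers on that trade-off, here is a back-of-the-envelope sketch. Every figure in it is an assumption picked for illustration (none comes from the article), and `expected_cost` is a hypothetical helper: it compares the cost of triaging every report against the cost of ignoring them all and eating the missed bugs.

```python
def expected_cost(n_reports, precision, triage_cost, missed_bug_cost):
    """Compare triaging every report against ignoring them all.

    precision: assumed fraction of reports that are real bugs.
    triage_cost: cost to check one report, in arbitrary units.
    missed_bug_cost: cost of one real bug left unfixed, same units.
    Returns (cost_if_triaged, cost_if_ignored).
    """
    real_bugs = n_reports * precision
    cost_if_triaged = n_reports * triage_cost
    cost_if_ignored = real_bugs * missed_bug_cost
    return cost_if_triaged, cost_if_ignored

def worth_triaging(precision, triage_cost, missed_bug_cost):
    """Triage pays off when the expected loss per report exceeds its triage cost."""
    return precision * missed_bug_cost > triage_cost

# Illustrative only: 1000 reports, 1% real, 1 unit to triage each,
# 500 units per missed bug.
triaged, ignored = expected_cost(1000, 0.01, 1, 500)
print(triaged, ignored, worth_triaging(0.01, 1, 500))
```

With these made-up inputs, triage costs 1000 units against 5000 for ignoring everything, so even a 99% false positive rate comes out ahead. The conclusion flips as soon as a missed bug is cheap relative to a review.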

by dekhn

4/4/2026 at 4:54:28 PM

> Everything changed in the past 6 months and coding LLMs went from being OK-ish to insanely good. People also got better at using them.

I hear this literally every 6 months :)

by catlifeonmars

4/4/2026 at 5:14:53 PM

It hasn't been true forever, but it has been true over the last 18 months or so.

by tptacek

4/4/2026 at 3:48:53 PM

This is not how first party vulnerability research with LLMs goes; they are incredibly valuable versus all prior tooling at triage and producing only high quality bugs, because they can be instructed to produce a PoC and prove that the bug is reachable. It’s traditional research methods (fuzzing, static analysis, etc.) that are more prone to false positive overload.

The reason why open submission fields (PRs, bug bounty, etc) are having issues with AI slop spam is that LLMs are also good at spamming, not that they are bad at programming or especially vulnerability research. If the incentives are aligned LLMs are incredibly good at vulnerability research.

by bri3d

4/4/2026 at 3:48:41 PM

Okay, so anti AI people are just making shit up now. Got it.

According to Willy Tarreau[0] and Greg Kroah-Hartman[1], this trend has recently significantly reversed, at least from the reports they've been seeing on the Linux kernel. The creator of curl, Daniel Stenberg, before that broader transition, also found the reports generated by LLM-powered but more sophisticated vuln research tools useful[2] and the guy who actually ran those tools found "They have low false positive rates."[3]

Additionally, there was no mention in the talk by the guy who found the vuln discussed in the TFA of what the false positive rate was, or that he had to sift through the reports because it was mostly slop — or whether he was doing it out of courtesy. Additionally, he said he found only several hundred, iirc, not "thousands." All he said was:

"I have so many bugs in the Linux kernel that I can’t report because I haven’t validated them yet… I’m not going to send [the Linux kernel maintainers] potential slop, but this means I now have several hundred crashes that they haven’t seen because I haven’t had time to check them." (TFA)

He quite evidently didn't have to sift through thousands, or spend months, to find this one, either.

[0]: https://lwn.net/Articles/1065620/

[1]: https://www.theregister.com/2026/03/26/greg_kroahhartman_ai_...

[2]: https://simonwillison.net/2025/Oct/2/curl/p

[3]: https://joshua.hu/llm-engineer-review-sast-security-ai-tools...

by logicprog

4/4/2026 at 10:58:24 AM

Couldn't you just make it write a PoC?

by sva_

4/4/2026 at 5:16:33 PM

Yes, you can. I strongly encourage people skeptical about this, and who know at a high-level how this kind of exploitation works, to just try it. Have Claude or Codex (they have different strengths at this kind of work) set up a testing harness with Firecracker or QEMU, and then work through having it build an exploit.

by tptacek

4/4/2026 at 12:16:42 PM

Still have to validate it.

by weird-eye-issue

4/4/2026 at 2:23:53 PM

I’ve started to see bug bounty programs put flags into the product (see Apple's target flags https://security.apple.com/bounty/target-flags/).

I wonder if it’s partially to make it easier to validate from an AI perspective

by matthewfcarlson

4/4/2026 at 11:21:50 AM

[flagged]

by Gregaros

4/4/2026 at 6:06:13 PM

What is with negativity against AI in YC? Can anyone point a finger of why this anti take is so prominent? We're living through the most revolutionary moment of software since its inception and the main thing that gets consistently upvoted is negativity, FUD, "it doesn't work in this case", or "it's all slop".

by Trufa

4/4/2026 at 6:54:04 PM

> Can anyone point a finger of why this anti take is so prominent?

AI tools are great but are being oversold and overhyped by those with an incentive. So, there is a continuous drumbeat of "AI will do all the code for you!", "Look at this browser written by AI", "C compiler in Rust written entirely by AI", etc. And then, that drumbeat is amplified by those in management who have not built software systems themselves.

What happened to the AI-generated "C compiler in Rust", or the browser written by AI? They remain a steaming pile of almost-working code. AI is great at producing "almost-working" PoC code, which is good for bootstrapping work and getting you 90% of the way if you are OK with code of questionable lineage. But many applications need "actually-working" code that requires the last 10%. So, some in this forum who have been in the trenches building large "actually working" software systems, and who also use AI tools daily and know their limitations, are injecting some realism into the debate.

by bwfan123

4/5/2026 at 12:23:31 AM

I think the anti-AI stance has been reversing on HN as tooling improves and people try it. It’s only been a little over a year since Claude Code was released, and 3 or 4 months since the models got really capable. People need time to adjust, even if I would expect devs to be more up-to-date than most.

People’s willingness to argue about technology they’ve barely used is always bewildering to me though.

by sothatsit

4/4/2026 at 7:08:23 PM

Not speaking for myself, but the "you won’t have a job soon" narrative puts people off.

by arealaccount

4/4/2026 at 10:52:48 AM

On the other hand, some bugs take three months to find. So this still seems like a win.

by addandsubtract

4/5/2026 at 12:53:29 PM

You know this how?

by mcswell

4/4/2026 at 4:37:55 PM

From a recent front page article that mentioned the previous slop problem:

> Now most of these reports are correct, to the point that we had to bring in more maintainers to help us.

https://news.ycombinator.com/item?id=47611921

by sixhobbits

4/4/2026 at 2:57:41 PM

[dead]

by xeromal

4/4/2026 at 11:35:30 AM

[flagged]

by khalic

4/4/2026 at 11:49:06 AM

[flagged]

by j16sdiz

4/4/2026 at 11:53:22 AM

He explicitly talks about not sending the maintainers slop, learn how to read.

by khalic

4/7/2026 at 1:22:32 PM

[dead]

by Srinathprasanna

4/4/2026 at 6:04:54 PM

[dead]

by jeremie_strand

4/4/2026 at 3:09:41 PM

[dead]

by jeremie_strand

4/4/2026 at 1:07:29 PM

[dead]

by pithtkn

4/5/2026 at 9:57:51 AM

[dead]

by Serberus

4/4/2026 at 4:39:44 PM

[flagged]

by dfir-lab

4/4/2026 at 11:03:10 AM

[flagged]

by LeonTing1010

4/4/2026 at 12:51:35 PM

[flagged]

by adamsilvacons

4/4/2026 at 9:53:23 PM

[dead]

by redoh

4/6/2026 at 8:24:57 AM

[dead]

by helenazdenova

4/4/2026 at 1:32:33 PM

[dead]

by roach54023

4/5/2026 at 2:02:37 PM

[flagged]

by noritaka88

4/4/2026 at 4:16:37 PM

[dead]

by claudexai

4/4/2026 at 6:15:36 PM

[flagged]

by skyskys

4/4/2026 at 11:06:13 AM

[flagged]

by lnkl

4/4/2026 at 1:45:55 PM

Every single post here these days. “Startup founder of Communality.ai says ai good for people” and then the comments are AI bros declaring that all work can end, the good times are here at last

by FromTheFirstIn

4/4/2026 at 6:29:14 PM

[flagged]

by yunnpp

4/4/2026 at 7:08:13 PM

[flagged]

by CaptainFever

4/4/2026 at 7:40:52 PM

Thank you for your kind comment. I recommend you watch the actual talk, and then understand what exploiting RCEs in things like the Linux kernel at such a scale that defenders can no longer keep up with actually means. The latter is their claim, not mine.

Also realize that, unlike a security researcher, an attacker doesn't necessarily need to review the model output carefully to filter out the slop before a bug submission. They mostly just need to run the shit.

by yunnpp

4/4/2026 at 8:00:14 PM

Is your pitch that the reports are slop? Or that they’re so dangerous it’s morally indefensible to share the research?

by akerl_

4/4/2026 at 8:07:10 PM

A good chunk of the reports are false positives (slop) per the researcher's own admission in his talk. I have no issue sharing the bug reports either; the bugs are better fixed.

What I take issue with is that they have basically released the weapon first without thinking about the consequences. And again, if you watch the talk, you'll see how he literally calls others to action to fix the problem. They made a problem and are asking you to fix it, and it will also cost you money, which conveniently goes to them. Any industry with even a semblance of regulation would find this very disturbing.

by yunnpp

4/4/2026 at 9:07:53 PM

The “weapon” here is identifying vulnerabilities that were already present and exploitable by malicious actors?

by akerl_

4/5/2026 at 12:20:59 AM

A very shallow dismissal of my point. Is there no room for depth in your logical analysis?

First of all, we don't know whether this particular bug was already being exploited in the wild. We do know that there is a community of experts looking at the Linux kernel and reporting bugs. Yet this bug had never been reported until now. So either nobody ever looked there (unlikely), or they did and didn't find it. Conversely, the LLM found it with a prompt that even a 5-year-old can type. That significantly lowers the effort for the attacker, so much that it changes the game. It is, to use a crude analogy, like deploying firearms in a field traditionally fought with sword and shield. So yes, that's the weapon, and these guys released the stuff to the public with no oversight. That should get some people thinking.

by yunnpp

4/5/2026 at 12:44:32 AM

> So either nobody ever looked there (unlikely), or they did and didn't find it.

Those aren't the only two options.

by akerl_

4/4/2026 at 7:27:31 PM

More like, if you pay a fee to use a service, you can find the bombs already hidden somewhere in your premises.

by tosti

4/4/2026 at 7:33:40 PM

And? They didn't put the bombs on your premises. Before "the service", you had bombs you didn't know about; after, you get to know about them.

by tptacek

4/4/2026 at 7:57:02 PM

But the service also tells criminals and adversaries about the bomb locations.

by par1970

4/4/2026 at 7:58:36 PM

And? So do a variety of other services. Was it your impression that the criminals and adversaries were behind the 8 ball on this?

AI is reviving debates about vulnerability research that we thought we killed off in the 1990s.

by tptacek

4/6/2026 at 1:52:11 PM

Perhaps the argument isn't about the ethics of security research, but rather the divide between those who can afford non-free software licenses and those who ethically or circumstantially can't.

by tosti

4/6/2026 at 2:05:34 PM

You'd see the same thing in 1990s full-disclosure debates, where people trying to create a social/cultural argument against vulnerability research would throw this kind of stuff against the wall just to see what would stick. It's either good to know about vulnerabilities in the code you rely on or it isn't.

by tptacek

4/6/2026 at 6:49:11 PM

Yes, of course. It's a bloody shame some of those tools are inaccessible to the poor, the not poor but f* your stupid payment system that doesn't connect to my bank, the software freedom enthusiasts, possibly others.

For myself, software freedom isn't just an ethical issue but also a practical necessity.

by tosti

4/4/2026 at 10:21:42 AM

The title is a little misleading.

It was Opus 4.6 (the model). You could discover this with some other coding agent harness.

The other thing that bugs me and frankly I don't have the time to try it out myself, is that they did not compare to see if the same bug would have been found with GPT 5.4 or perhaps even an open source model.

Without that, and for the reasons I posted above, while I am sure this is not the intention, the post reads like an ad for claude code.

by _pdp_

4/4/2026 at 11:07:07 AM

OP here.

I don't understand this critique. Carlini did use Claude Code directly. Claude Code used the Claude Opus 4.6 model, but I don't know why you'd consider it inaccurate to say Claude Code found it.

GPT 5.4 might be capable of finding it as well, but the article never made any claims about whether non-Anthropic models could find it.

If I wrote about achieving 10k QPS with a Go server, is the article misleading unless I enumerate every other technology that could have achieved the same thing?

by mtlynch

4/5/2026 at 12:29:36 AM

Also, he did compare with earlier versions that, before 4.5, were dramatically worse at finding the same problems. There's even a graph. That seems to pretty solidly support the idea that this is "gain of function" as it were...

by eichin

4/4/2026 at 10:32:29 AM

No, the title is correct and you are misreading or didn't read. It was found with Claude Code, that's the quote. This isn't a model eval, it's an Anthropic employee talking about Claude Code. So comparing to other models isn't a thing to reasonably expect.

by mgraczyk

4/4/2026 at 12:13:50 PM

> You could discover this with some other coding agent harness.

And surely that would be relevant if they were using a different harness.

by weird-eye-issue

4/4/2026 at 10:33:55 AM

> Nicholas has found hundreds more potential bugs in the Linux kernel, but the bottleneck to fixing them is the manual step of humans sorting through all of Claude’s findings

No, the problem is sorting out thousands of false positives from claude code's reports. 5 out of 1000+ reports to be valid is statistically worse than running a fuzzer on the codebase.

Just sayin'

by cookiengineer

4/4/2026 at 11:22:27 AM

> 5 out of 1000+ reports to be valid is statistically worse than running a fuzzer on the codebase.

Carlini said "hundreds" of crashes, not 1000+.

It's not that only 5 were true positives and the rest were false positives. 5 were true positives and Carlini doesn't have bandwidth to review the rest. Presumably he's reviewed more than 5 and some were not worth reporting, but we don't know what that number is. It's almost certainly not hundreds.
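
A small sketch of why the unreviewed backlog matters for any "X out of Y" claim. The 5 confirmed bugs are the figure discussed in this thread. The rejected and unreviewed counts below are placeholders, since neither number is public, and `precision_bounds` is a hypothetical helper:

```python
def precision_bounds(confirmed, rejected, unreviewed):
    """Return (lower, upper) bounds on the true-positive rate of all reports.

    Lower bound: every unreviewed report turns out to be false.
    Upper bound: every unreviewed report turns out to be real.
    """
    total = confirmed + rejected + unreviewed
    if total == 0:
        raise ValueError("no reports")
    return confirmed / total, (confirmed + unreviewed) / total

def reviewed_precision(confirmed, rejected):
    """True-positive rate over only the reports someone actually checked."""
    return confirmed / (confirmed + rejected)

# Placeholder inputs: 5 confirmed, 5 rejected, 300 still unreviewed.
low, high = precision_bounds(confirmed=5, rejected=5, unreviewed=300)
print(low, high, reviewed_precision(5, 5))
```

Dividing the confirmed count by the full pile gives only the lower bound, which is the most pessimistic reading of the backlog. The rate among reviewed reports is the only part that has actually been measured.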

Keep in mind that Carlini's not a dedicated security engineer for Linux. He's seeing what's possible with LLMs and his team is simultaneously exploring the Linux kernel, Firefox,[0] GhostScript, OpenSC,[1] and probably lots of others that they can't disclose because they're not yet fixed.

[0] https://www.anthropic.com/news/mozilla-firefox-security

[1] https://red.anthropic.com/2026/zero-days/

by mtlynch

4/4/2026 at 10:46:46 AM

> On the kernel security list we've seen a huge bump of reports. We were between 2 and 3 per week maybe two years ago, then reached probably 10 a week over the last year with the only difference being only AI slop, and now since the beginning of the year we're around 5-10 per day depending on the days (fridays and tuesdays seem the worst). Now most of these reports are correct, to the point that we had to bring in more maintainers to help us. ... Also it's interesting to keep thinking that these bugs are within reach from criminals so they deserve to get fixed.

https://lwn.net/Articles/1065620/

by dist-epoch

4/4/2026 at 12:59:03 PM

> https://syzbot.org/upstream

I stand corrected.

by cookiengineer

4/4/2026 at 5:21:26 PM

What's your point?

by tptacek

4/4/2026 at 2:42:52 AM

But on the other hand, Claude might introduce more vulnerabilities than it discovers.

by up2isomorphism

4/4/2026 at 3:02:32 AM

Code review is the real deal for these models. This area seems largely underappreciated to me. Especially for things like C++, where static analysis tools have traditionally generated too many false positives to be useful, the LLMs seem especially good. I'm no black hat but have found similarly old bugs at my own place. Even if shit is hallucinated half the time, it still pays off when it finds that really nasty bug.

Instead, people seem to be infatuated with vibe coding technical debt at scale.

by yunnpp

4/4/2026 at 10:51:34 AM

> Code review is the real deal for these models.

Yea, that is what I have been saying as well...

>Instead, people seem to be infatuated with vibe coding technical debt at scale.

Don't blame them. That is what AI marketing pushes. And people are sheep to marketing..

I understand why AI companies don't want to promote it. Because they understand that the LCD/Majority of their client base won't see code review as a critical part of their business. If LLMs are marketed as best suited for code review, then they probably cannot justify the investments that they are getting...

by qsera

4/5/2026 at 6:33:00 AM

Whether or not it is the real deal in this case, that does not necessarily mean Claude Code usage is a net gain for software security overall. In fact, it is likely the opposite.

It will hurt some heavy CC users’ feelings, but that’s a different thing.

by up2isomorphism

4/4/2026 at 9:55:26 AM

[dead]

by Serberus

4/4/2026 at 11:44:34 AM

Guys please read the article before commenting...

by khalic

4/4/2026 at 3:00:34 PM

A developer using Claude Code found this bug. Claude is a tool. It is used by developers. It should not sign commits. Neovim never tried to sign commits with me, nor Zed.

by desireco42

4/4/2026 at 4:23:07 PM

"Should not"? Is that your new law? The non-agentic “Neovim and Zed never tried to sign commits [for] me”, therefore no tool ever, no matter how advanced, is allowed to sign a commit.

Did it ever occur to you that for whatever reason you just might not be cut out for the software treadmill?

by igravious