Hardening Firefox with Claude Mythos Preview

5/7/2026 at 9:16:17 PM

Again, and this is important:

A bug is a bug. A “potential vulnerability” is a bug. A vulnerability is verifiable as having security implications with a proof of concept or other substantial evidence.

Words matter. Bugs matter. It’s important to fix large amounts of bugs, just as it always has been, and has been done. Let that be impressive on its own, because it IS impressive.

Mythos didn’t write 271 PoC for vulnerabilities and demonstrate code path reachability with security implications. Mythos found 271 valid bugs. Let that be enough.

by jerrythegerbil

5/7/2026 at 9:58:18 PM

I was a bit confused by your definitions, but here's how Mozilla broke out [1] the 271, um, things:

> As additional context, we apply security severity ratings from critical to low to indicate the urgency of a bug:

> * sec-critical and sec-high are assigned to vulnerabilities that can be triggered with normal user behavior, like browsing to a web page. We make no technical difference between these, but sec-critical bugs are reserved for issues that are publicly disclosed or known to be exploited in the wild.

> * sec-moderate is assigned to vulnerabilities that would otherwise be rated sec-high but require unusual and complex steps from the victim.

> * sec-low is assigned to bugs that are annoying but far from causing user harm (e.g, a safe crash).

> Of the 271 bugs we announced for Firefox 150: 180 were sec-high, 80 were sec-moderate, and 11 were sec-low.

Mozilla uses the term "vulnerability" for even sec-high, even though they say right below that it doesn't mean the same thing as a practical exploit. And on their definitional page, they classify even sec-low as "vulnerabilities" [2].

Words are tools, that get their utility from collective meaning. I'd be interested where you recieved your semantics from and if they match up or disagree with Mozilla.

[1] https://hacks.mozilla.org/2026/05/behind-the-scenes-hardenin...

[2] https://wiki.mozilla.org/Security_Severity_Ratings/Client

by epistasis

5/7/2026 at 10:47:49 PM

I work at Mozilla; I fixed a bunch of these bugs.

In general, I would say that our use of "vulnerability" lines up with what jerrythegerbil calls "potential vulnerability". (In cases with a POC, we would likely use the word "exploit".) Our goal is to keep Firefox secure. Once it's clear that a particular bug might be exploitable, it's usually not worth a lot of engineering effort to investigate further; we just fix it. We spend a little while eyeballing things for the purpose of sorting into sec-high, sec-moderate, etc, and to help triage incoming bugs, but if there's any real question, we assume the worst and move on.

So were all 271 bugs exploitable? Absolutely not. But they were all security bugs according to the normal standards that we've been applying for years.

(Partial exception: there were some bugs that might normally have been opened up, but were kept hidden because Mythos wasn't public information yet. But those bugs would have been marked sec-other, and not included in the count.)

So if you think we're guilty of inflating the number of "real" vulnerabilities found by Mythos, bear in mind that we've also been consistently inflating the baseline. The spike in the Firefox Security Fixes by Month graph is very, very real: https://hacks.mozilla.org/2026/05/behind-the-scenes-hardenin...

by IainIreland

5/7/2026 at 10:52:23 PM

What types of vulnerabilities was it finding? Cross site scripting, privilege escalation, etc? Mostly memory corruption or any Javascript logic bugs?

by paulvnickerson

5/7/2026 at 11:12:56 PM

I work on SpiderMonkey, so I mostly looked at the JS bugs. It was a smorgasbord of various things. Broadly speaking I'd say the most impressive bugs were TOCTOU issues, where we checked something and later acted on it, and the testcase found a clever way to invalidate the result of the check in between.

If you look closely at, say, this patch, you might get a sense of what I mean (although the real cleverness is in the testcase, which we have not made public): https://hg-edge.mozilla.org/integration/autoland/rev/c29515d...

by IainIreland

5/7/2026 at 11:17:47 PM

> although the real cleverness is in the testcase, which we have not made public

What is the point of keeping it private? I'd bet feeding this patch to Opus and asking to look for specific TOCTOU issue fixed by the patch will make it come up with a testcase sooner or later.

by reisse

5/7/2026 at 11:57:11 PM

The same is also true of a good security researcher, and has been for a long time. The question is mostly whether it takes long enough to come up with a testcase that we've managed to ship the fix to all affected releases, and given people some time to update. (And maybe LLMs do change the calculus there! We'll have to wait and see.)

by IainIreland

5/7/2026 at 11:31:35 PM

Possibly! One of the many areas that might need rethinking in the age of AI (that started in February of this year) is how long security bugs should be hidden. We live in interesting times.

by mccr8

5/8/2026 at 10:19:11 AM

Given the commit is 4 weeks old, will it eventually get comments?

The code before the patch does not look obviously wrong. Now, some more lines were added, but would you now say it now looks less obviously wrong, or more obviously correct?

It seems that the invariants needed here are either in some person's heads, or in some document that is not referenced.

Reading the code for the first time, the immediate question is: "What other lines might be missing? How can I figure?"

If the "obviously correct" level of the code does not increase for a human reviewer, how is it ensured that a similar problem will not arise in the future? Or do we need more LLM to tell us which other lines need to be added?

by nh2

5/8/2026 at 3:10:11 PM

Yeah, the test with the patch also adds comments. The human reviewer had extra context available.

I did get Opus to do an audit for similar problems elsewhere, to supplement the investigations that we were already doing by hand. It initially thought it found something, but when asked to produce a testcase, it thought for 20 minutes and admitted defeat. I suspect that the difference between Opus and Mythos is in small edges like this: if Mythos is smart enough to spot why Opus's discovery didn't work a little bit faster, and it can waste less time chasing down red herrings, then it's more likely to find a real bug within the limits of a context window. It's not that Opus completely lacks some capability, it's that it has trouble chaining all the pieces together consistently.

by IainIreland

5/9/2026 at 5:41:46 PM

Can't remember when I last heard TOCTOU term being used. Nice :)

by seebeen

5/7/2026 at 11:18:30 PM

Very cool, thank you.

by paulvnickerson

5/7/2026 at 11:29:32 PM

I'd say it leans towards memory corruption kinds of issues, as those are easiest to pass the validator, thanks to AddressSanitizer. I think there's a lot of potential for making the validator more sophisticated. Like maybe you add a JS function that will only crash when run in the parent process and have a validator that checks for that specific crash, as a way for the LLM to "prove" that it managed to run arbitrary JS in the parent. Would that turn up subtler issues? Maybe.

by mccr8

5/8/2026 at 1:24:27 PM

You may not be able to comment, but do you feel like Mythos is accomplishing anything that couldn't have already been done with Opus and the right prompting?

I've assumed I could send an agent using a publicly available model bug hunting in a codebase like this and get tons of results, assuming I wanted to burn the tokens, so it's really unclear to me whether the Mythos hype is justified or if it's just an easy button (and subsidized tokens?) to do what is already possible.

by JeremyNT

5/8/2026 at 3:38:33 PM

I never got direct access to Mythos, so all I know is what I've seen from the quality of the bugs being produced. I also haven't been involved at the prompting end.

So the best answer I can give is: I dunno, maybe it's possible to find bugs like this using Opus, but if so, where are they? Did nobody think to try "please find the bug in this code" pre-Mythos? I've done enough auditing with Opus to be convinced that it can be a good assistant to somebody who already knows what they're doing, but in practice the big wave of AI-discovered bugs started with Mythos.

I'm sure lots of people have assumed they could send a publicly available model bug hunting and find things. I have not noticed a huge amount of success. We've had some very nice correctness bugs reported, but skimming through the list of security bugs I've fixed recently, the AI-related ones all seem to be Mythos.

My best guess is that Mythos is just enough better along just enough axes that its hit rate on finding potential bugs and filtering out the real ones from the hallucinations is good enough to matter. Like, there's no obvious qualitative difference between 3.6kg of uranium-232 and 3.8 kg of uranium-232, just a small quantitative increase. But if you form both of them into spheres, only one of them has reached critical mass. Can you do something clever to reach critical mass with 3.6kg of uranium? Maybe! But needing to do something clever is a non-trivial barrier in itself.

by IainIreland

5/8/2026 at 4:47:21 PM

I did some experiments and Opus seemed pretty able to wire up a harness to find bugs and write PoC + patch for each. It's still a lot of work to get fixes upstreamed from outside so I think even if outsiders have better tools (Mythos etc) it won't change the report rate much, people may find more bugs but they won't report them. I suspect that's part of the calculation of the phased rollout for Mythos, finding bugs is already not the bottleneck.

by hedgehog

5/8/2026 at 5:52:21 PM

> It's still a lot of work to get fixes upstreamed from outside

I'm going to disagree in the specific case of Firefox. First, although it has diverged a long way from its roots, Mozilla still has the community project ideal in its DNA. Enough, at least, that I stumbled while reading the clause "from outside" -- if you're finding and reporting actual relevant security bugs, you're already on the inside. SpiderMonkey in particular still has a good amount of code being written and even maintained by non-employees. (Examples: Temporal and LoongArch64 JIT support).

Second, the bug bounty program still exists[0] and is being used. If someone were sitting on a pile of AI-discovered exploits, then it has monetary value which is rapidly draining away the longer they aren't reported.[1] That's incentive to put in the work to report them properly.

Third, I agree that finding bugs is likely not the bottleneck. Validating them is. With previous models, the false positive rate was too high so they required too much work to whittle down to the valid ones. A PoC is a very strong signal that a bug is valid, and that's where I just don't believe you: without a really good harness, I don't think Opus was good enough to find very many bugs with PoCs. It could find some, just not very many.[2]

[0] For now. It remains to be seen how it will adapt to the AI age. For the moment, it hasn't been severely nerfed like Google's.

[1] One could make the argument that people who are inexpert enough to only be able to poke an AI to find bugs are also the people more likely to sell them on the black market rather than disclosing them. It seems plausible. Still, some people would still be disclosing, and not many were filing quality bugs pre-Mythos. Some were, but it was a trickle compared to post-Mythos.

[2] Also note that I personally, as a SpiderMonkey developer, don't find a huge amount of value in the AI-generated patches that accompany these bug reports. Sometimes they're useful to better illustrate the problem, especially since the AI's problem analysis is usually subtly wrong in important ways. They can be a decent starting point for a real patch. But I'll still need to go through my own process of figuring out what the right fix is, even in the handful of cases where I end up with the same thing the AI did.

by sfink

5/10/2026 at 8:55:36 AM

Hi! First of all, thanks for your incredibly thoughtful and enlightening answers, and most of all for helping keep Firefox alive.

You said:

> Still, some people would still be disclosing, and not many were filing quality bugs pre-Mythos. Some were, but it was a trickle compared to post-Mythos.

How much of this could be just due to focus? i.e. prior to the partnership with Anthropic to test Mythos Preview, has there ever been a similarly focused project, specifically trying to find security bugs in Firefox?

by cassianoleal

5/11/2026 at 7:36:39 PM

That's a fair point, given the restrictions on Mythos and now Opus 4.7. I'm kind of comparing apples and oranges.

There are two things mixed together here. There is targeted scanning that was done by both Anthropic and Mozilla employees, using first Opus and then Mythos. Then there are other non-employee security researchers using AI to find and file bugs, motivated mostly by bug bounties.

The researchers were filing a steady trickle of bugs presumably using Opus 4.6. (Or rather, I saw a steady trickle after other people triaged them; I imagine the incoming stream was a lot busier.) My impression is that those have mostly dried up now. That could be the bias in my sample (I only see a slice of incoming bugs, so my anecdata aren't that strong), or a result of the restrictions added to the generally available models, or a result of there being less to find now that we've fixed so many of the issues found by company-backed bughunts. Or a combination of all three.

I guess my opinion is mostly driven by the difference in the quality and magnitude of bugs coming in from the company-backed scans pre- and post-Mythos. With Opus, there was an initial rush, but then it mostly died down. (For our group. For other groups, it was a series of waves that they never quite made it over before the next one came crashing in.) With Mythos, it was a larger wave and the quality of the bugs was higher. Two quantitative differences that ended up feeling like a qualitative change. So it's my underinformed personal opinion, but to me it feels like: yes, you could continue to find more bugs using a roughly Opus 4.6-strength model, but not that many and not cheaply, and the success rate is going to depend a lot on the harness. In comparison, I don't think we've seen the end of the Mythos wave, and my sense is that Mythos requires much less in the way of a harness.

It feels like the bitter lesson is playing itself out again, which I kinda hate because I want human ingenuity and cleverness to make an important difference, even after the next model has seen what the humans are coming up with.

by sfink

5/11/2026 at 8:04:52 PM

That makes sense, thanks for taking the time to write this up!

by cassianoleal

5/8/2026 at 5:54:15 PM

We have a bounty program. If you can find security bugs in Firefox, please let us pay you for them. You don't need to provide a fix; a testcase that crashes in an interesting way is often enough to qualify.

https://www.mozilla.org/en-US/security/client-bug-bounty/

by IainIreland

5/8/2026 at 5:18:35 PM

> I suspect that's part of the calculation of the phased rollout for Mythos, finding bugs is already not the bottleneck.

I was wondering this too. By working directly with tech companies and (one assumes) subsidizing tokens, they're empowering the people on the inside who absolutely want to have the bugs fixed.

Who outside of Mozilla is going to pay and spend the effort to find Firefox bugs? Sure some hobbyists and contributors might, but they don't have the institutional knowledge of the codebase which can help guide an agent prompts, nor do they have strong incentives to try and report them, nor do they necessarily have the time to craft good bug reports that stand out from the slop reports.

My assumption would be that most people working to discover bugs this way in Firefox are interested in using them rather than getting them fixed, so maintainers wouldn't necessarily even know the degree to which it was already happening.

by JeremyNT

5/8/2026 at 7:51:14 PM

The incentive is that Mozilla will pay you thousands of dollars if you find a security bug: https://www.mozilla.org/en-US/security/client-bug-bounty/

We have many outside contributors who have successfully submitted security bugs and received payments.

by mccr8

5/7/2026 at 10:51:59 PM

I'm not a security dev or researcher or anything, but as an outsider my understanding matches how Mozilla uses the terms. Though words used by specialists and the general public can offer differ...

by epistasis

5/7/2026 at 11:34:52 PM

Can you elaborate why those bugs weren't found by e.g. fuzzing in the past?

I'm genuinely curious what "types" of implementation mistakes these were, like whether e.g. it was library usage bugs, state management bugs, control flow bugs etc.

Would love to see a writeup about these findings, maybe Mythos hinted us towards that better fuzzing tools are needed?

by cookiengineer

5/8/2026 at 12:21:49 AM

If I had to guess, I'd say that AI is better at finding TOCTOU bugs than fuzzing because it starts by looking at the code and trying to find problems with it, which naturally leads it to experiment with questions like "is there any way to make this assumption false?", whereas fuzzing is more brute force. Fuzzing can explore way more possible states, but AI is better at picking good ones.

In this particular sense, AI tends to find bugs that are closer to what we'd see from a human researcher reading the code. Fuzz bugs are often more "here's a seemingly innocuous sequence of statements that randomly happen to collide three corner cases in an unexpected way".

Outside of SpiderMonkey, my understanding is that many of the best vulnerabilities were in code that is difficult to fuzz effectively for whatever reason.

by IainIreland

5/8/2026 at 12:12:48 AM

Fuzzing isn't good at things like dealing with code behind a CRC check, whereas the audit based approach using an LLMs can see the sketchy code, then calculate the CRC itself to come up with a test case. I think you end up having to write custom fuzzing harnesses to get at the vulnerable parts of the code. (This is an example from a talk by somebody at Anthropic.)

That being said, I think there's a lot of potential for synergy here: if LLMs make writing code easier, that includes fuzzers, so maybe fuzzers will also end up finding a lot more bugs. I saw somebody on Twitter say they used an LLM to write a fuzzer for Chrome and found a number of security bugs that they reported.

by mccr8

5/8/2026 at 3:31:33 AM

How about this: a "vulnerability" is a "vulnerability", but after it was identified and verified to cause problem, that's when it should be called a "bug", because it could make the software do unwanted things.

by nirui

5/8/2026 at 5:36:32 AM

At Mozilla, everything is called a bug. It's what other systems call an "issue". So it's too late for your terminology at Mozilla. (Example: I have a bug to improve the HTML output of my static analysis tool. There is nothing incorrect or flawed about the current output.)

At Mozilla, but not everywhere: exploits are a subset of vulnerabilities are a subset of bugs.

by sfink

5/8/2026 at 10:59:13 AM

Fwiw i think this is right. A bug is anything that doesn't do what you want it to do, and nobody should want a vulnerability in their software

by freedomben

5/8/2026 at 4:10:15 PM

That's why they created Bugzilla :) (https://www.bugzilla.org/blog/2023/08/26/bugzilla-celebrates...)

by darkwater

5/8/2026 at 9:20:06 AM

When I worked at Mozilla, _everything_ was called a bug, whether it was a software issue, a problem in the office or some paperwork missing.

Much as GitHub calls everything an "issue" and GitLab a "work item".

by Yoric

5/10/2026 at 9:00:21 AM

How is a vulnerability not a bug? Surely you didn’t design the software to enable exploits on your users, right?

Vulnerabilities are a special class of bugs. One that’s generally so important to fix that they get a special name and more attention.

That doesn’t make them less of a bug. It makes them more of one.

by cassianoleal

5/7/2026 at 10:31:21 PM

Presumably there are (implicit?) "sec-none" things, like [a] from the recently released 150.0.2 [b] which makes absolutely zero mention about "Security Impact" or "Severity" in the bug report, unlike [c], which is listed in the Mozilla weblog post [2].

Security things are mentioned in the Release Notes [b] pointing to a completely different document [d].

Perhaps sometimes a bug is 'just' a bug, and not a vulnerability.

[a] https://bugzilla.mozilla.org/show_bug.cgi?id=2034980 ; "Can't highlight image scans in Firefox 150+"

[b] https://www.firefox.com/en-CA/firefox/150.0.2/releasenotes/

[c] https://bugzilla.mozilla.org/show_bug.cgi?id=2024918

[d] https://www.mozilla.org/en-US/security/advisories/mfsa2026-4...

by throw0101c

5/7/2026 at 10:30:02 PM

> Mozilla uses the term "vulnerability" for even sec-high, even though they say right below that it doesn't mean the same thing as a practical exploit.

That’s not evident in what you pastedat all.

What you pasted says

> sec-critical and sec-high are assigned to vulnerabilities that can be triggered with normal user behavior […] We make no technical difference between these […] sec-critical bugs are reserved for issues that are publicly disclosed or known to be exploited in the wild.

> sec-low is assigned to bugs that are annoying but far from causing user harm (e.g, a safe crash).

From this one infers that the "180 were sec-high" bugs found are actually exploitsble but known to have been found in the wild, and are NOT mere annoying bugs.

The difference between 180 and 270 does nothing to deflate the signicance, or lack there of, of the implication re: Mythos.

by Gregaros

5/7/2026 at 10:44:18 PM

Yes, it is not in what I pasted, as I said, "even though they say right below". If you don't believe me then click on either of the links.

by epistasis

5/7/2026 at 11:19:40 PM

Mythos did in fact write PoCs for all bugs that crash with demonstration of memory-unsafe behavior (e.g. use-after-free, out-of-bounds reads/writes, etc).

For us this is substantial enough evidence to consider it a security vulnerability at that point, unless shown otherwise and it has always been this way (also for fuzzing bugs).

by mozdeco

5/8/2026 at 4:13:17 AM

> Mythos did in fact write PoCs for all bugs that crash with demonstration of memory-unsafe behavior (e.g. use-after-free, out-of-bounds reads/writes, etc).

But report [1] says that "Some of these bugs showed evidence of memory corruption...", which implies that majority of these (which includes 271 bugs from Mythos) don't have evidence at all. Do I not understand something?

> For us this is substantial enough evidence to consider it a security vulnerability at that point

Mythos is supposed to be pretty good at writing actual exploits, so (as I understand) there shouldn't be any serious problems with checking if bug is vulnerability or not.

[1] https://www.mozilla.org/en-US/security/advisories/mfsa2026-3...

by ZrArm

5/8/2026 at 12:36:47 PM

> But report [1] says that "Some of these bugs showed evidence of memory corruption...", which implies that majority of these (which includes 271 bugs from Mythos) don't have evidence at all. Do I not understand something?

This is just the standard sentence we've been using for years. It has nothing to do with Mythos and for Mythos, almost all bugs show evidence of memory corruption (we do have a handful of bugs in JS IPC / JS Actors, one is in the blog post).

> Mythos is supposed to be pretty good at writing actual exploits, so (as I understand) there shouldn't be any serious problems with checking if bug is vulnerability or not.

Yes but if we have a choice between writing exploits and scanning more source, potentially finding more bugs, then of course we prioritize the latter.

by mozdeco

5/8/2026 at 5:56:00 AM

I'm guessing a bit, but for example: out of bounds reads are not memory corruption. Assertion failures in debug builds are also usually not memory corruption, and I'd guess that many of these bugs were found through assertions. (Some parts of Firefox like the SpiderMonkey JS engine make heavy use of assertions, and that's the biggest signal used for defect validation. An assertion firing is almost always treated as a real and serious problem. Though with our harness, Opus and Mythos try to come up with an exploit PoC anyway.)

by sfink

5/8/2026 at 11:27:32 AM

It makes sense, thanks, even though that wording is still somewhat confusing.

by ZrArm

5/8/2026 at 3:14:27 AM

Is that number of crashing bugs with PoC available/written down anywhere?

by jerrythegerbil

5/8/2026 at 8:36:44 PM

As described in our blog posts, our harness/pipeline only looks for crashes so all of the bugs resulting from that do have PoCs. There is a smaller number of bugs found by manual auditing that didn't have PoCs but I'd say easily more than 90% of all of the bugs we are talking about had a PoC.

by mozdeco

5/8/2026 at 10:04:02 AM

It may be worth noting that Claude can and will (if it believes you own the code, at least) produce PoC exploits for exploitable bugs that it finds.

My only source for this is personal experience, and no, I can't share any evidence of it.

by jeffparsons

5/8/2026 at 11:02:26 AM

Are you certified for high risk cyber uses? If so then you're correct. If not, then it does not match my experience

by freedomben

5/8/2026 at 12:59:09 PM

The word “exploit” may be doing a lot of work here. In my experience Opus 4.6 is perfectly happy to provide test cases that trigger ASAN, even without the super secret squirrel security access.

But if you ask it to get you a shell it’ll probably tell you to get lost.

by cvwright

5/10/2026 at 5:17:17 AM

I don't have any special certification or arrangement with Anthropic; this is vanilla Opus 4.x via Claude Code.

by jeffparsons

5/8/2026 at 4:28:22 PM

I'm not and I'm doing it right now.

by staticassertion

5/8/2026 at 12:29:33 AM

This isn’t true anywhere people have to make decisions about what to work on first.

by browningstreet

5/7/2026 at 11:42:39 PM

> Mythos didn’t write 271 PoC for vulnerabilities

I think the word you're looking for is exploit?

by dataflow

5/8/2026 at 1:44:57 AM

I dismissed the earlier non-technical blog post as shameless product boosterism for Anthropic. The linked hacks blog (which is a better source than this article) is a welcome release. It's hard to deny there's something real to this now, I think. Mozilla's internal definition of a "vulnerability" is also probably more widely applied than what many would intuit, but it is good that these issues are being taken seriously and fixed.

by kajman

5/8/2026 at 6:36:44 AM

> The linked hacks blog

https://hacks.mozilla.org/2026/05/behind-the-scenes-hardenin...

by mmahemoff

5/8/2026 at 5:32:52 AM

At the same time other companies like AISLE are matching Mythos on vulnerabilities using older models but their own harnass: https://aisle.com/blog/aisle-matches-anthropic-mythos-on-fre...

So while Mythos certainly is real I think you could do the same with Deepseek pro, GPT 5.5 etc...

by apexalpha

5/8/2026 at 12:50:10 PM

I used to work with a guy who would always say "if you're looking for trouble, you are going to find it"

When I hear that "we found X bugs using some new tool", where the standard for bugs is low and doesn't neccessarily require user impact in realistic scenarios, I think to myself- duh! You went looking for bugs, of course you found them.

For a sufficiently complicated product, in my experience, you don't have to look far.

by jonfw

5/8/2026 at 1:45:35 PM

Well it helps if 'looking for bugs' doesn't cost $300 per hour per set of eyes.

by apexalpha

5/8/2026 at 10:01:42 PM

how much does it cost? my understanding of Mythos is that it runs a lot to find issues

by metalliqaz

5/9/2026 at 12:47:38 AM

The things I’ve read from various open source orgs with access to it is that Anthropic is giving them unmetered access for now as part of Glasswing. I’d bet that the corporate partners have to pay though.

by ls612

5/8/2026 at 2:56:20 PM

> if you're looking for trouble, you are going to find it

That's the "'No Way to Prevent This,' Says Only Nation Where This Regularly Happens" of unsafe languages.

There are huge swathes of problems we know how to categorically prevent, but some people won't do it because they're more comfortable believing it was never preventable than accepting any culpability for not preventing it previously.

by tialaramex

5/8/2026 at 6:14:17 AM

As the Hacks.Mozilla article notes: "We began with small-scale experiments prompting the harness to look for sandbox escapes with Claude Opus 4.6. Even with this model, we identified an impressive amount of previously-unknown vulnerabilities which required complex reasoning over multiprocess browser engine code."

by yorwba

5/8/2026 at 5:47:28 AM

Agreed. The earlier blog post did not explicitly claim this, but I think casual viewers were prompted to believe that the Magic of Mythos (TM) went and found (and fixed??) a bunch of vulnerabilities with minimal human guidance, and even contrasted this with their fuzzing infrastructure and made it sound (to me) like it was casting shade on it.

This new post makes it pretty clear that this was all bolted on-top of their existing fuzzing infrastructure, and really just used to get more and better initial hits that a very skilled team is looking at. I assume Anthropic was giving them a very good deal on inference for the positive PR, but I believe these other reports and suspect Mozilla did not really need them.

by kajman

5/8/2026 at 2:14:35 PM

Wasn't AISLE only able to find the same bugs when it was shown only the known faulty code? The worrying part about Mythos isn't the fact that it can find bugs. The worrying part is Mythos being able to find them on its own across entire code base as vast as Firefox then write exploits for what its found with a very basic prompt.

The skill required to find then create zero days is quickly approaching the floor.

by wnevets

5/8/2026 at 2:37:35 PM

I think they split the codebase in smaller files or modules and then tell the AI there's a bug in this particular file and to go find it.

Then they loop over a codebase like this. This way you always point a model at a 'known' bug. And I assume a smaller context window helps with quality.

Not entirely sure it's obviously proprietary.

by apexalpha

5/7/2026 at 7:49:48 PM

Original source: https://news.ycombinator.com/item?id=48051079

It's better because it actually lists a sample of Bugzilla reports that were made public. This topic was discussed previously (36 comments two weeks ago: https://news.ycombinator.com/item?id=47885042), but the part about bug reports being made public is brand new.

by input_sh

5/7/2026 at 7:16:26 PM

I hope to see the day when (or if) the LLMs get so good at spotting and fixing bugs that all that’s left for the Firefox engineers to do is to focus on adding new features.

This isn’t sarcasm. Firefox deserves to be used more. Most people I know don’t use it because “Chrome does almost everything better”, and Firefox can’t compete with the other browsers’ roadmaps.

by Diti

5/8/2026 at 6:49:09 AM

Chrome ain't better in any meaningful way for >99% of use cases. Heck, I am a dev and I use FF with ublock origin only, for past... 10 years?

Same with my wife, after I've explained things to her and she understood how different internet experience can be thats the primary browser.

So please don't put the argument like 'here is crappy underdog but please use it because monopoly is bad and google is a bit evil', its first class experience in everything I have ever thrown at it. Tripple that on mobile, by far the best mobile and useful mobile experience, bar none.

by kakacik

5/8/2026 at 11:06:37 AM

Agreed, though many websites only test on Chrome and are unusable on Firefox. Ramp.com and mailgun come immediately to mind. Zoom also won't let you join with browser on Firefox. There's enough that I have to keep a chrome available for those types of sites. It shouldn't be this way, but it is

by freedomben

5/8/2026 at 1:55:59 PM

> Zoom also won't let you join with browser on Firefox.

FF works for me.

by mnicky

5/8/2026 at 2:38:10 PM

do you hack your user agent or anything to get it to show you the option?

by freedomben

5/8/2026 at 6:28:19 PM

You haven't needed to in quite a while. Well, assuming you're on desktop? I guess I've never tried it on mobile. That feels like asking for trouble.

I ran Zoom in my Firefox desktop browser for a while, but it tended to overheat my laptop. Other things overheat it too, so I don't know how much was specific to Zoom on Firefox.

I just checked. Still gives me the option ("Join from browser" in a less highlighted option, trying to drive you to their native client I guess.)

by sfink

5/7/2026 at 7:58:46 PM

> Firefox deserves to be used more

Totally agree. I even go as far as choosing which website I make purchases on depending if they work on FF, or writing to support occasionally to tell them it's not supported or a feature isn't working properly and this would be appreciated.

I know it pretty much always goes nowhere, but I feel it's what I can do to keep the browser somehow on the radar.

by greggoB

5/7/2026 at 8:55:13 PM

> Firefox deserves to be used more

Part of the problem is, when they stop working on fixing bugs, they start doing Mr Robot things... We just want a web browser. Nobody asked for pocket, or AI...

If they use AI to fix all the bugs, then what else is for them to do, other than maintain syntax compatibility with the various languages they build with? They're just going to go back to making the browser trash again.

by nubinetwork

5/8/2026 at 4:55:57 AM

I have an old apk of Firefox pinned on mobile. I do this because I genuinely believe that for my very limited usecases, the browser has become actively worse.

(Don't worry- I use the system browser for any site I don't fully trust.)

by FeepingCreature

5/8/2026 at 4:05:25 PM

when was the last time you made a contribution financially or physically? Those features were economically tied, who's giving the dough?

by cobalt60

5/8/2026 at 12:21:00 PM

Browser haven't needed new features in a very long time. Extensions were supposed to be the solution to that.

by kgwxd

5/8/2026 at 3:19:49 PM

Wouldn't that quality and availability just allow Chrome to pull ahead of Firefox that much faster?

If Mozilla created some proprietary LLM or harness that they used internally to outpace Chrome that may be a different story, though I also don't see that happening.

by _heimdall

5/8/2026 at 11:11:12 AM

Unfortunately we're probably still quite far from that. This is the best case for LLMs still - the quality of their output didn't matter as long as it worked, and there was a near-perfect oracle for checking if their output worked.

That's a really good use case for LLMs. It also applies to things like finding proofs in Lean and creating test stimulus. In both cases you know automatically whether the output is good, and it doesn't really matter if it isn't.

That isn't the case for most bugs, and definitely isn't the case for actually fixing bugs.

by IshKebab

5/7/2026 at 11:33:47 PM

They've only linked a few tickets, so of course maybe when we see all 271 actual distinct things the insight won't apply but all those I examined ended up as some C++ code with a nasty bug in it.

Firefox is written in several languages, only about 25% of it is in C++ but every single one of these issues seems to touch the C++.

by tialaramex

5/7/2026 at 11:53:27 PM

A general limitation of this approach is that it is only as good as your validator, and there's nothing easier to validate than a test case that creates, say, an AddressSanitizer use-after-free. For subtler issues will we have to more specific validators or will the LLM become better at coming up with other dangerous conditions it will verify? We'll see.

by mccr8

5/8/2026 at 1:54:06 AM

> A general limitation of this approach is that it is only as good as your validator, and there's nothing easier to validate than a test case that creates, say, an AddressSanitizer use-after-free

Sure, but, surely AddressSanitizer would also detect the same problem in the C or Rust which together also make up about 25% of Firefox so... ?

by tialaramex

5/8/2026 at 8:05:56 AM

It's possible Mythos is a lot better at finding vulnerabilities in C++ code than it is for other languages. After all, these models are also based on pre-existing security analysis.

From what I can tell, a lot of these bugs were hardly C++-specific, they just happened in C++ code. Even the most secure Rust can't magically catch things like TOCTOU issues.

by jeroenhd

5/8/2026 at 9:25:31 AM

> Even the most secure Rust can't magically catch things like TOCTOU issues

I suppose it depends what the word "magically" means. A ToCToU race is because you imagined things wouldn't change but they did and in Rust you actually do write fewer patterns with this mistake because of the Mutable xor Aliased rule. If we have at least one immutable reference to a Goose then Rust isn't OK with anybody mutating the Goose, your safe Rust can't do that and unsafe Rust mustn't do that. So the ToCToU race caused by "Oops I forgot somebody else might change the Goose" is less likely because you were made to wrestle with this problem during design - the safe Rust where you just forgot about this doesn't compile.

by tialaramex

5/8/2026 at 11:12:25 AM

It's because they verified the bugs using AddressSanitizer so by construction it was only ever going to find C++ bugs.

by IshKebab

5/8/2026 at 11:31:41 AM

But there is AddressSanitizer for Rust and for C too right? As I understand it AddressSanitizer consumes LLVM IR, so from its point of view some C, C++ or Rust is all the same, and presumably also if you are a famous Russian streamer and you hand wrote LLVM IR instead of using a real programming language that too?

by tialaramex

5/8/2026 at 12:44:45 PM

Yes I was including C in "C++". I dunno how much C Firefox uses.

And I presume you can run AddressSanitizer with Rust but given Rust is memory safe by default, it's only going to find issues in `unsafe` code which is a tiny tiny fraction of most code. Google had a blog post a few months ago where they managed to put some actual numbers on this, because they almost shipped one Rust memory safety bug.

by IshKebab

5/8/2026 at 2:28:51 PM

The lesson for other projects is very different if the reason these are all C++ bugs is just "We didn't ask Mythos for the bugs in Rust" versus if the difference is that asking Mythos for similar bugs in the Rust is futile because it won't find any.

Some of this is tempered if the pattern is that Mythos finds bugs mostly in dusty old C++ but the rates are much, much lower in newer C++, the reverse of Google's earlier finding for human researchers.

by tialaramex

5/8/2026 at 3:34:55 PM

> The lesson for other projects is very different if the reason these are all C++ bugs is just "We didn't ask Mythos for the bugs in Rust" versus if the difference is that asking Mythos for similar bugs in the Rust is futile because it won't find any.

The answer is both of those. They didn't ask for bugs in the Rust code because it wouldn't have found any. They've explicitly set it up to only look for memory safety bugs. It's not going to find any in a memory safe language.

by IshKebab

5/8/2026 at 3:43:16 PM

As long as the memory-safe subset of Rust is used exclusively.

by uecker

5/8/2026 at 9:32:30 PM

Not exclusively, just the vast majority of the time. Which it is.

Read this: https://blog.google/security/rust-in-android-move-fast-fix-t...

Exactly the same as using the memory-safe subset of Python or Java.

by IshKebab

5/9/2026 at 5:55:50 AM

There exist memory safety bugs in Rust projects, so you will find them. Or maybe not with AI, as there is not enough training data?

The 70% number google claims is either BS or google-specific as other projects reported far lower numbers.

by uecker

5/9/2026 at 8:11:44 AM

> There exist memory safety bugs in Rust projects, so you will find them. Or maybe not with AI, as there is not enough training data?

No, there are simply too few memory safety bugs in Rust projects for AI to find any. It found 271 bugs in Firefox so you're talking around 0.3 bugs found in the same amount of Rust.

> The 70% number google claims is either BS or google-specific as other projects reported far lower numbers.

The post I linked didn't mention 70% so I guess you didn't read it. And if you're talking about the "70% of C/C++ security bugs are due to memory safety" stat, then no it isn't bullshit. The same (or very similar) number has been found by numerous companies and projects. Not that that stat is relevant here.

by IshKebab

5/9/2026 at 8:48:35 AM

It is impossible to interpret this number (271) without looking into details. People certainly found plenty of memory safety and others bugs in Rust projects in the past, so I do not understand you claim that there too few to find any.

Curl reported 40% and more recently it dropped to about 20% of issues caused by their use of C. And this even with the requirement to stick to old C89. OpenBSD reported 30%. I assume the 70% either have to do with C++ or - more likely - there is a huge selection bias.

by uecker

5/9/2026 at 8:40:16 PM

> I assume the 70% either have to do with C++ or - more likely - there is a huge selection bias.

Daniel admits that he "might" just be counting differently.

I expect some of it is C++ because there sure is plenty of additional complexity to fit in the same size brain and yet you retain the same absolute requirement to juggle everything at all times or the software blows up but I'd be very surprised if it accounted for this huge disparity.

by tialaramex

5/10/2026 at 6:53:55 AM

Well, I would say google "might" be counting differently or have huge bias.

by uecker

5/8/2026 at 4:04:29 PM

> It's not going to find any in a memory safe language.

I mean, it's not supposed to find any in the unsafe language either, but that's why it was used.

Firefox not only uses unstable Rust features (via the exemption mechanism the same way Linux does it, trained professionals, closed course, do not attempt at home) it also presumably has some volume of its own explicitly unsafe Rust and so there's no reason this could not be checked, and what makes the difference here is whether it was or was not.

by tialaramex

5/8/2026 at 9:30:58 PM

> I mean, it's not supposed to find any in the unsafe language either, but that's why it was used.

No it is supposed to find them in C++, because we all know humans are infallible and it's super easy to write memory errors in C++.

The whole point of Rust is that the borrow checker is infallible (pretty much anyway).

> it also presumably has some volume of its own explicitly unsafe Rust

"Some volume" is so tiny as to be irrelevant. There's no point going to this effort if Rust memory safety vulnerabilities are 1000 times less frequent than in C++.

That number is not made up. See https://blog.google/security/rust-in-android-move-fast-fix-t...

by IshKebab

5/8/2026 at 10:49:03 PM

I assume you intended either "humans are fallible" or "humans aren't infallible" ?

I'd like to understand if Rust was skipped because they assumed it would be fine, skipped purely as happenstance, or in fact tested and found to not be problem. I don't like assuming things when I could measure instead.

by tialaramex

5/9/2026 at 8:12:38 AM

> I assume you intended either "humans are fallible" or "humans aren't infallible" ?

Ha yes.

by IshKebab

5/8/2026 at 3:49:52 AM

Reading this article in the context of the Zig folks refusing to even consider LLM-generated bugs certainly shapes my perspective on what technologies will be in my toolchain.

by benced

5/8/2026 at 11:04:11 PM

Zig devs can run Mythos same as anyone else can. I think you are failing to understand the reasoning behind their decision. It's about contributors, not contributions.

https://kristoff.it/blog/contributor-poker-and-ai/

by tkel

5/9/2026 at 1:41:34 AM

Quoting from https://ziglang.org/code-of-conduct/#strict-no-llm-no-ai-pol...:

> Strict No LLM / No AI Policy

> No LLMs for issues.

> No LLMs for pull requests.

> No LLMs for comments on the bug tracker, including translation. English is encouraged, but not required. You are welcome to post in your native language and rely on others to have their own translation tools of choice to interpret your words.

If they would accept issues filed by AI or written by AI, they should edit their policy to say that.

by benced

5/9/2026 at 3:24:53 AM

None of that is at odds with what I said. An AI isn't a human that you can invest time into developing into a contributor. I'm not sure what you aren't understanding. Did you read the article I linked?

by tkel

5/9/2026 at 4:27:31 AM

I did and I read it previously. It has not changed my interpretation that the Zig project would decline issues mostly generated by LLMs due to their stated policy saying so.

If you’re saying their philosophy is compatible with LLM issues, I agree and I think they should change their policy to reflect that.

by benced

5/9/2026 at 6:51:38 AM

I'm not saying that. It seems as though you aren't engaging with the content of their post explaining their reasoning or are missing the point.

by tkel

5/8/2026 at 9:27:45 AM

Both are right and it depends on which model you use and who submits those bugs. The capabilities of leading models went from 99% noise to 99% valid bugs in essentially a few months. Some projects are flooded with the former and need to take precautions to avoid essential DoS attacks on the maintainers.

by grumbelbart2

5/8/2026 at 3:34:31 PM

Yeah but Zig essentially has till Mythos releases (and arguably less time) to fix their policy. OSS that doesn't take LLM security reports at that point is a liability.

by benced

5/8/2026 at 12:10:13 AM

When I was at PalmSource, I tried to get budget for CoVerity or Fortify (static code analysis tools.). "Too expensive," my management chain said. I spent another year putting together a deal for a lower cost but limited to scanning the network stack. "No, it's based on BSD and BSD is inherently secure," my management chain said (neither is true, btw.)

I eventually left and wound up at Mozilla where there were a number of /* flawfinder ignore */ comments scattered throughout the code.

My guess is that Mythos just ignored the "flawfinder ignore" directives and reported the known vulnerabilities in the code.

by OhMeadhbh

5/8/2026 at 6:46:28 AM

The code is open. If you can prove that's the case you'll have a real news story...

by AndrewDucker

5/9/2026 at 3:12:45 PM

Even a quick glance at the bugs revealed in the blog post would quickly disprove your theory.

by bvisness

5/8/2026 at 3:43:06 AM

What are people's thoughts on how this could affect static analysis tools? I know they are very different beats but often they achieve the same goal. Static analysis tools can be slow, and they report lots of false positives.

I wonder if these models will get good + cheap enough so that people rarely reach for static analysis.

by fg137

5/8/2026 at 7:21:13 AM

LLMs are much better at using tools than replacing tools. The tools are generally a lot faster than trying to achieve the same result with an LLM.

Using LLM coding tools to stay on top of static analysis tool output works very well and adding some guard rails that enforce that there are no issues is probably a good idea. Just like adding CI checks to make sure everything is clean.

As for false positives, it depends on the tool. I tend to avoid tools that generate mostly noise. Most of these tools allow you to disable rules if they produce a lot of noise. Or you can just tell the LLM to fix all the issues. When it's cheaper to fix things than to argue with the rule, just fix it. That used to be really expensive when you had to do that manually. Now it isn't.

I recently did this to an Ansible code base that I needed to refresh after not touching it for a few years. It had hundreds of ansible-lint issues; mostly deprecation warnings and some non fatal other warnings. And 10 minutes later I had zero. Mostly they probably weren't very serious ones but it's a form of technical debt. If you have to fix hundreds of warnings manually, you are probably not going to do it. But if you can wave a magical wand and it all goes a way, why not? I adjusted the guard rails so it now always runs ansible-lint and fixes any issues. Only takes a few seconds extra.

by jillesvangurp

5/8/2026 at 8:41:12 AM

One interesting possibility I've seen discussed[0] is enhancing the static analysis tools to find bug shapes that LLMs originally discover. There's no doubt static analysis tools are faster and cheaper, so they scale better to huge codebases and may allow generalizing the LLM's methods.

[0] https://lwn.net/Articles/1068968/

by lumpa

5/8/2026 at 1:07:45 PM

> What are people's thoughts on how this could affect static analysis tools?

I think these harnesses are _using_ static analysis tools, and probably will continue to do that.

by empath75

5/8/2026 at 6:19:48 AM

I've been thinking about this. Static analysis tools can also be much faster and most are fully deterministic, so including them in CI can catch bugs or latent bugs before they have a chance to land.

I maintain a static analysis tool using in Firefox's CI. False positives have to be fixed or annotated as non-problems in order for you to land a patch in our tree. That means permitting zero positives (false or true), which is a strict threshold. This is a conscious tradeoff; it requires weakening the analysis and getting some false negatives (missed bugs) in order to keep the signal-to-noise ratio high enough that people don't just ignore it and annotate everything away, or stop running it. Nearly all static analysis tools have to do this balancing act.

AI, as commonly used, is given more leeway. It's kind of fundamental that it must be allowed to hallucinate false positives; that's the source of much of its power. Which means you need layers of verification and validation on top of it. It can be slow, you'll never be able to say "it catches 100% of the errors of this particular form: ...", and yet it catches so much stuff.

Data point: my analysis didn't cover one case that I erroneously thought was unlikely to produce true positives (real bugs), and was more complex to implement than seemed worth the trouble. Opus or Mythos, I'm not sure which, started reporting vulnerabilities stemming from that case, so I scrambled and extended the analysis to cover the gap. It took me long enough to implement that by the time I had a full scan of the source tree, Claude had found every important problem that it reported. The static analysis found several others, and I still honestly don't know whether any of them could ever be triggered in practice.

I still think there's value in the static analysis. Some of those occurrences of the problematic pattern might be reachable now through paths too tricky for the AI to construct. Some of them might turn into real problems when other code changes. It seems worth having fixes for all of them now for both possibilities, and also for the lesser reason of not wanting the AI to waste time trying to exploit them. At the same time, clearly the cost/benefit balance has shifted.

They could also team up: if I relax my standards and allow my analysis to write an additional warnings report of suspected problems, with the clear expectation that they might be false alarms, then I could feed that list to an AI to validate them. Essentially, feed slop to the slop machine and have it nondeterministically filter out the diamonds in the rough.

Food for thought...

by sfink

5/8/2026 at 12:40:01 PM

Which tool?

by pabs3

5/8/2026 at 3:48:37 PM

https://firefox-source-docs.mozilla.org/js/HazardAnalysis/in... though I really ought to update that page.

It's for detecting a specific situation: you grab a pointer to a GC-managed object, call something that might possibly trigger a GC even if it probably won't, and then use the pointer. (The GC might collect that object, or it might move it somewhere else.)

Claude is pretty good at weaponizing these UAFs.

by sfink

5/7/2026 at 11:43:58 PM

In the latest Mission Impossible, saving the world depends on recovering the original software of an escaped superhuman AGI from a sunken Russian submarine. Luther writes a "poison pill" that given the original source will instantly one-shot the AI. We were left to wonder how this magical code could have been written, but now we know. Luthor just wrote a Mythos prompt that handed it the source code and asked for an immutable critical exploit.

by delichon

5/8/2026 at 8:13:36 AM

> Anyone building software can start using a harness with a modern model to find bugs and harden their code today. We recommend getting started now.

From what I understand, that is a recipe for getting quickly banned by commercial LLM providers?

by jwr

5/7/2026 at 10:53:22 PM

Curious if people think LLMs will lead to more secure or less secure software in five years.

by crummy

5/8/2026 at 7:43:11 AM

It will probably wipe out a few categories of issues, which is probably a good thing. And those things that still are still insecure can also be translated to some other language.

Translating things to Rust manually was already a thing before LLMs came into the picture. Now with LLMs that's only going to get easier and faster. The long term value is going to come from getting on top of the mountain of technical debt in the form of existing C/C++ code bases that is responsible for the vast majority of memory exploits, buffer overflows, and other issues that despite decades of attention still are being found across major code bases on a regular basis.

Mozilla finding these issues comes on the back of a quarter century of some very competent engineers trying to do the right thing and using all the tools at their disposal to prevent these issues from happening. I have a lot of respect for that team and the contributions it has made over the years to improve tools, testing/verification practices, etc. The issue is not their effort or competence.

The job of taking an existing system that is well covered in test, well documented/specified, etc. and producing a new one that can function as a drop in replacement is now something that can be considered. A few years ago that would have translated into absolutely massive project cost and risk. Now it's something you can kick off on a Friday afternoon. Worst case it doesn't work, best case you end up with a much better implementation.

It's still early days. There are still a lot of quality issues with LLM generated code. But the success/fail rate will probably improve over time.

by jillesvangurp

5/7/2026 at 10:57:36 PM

Both. The skilled will use them to find problems, the unskilled will use them to slopcode insecure software the skilled will have to fix.

by int32_64

5/8/2026 at 1:09:18 AM

Kinda like home-improvement stores, power tools, easily available hardware and youtube tutorials led to both incredibly amazing and durable furniture, as well as janky, ugly and even dangerous furniture.

More tools for more people equals more stuff being made on a wider range.

by mc3301

5/8/2026 at 4:58:38 AM

More secure software, but in the same way that the population is net healthier after a plague.

by FeepingCreature

5/8/2026 at 12:42:07 AM

I’m just happy we’re talking about security.

That will make software safer alone.

by data-ottawa

5/7/2026 at 11:07:10 PM

That depends on which side has more money.

by stavros

5/8/2026 at 12:11:03 AM

In 5 years attackers have an advantage but in the long run I think more secure if developers use LLMs on software to find and fix all of the worse remotely exploitable bugs before release. LLMs are going to force devs to be much more security conscious.

by UltraSane

5/9/2026 at 2:12:30 AM

I think it'll be a war of who has the better LLMs-as-security-scanner.

Ideally, you'd do a comprehensive all-source-code scan, (and the LLM-scanner finds everything during those scans), and fix all the reported defects.

Afterwards, any dev that commits code will run the LLM-scanner on the modified code (and affected areas) and fix any reported defects.

So the black-hat hacker would be shut out unless they get access to an LLM-scanner with better analysis than what the target project is using.

Major LLM-scanners could give priority access for new versions of LLM-scanners to major projects to find any defects in the current source code before any other party could use the reported defects against the project or their users.

So black-hat hackers would be left with developing their own LLM-scanner better/more efficient than existing major LLM-scanners.

Given enough incentive, they might develop such a tool. Look at the market for zero-day vulnerabilities for smartphones, esp iPhones.

by canucker2016

5/8/2026 at 10:09:53 AM

More secure, at least in the cases where the tools are properly applied.

But it also represents more easily available opportunities for blackhats to abuse against the projects where these tools were not being applied.

by vga1

5/7/2026 at 11:38:50 PM

One of the biggest issues in security historically imo is vendors who think, well nobody will ever find this bug so we can deprioritize fixing it. LLMs will prevent vendors lying to themselves which will lead to more secure software.

by bawolff

5/8/2026 at 10:28:37 AM

Less secure because of all the ways attacks can scale out and hackers can contribute vulnerabilities to active projects.

by 2ndorderthought

5/8/2026 at 7:14:11 AM

Really it was not the issue that Opus could not do all these, there was just no incentive to fix bugs. Mythos represented a real marketing use case, so yes thanks for spending money to fix this, but this is not sustainable.

by danieltanfh95

5/7/2026 at 11:24:36 PM

> “That’s the key thing that has unlocked our ability to operate at the scale we’ve been operating at now,” he said. “It gives the engineer a crank they can pull that says: ‘Yep, this has the problem,’ and then you can iterate on the code and know clearly when you’ve fixed it and eventually land the test case in the tree such that you don’t regress it.”

I don't understand much of this paragraph:

* "a crank they can pull that says: ‘Yep, this has the problem,’": as in, ring an alarm? Does the LLM ring th alarm?

* "you can iterate on the code and know clearly when you’ve fixed it": Isn't that true of most bugs, assuming you do the normal thing and generate a test case? And I thought the LLM output test cases itself: "It will craft test cases. We have our existing fuzzing systems and tools to be able to run those tests" And are they claiming the LLM facilitates iterating?

* "and eventually land the test case in the tree": Don't you create the test case before the fix? And just a few words earlier they seemed to be working on the fix, not the test case. And see the prior point about test cases.

* "such that you don’t regress it.”: How is the LLM helping here?

Maybe I'm missing some fundamental unwritten assumption?

by mmooss

5/7/2026 at 11:59:21 PM

Mostly I think this just means that having a test case makes it easier to fix and verify. You can't actually take for granted having a test case when fixing a security bug. Sometimes you only have a crash stack or maybe a vague and hypothetical static analysis result.

> eventually land the test case

This is just a reference to the fact that we don't land test cases for security bugs immediately in the public repository, to make it harder for attackers. You are right that the LLM only helps with creating the initial test case. Things like running the test case in automation is part of the standard development process.

by mccr8

5/8/2026 at 4:44:39 AM

Thank you; that makes sense.

by mmooss

5/7/2026 at 7:43:22 PM

Let's see, how this will improve the daily soc work. I still don't see, what's the big difference between Mythos and Opus, security wise. I'm confident, that this kind of vul detection is a long-term improvement. But does specifically Mythos makes such a big difference to "normal" models? I would love to see, what's the actual difference.

by lschueller

5/7/2026 at 9:37:30 PM

Quantifying the abilities of an LLM is a hard research problem, so I'm not sure if I can describe it in any great way, but Mythos did seem to be fairly clever about putting together things from different domains to find problems.

For instance, in one of the included bugs (2022034) it figured out that a floating point value being sent over IPC could be modified by an attacker in such a way that it would be interpreted by the JS engine as an arbitrary pointer, due to the way the JS engine uses a clever representation of values called NaN-boxing. This is not beyond the realm of a human researcher to find, but it did nicely combine different domains of security.

As the person responsible for accidentally introducing that security problem (and then fixing it after the Mythos report), while I am aware of NaN-boxing (despite not being a JS engine expert), I was focused more on the other more complex parts of this IPC deserialization code so I hadn't really thought about the potential problems in this context. It is just a floating point value, what could go wrong?

by mccr8

5/7/2026 at 10:24:39 PM

Okay, so far it makes sense to me. But is the deal with JS and floating point values, which isn't soemthing super special super rare stuff, only detected and identfied by Mythos while Opus wouldn't get to this point?

by lschueller

5/7/2026 at 11:00:30 PM

There doesn't have to be a huge qualitative discontinuity between Opus and Mythos. It's just that Mythos has reached a threshold where it's finally smart enough that putting it in a loop and asking it to find bugs is suddenly really effective. Especially at the beginning, Mozilla wasn't doing anything particularly clever with prompts. Mythos is just smart enough that the hit rate on obvious prompts is high enough to matter. (Maybe you can get similar performance out of Opus 4.6 with really smart prompts, but AFAICT nobody had managed it until Mythos.)

by IainIreland

5/7/2026 at 7:46:57 PM

Among other things, Mythos seems better at "let me find, weaponize, and stack vulnerabilities until I get end-to-end from untrusted content to root", rather than just finding one thing in a specific identified area.

by JoshTriplett

5/8/2026 at 12:08:49 AM

Results similar to mythos have been duplicated by weaker models.

Think it's more a care of mythos raising widespread awareness that tireless LLMs can be weaponized to dig through code and find that one tiny flaw nobody spotted

by Havoc

5/8/2026 at 6:27:21 AM

The report I saw kind of seemed to be pointing at a flaw and asking "do you see it?" which is not the same thing. I felt a pretty large difference between Opus 4.6's results and Mythos's, so I would be surprised if even weaker models did anywhere near as well. I'd like to see these results, if they are using a decent methodology.

Of course, even the reports with flawed methodology could be suggesting that a great harness + weak model might achieve a similar level of results as a mediocre harness + strong model. But I'd want to see solid evidence for that.

by sfink

5/8/2026 at 1:16:42 PM

There is a phase transition where LLMs match or exceed humans' ability to do something, and from that point on, even if the difference between its previous version is small, it will go from something people use rarely, to something that people use all the time.

There was a time when the entire transportation infrastructure in the US was built around horses. Even after cars were invented, the cars weren't obviously better than horses for most people, especially because there wasn't any infrastructure to support them, but the infrastructure and the cars kept improving to the point where it was better for some people at some things, then suddenly it was better at most things, and then people stopped using horses, and we re-organized our entire transportation network around cars.

But there was never a revolutionary technological change. The technology of cars in the 1930s was the same fundamental technology as the cars in the 1890s. Just at some point it became "good enough" and that was it.

I think when people say that AI is a bubble, they are assuming that anything economically useful that LLMs cannot perform today is _qualitatively_ different from what LLMs can do right now, and that LLMs cannot do it even in theory, without some major technological innovation. But I have a suspicion that there are a large number of valuable things, that once LLMs advance just a little bit more, and harnesses and infra around them is improved a little bit more will just be completely taken over by LLMs.

by empath75

5/7/2026 at 9:04:55 PM

I'm curious about how did Mozilla do bug finding before Mythos? Did they use any non-AI bug finding tools?

by MetaverseClub

5/7/2026 at 9:06:56 PM

The usual sorts of fuzzing and static analyses, using AddressSanitizer and ThreadSanitizer. Also, with a bug bounty program to try to encourage external researchers to report issues. (I work on Firefox security; also I fixed 2 of the bugs linked in the blog post.)

by mccr8

5/7/2026 at 10:17:29 PM

Coverity (similar to lint) scans various open source software products for vulnerabilities.

see https://www.blackduck.com/static-analysis-tools-sast/coverit...

and for Firefox-related alleged defects, see https://scan.coverity.com/projects/firefox

You have to create an account to view the actual reported defects.

There are just over 5000 reported defects still outstanding. I don't know how many overlap with the reported 271 Mythos-reported defects.

by canucker2016

5/7/2026 at 10:28:01 PM

How many of those are false positives though? Probably just over 5000?

You get bug bounties if you report the kind of bugs Mythos identified. There's a reason no-one collected bounties from the "5000 defects" Coverity identified.

The Mythos reports have several examples of chaining a whole bunch of logic in different parts of the program together to exploit something very subtle. The Coverity reports aren't anything like that. These tools aren't remotely in the same league or even universe.

by rockdoe

5/7/2026 at 11:02:15 PM

Yeah, fuzzing, sanitizers, and bug bounties were our main pre-AI tools for finding bugs.

by IainIreland

5/7/2026 at 11:21:38 PM

it's just sad that Coverity represents the best working C++ static analysis tool.

by MetaverseClub

5/9/2026 at 2:27:08 AM

There's also PVS-Studio. They also scan open source projects - see https://pvs-studio.com/en/blog/inspections/

It's hard to convince managers to spend money on static analysis tools (or any development tool).

Unless your company just got bad publicity for a bug and your devs come to you and demonstrate that a certain static analysis tool would have flagged that particular piece of code, most managers would let the beancounter-facet dominate the decision making process.

by canucker2016

5/8/2026 at 6:42:47 AM

The best general purpose one, anyway. Specialty tools can be much better for their niches. Heck, compiler warnings are one such niche tool, and some of them are quite good.

by sfink

5/7/2026 at 11:33:59 PM

Firefox developers do fix issues found by Coverity. I haven't looked at the results in over a decade, but the last time I did there were a few code patterns we used in a lot of places which Coverity didn't like (but were actually okay the way we were doing them) which resulted in a colossal number of false positives.

by mccr8

5/9/2026 at 6:40:45 PM

> We fixed a total of 423 security bugs in releases in April.

And any one or two of them could have led to a random web ad stealing your ssh keys and installing a keylogger to get into your bank account.

I often wonder if we're taking the right route with computer security. Would we be better off having a whole virtual machine for every web page, application or service? Or even physically separate hardware you just vnc into?

by londons_explore

5/7/2026 at 11:44:15 PM

16 day old story

Wired: Mozilla Used Anthropic's Mythos to Find and Fix 271 Bugs in Firefox (41 points, 18 comments) https://news.ycombinator.com/item?id=47853649

Ars: Mozilla: Anthropic's Mythos found 271 security vulnerabilities in Firefox 150 (33 points, 8 comments)https://news.ycombinator.com/item?id=47855384

by gnabgib

5/8/2026 at 12:33:42 AM

No, it's a new post, see also

https://hacks.mozilla.org/2026/05/behind-the-scenes-hardenin...

by mozdeco

5/8/2026 at 12:39:26 AM

That's this post.. and while it's more detail on the same headline (271 bugs) these discussions look the same as 2 weeks ago (and the same as all the bloggers and podcasters discussed)

by gnabgib

5/8/2026 at 1:59:58 PM

This is great, and it reflects some of the changes I've seen in the changelogs of Firefox and many others that have utilize Mythos. I'm closely watching a supposed data wall for AI models and this is a clear indicator that AI capabilities can still become much more advanced even at this point in time. It makes me enthusiastic about future releases and optimizations. Thanks for sharing.

by kittikitti

5/8/2026 at 12:37:25 PM

The flipside of this is that with AI and prompt injection attacks, you don't need a browser vulnerability to be pwned!

by qsera

5/8/2026 at 4:37:03 PM

I'm having more problems with Firefox 150 than I have had with any other browser update in years. I think I might be the only one though?

by languagehacker

5/8/2026 at 4:43:12 PM

On both desktop and Android, I've been getting HTTPS errors that require two Refresh clicks before the actual page loads.

by isatis

5/7/2026 at 7:36:29 PM

The zero-days are numbered

https://news.ycombinator.com/item?id=47853277

by ChrisArchitect

5/7/2026 at 8:31:12 PM

I just hope they don't start ignoring human created bug reports, as there are still many that haven't been fixed for years.

by xacky

5/8/2026 at 12:12:27 AM

I still don't know the exploit count for Mythos. Is it zero, one, or more?

by nnm

5/8/2026 at 6:51:40 AM

More, many more. See the bug reports linked in the post. I checked a few, and all of them had an exploit in them, and there are definitely more than 1 bugs listed.

Well, depending on how you define "exploit"; some might only read arbitrary pointers or just out of bounds. Those would be useful primitives in a chain of vulnerabilities, not exploits themselves. You'll have to read through the first comments yourself, but if you're hoping that this is all nonsense and ignorable hype, you're going to be disappointed.

by sfink

5/7/2026 at 10:58:42 PM

[flagged]

by deferredgrant

5/7/2026 at 11:27:48 PM

I don't find that number very high. In a project of the size of Firefox, a new version of a compiler with stricter warnings or a draconian interpretation of the C standard can easily find 200 new bugs.

New tools find new bugs, but the oligarchy newspapers report on Mythos and not on clang-22.0.

by rem1099

5/8/2026 at 6:39:54 AM

The raw number of things found by Claude (Opus or Mythos) was much higher and would be more comparable to a new clang warning. I vaguely remember seeing a number early on in this process that was in the mid-thousands. The 271 is a small, validated subset of that. None of the 271 were deemed false positives iiuc. Most instances of a new clang warning will be false positives. (Same as most of the raw problems reported by the AI.)

It is still unclear and open for speculation as to what percentage of all security bugs in Firefox today are being found by the AIs (as opposed to not being found at all). It might be that AI is very good at certain types of problems, even if we can't put our finger on what those types are, and that after the initial wave of bug reports the AI findings will slow to a trickle even while many many other bugs remain in the codebase. Or it might be that AI really does detect most instances of some class of problems and all those bugs will now be gone forever, never to return as long as Mozilla keeps paying the token monster. This is closely related to the oft-asked question "are we better or worse off after both attackers and defenders have access to this new capability?"

by sfink

5/8/2026 at 1:18:46 PM

Mozilla is always looking for new revenue, how likely is it that Anthropic payed for this article?

by legacynl

5/8/2026 at 11:10:30 AM

Maybe if Mozilla focused less on new useless features and redesigns, they would be able to focus more on writing secure and bug-free code.

I'm not only talking about big things like

* Pocket,

* several major UI redesigns and

* the offline translations,

but even tiny useless things like

* browser.urlbar.trimURLs,

* putting the search query in the URL bar instead of the URL after searching from the URL bar,

* messing with the Edit and Resend feature for no reason (the good one that updates the content length is still available at devtools.netmonitor.features.newEditAndResend) and

* probably thousands of little shit like this that took a bunch of developer hours to implement.

All of the above should've been add-ons.

And of course, we know Mozilla spends a lot of money on things unrelated to Firefox at all. It's amazing Firefox is somewhat secure and stable compared to Chrome, which is backed by Google with their infinitely deep pockets.

This is a web browser, after all. Something most people use all the time. Something that accepts untrusted input from thousands of sources every day. People use it pretty much every aspect of their lives - banking, personal communication, porn, expressing political opinions. It's used for viewing PDFs, playing media files, for interacting with a whole bunch of APIs (that IMO shouldn't be part of the web, but they are). Security should be top priority.

by Worf

5/8/2026 at 11:16:12 AM

It would be amazing if we didn’t have to have this conversation on every single thread about anything related to Firefox.

Firefox/Mozilla tries literally anything to expand their feature set, customer base, or revenue stream? They need to stop spending money on that and instead spend money on the free product of theirs that I care about, in exactly the way I want.

Google surveilles the entire world, spends huge amounts on lobbying, degrades their own websites on other browsers? Not a peep, usually.

For my part, I pay mozilla for their VPN service, which I’m sure many here would decry as useless spending that should be going to firefox instead.

by mplanchard

5/8/2026 at 11:28:41 AM

> It would be amazing if we didn’t have to have this conversation on every single thread about anything related to Firefox.

If Firefox starts acting maturely, we can stop having these conversations. Until then we see useless crap in every update while most bugs don't get any meaningful attention. Some changes even made things worse than they were before, for example the new Edit and Resend (not so "new" anymore). If Mozilla starts acting the best interest of the user, stops with the ad BS and doesn't try to be everything all at once and actually focuses on Firefox, I would donate. And so would others. If I donate now, I doubt even 1% of my money would go to anything meaningful, like bug fixing.

> Not a peep, usually.

No, fuck Google and Chrome and even anything Chromium-based. Here's the peep from me.

> For my part, I pay mozilla for their VPN service, which I’m sure many here would decry as useless spending that should be going to firefox instead.

Does the profit from the VPN service go to Firefox? If not, what's the point of having a VPN service.

by Worf