GPT-5.5: Mythos-Like Hacking, Open to All

4/24/2026 at 12:17:28 AM

They say it's Mythos-like without actually comparing it to Mythos (fair enough, it's not public), but the bar for a model to be Mythos-like has to be that it can produce as many novel, high-severity security vulns as those outlined in the Mythos red-team blog. I haven't seen any other lab produce a report like that yet. The proof is in the pudding.

by JellyYelly

4/24/2026 at 8:05:05 AM

> The proof is in the pudding.

Funny you say that, when the Mythos team have produced no proof either.

by cassianoleal

4/24/2026 at 10:01:11 AM

Not sure if reports like this count? https://www.theregister.com/2026/04/22/mozilla_firefox_mytho...

I don't have a strong opinion on that.

by subscribed

4/24/2026 at 8:14:04 AM

I believe they've stated that it would be too dangerous to release.

by maplethorpe

4/24/2026 at 11:55:32 AM

Open to all, except it’s not, because as soon as you try to use it for security purposes it will shut down and silently route you to a worse model. I was trying to use GPT 5.3 for reverse engineering and got an account warning.

by cedws

4/24/2026 at 8:52:11 AM

Those miss-rate numbers are genuinely eye-opening - dropping from 40% to 10% in what sounds like a single generation is no joke - though it's worth taking any vendor-adjacent benchmark with a grain of salt until the broader security community kicks the tires.

by immanuwell

4/23/2026 at 11:18:09 PM

First you need to get through the safety net. I’ve had many productive GPT 5.4 sessions hit a roadblock of “ethicality” and pollute the context with multiple rounds of trying to convince it to continue.

by WhiteDawn

4/23/2026 at 9:51:16 PM

These plots are terrible. Why is categorical data connected across categories with lines? Why not just use bar plots?

Like in the "Web Vulns in OSS" plot, white box data for Opus 4.7 is not available, but the absurd linear interpolation across categories implies it should be near 60.

by nsingh2

4/23/2026 at 10:14:53 PM

It's just an ad thinly disguised as useful data.

by scottyah

4/23/2026 at 10:31:15 PM

I think the x axis is meant to be time but they screwed it up.

by wmf

4/23/2026 at 10:15:40 PM

Wasn't it already confirmed that small open-weight models were able to detect most of the same headline vulns as Mythos? How is this any different?

by strange_quark

4/23/2026 at 10:40:33 PM

No, they are able to detect errors when pointed at them but they have a lot of false positives... making them functionally useless for a large unknown codebase. They also can't build and run an exploit post-identification. Mythos can find vulnerabilities (purportedly) and actually validate them by building and running exploits. This makes it functional and usable for hacking.

by stanfordkid

4/24/2026 at 11:28:56 AM

The only significant difference between Mythos and the older open-weights models was that Mythos found all the bugs alone, while with the older models you had to run many of them in order to find all the bugs, because each model found only some of them.

For the open-weights models, we know the exact prompts that were used to find the bugs. While the prompts had to be rather specific, a good bug-finding harness should be able to generate such prompts automatically, i.e. by repeatedly running a model while asking it to find various classes of bugs.
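A rough sketch of what such a harness loop could look like. Everything here is an assumption for illustration: the bug classes, the prompt template, the `run_model` callable, and the toy models stand in for real model calls and are not the prompts actually used.

```python
from typing import Callable, Dict, Iterable, Set

# Hypothetical bug classes; a real harness would use a longer, curated list.
BUG_CLASSES = ["buffer overflow", "use-after-free", "integer overflow"]


def make_prompts(bug_classes: Iterable[str]) -> Dict[str, str]:
    """Generate one specific prompt per bug class (illustrative template)."""
    return {
        c: f"Review this file for possible {c} bugs and list each location."
        for c in bug_classes
    }


def ensemble_findings(
    models: Iterable[str],
    source: str,
    run_model: Callable[[str, str, str], Set[str]],
    bug_classes: Iterable[str] = BUG_CLASSES,
) -> Set[str]:
    """Run every model with every generated prompt and union the findings."""
    prompts = make_prompts(bug_classes)
    findings: Set[str] = set()
    for model in models:
        for prompt in prompts.values():
            findings |= run_model(model, prompt, source)
    return findings


def toy_run_model(model: str, prompt: str, source: str) -> Set[str]:
    """Stand-in for a real model call: each toy model sees only one bug class."""
    if model == "model-a" and "buffer overflow" in prompt:
        return {"line 12: unchecked copy"}
    if model == "model-b" and "use-after-free" in prompt:
        return {"line 40: freed pointer reused"}
    return set()


all_found = ensemble_findings(["model-a", "model-b"], "<file contents>", toy_run_model)
```

Neither toy model finds both bugs on its own, but the union over models and prompts covers both, which is the coverage argument being made here.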

For Mythos, we do not know what prompts were used, but Anthropic has admitted that the process was nothing like asking "find the bugs in this project". They also ran Mythos many times on each source file, starting with more generic prompts to identify whether a source file was likely to have bugs, and then following with more and more specific prompts until it became likely that a certain kind of bug existed. At that point Mythos was run one last time with a prompt asking it to confirm that the bug exists and possibly to generate an exploit or patch.
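As a sketch only, that generic-to-specific escalation could be structured like this. The prompt ladder, the `ask` callable, and the toy model are all hypothetical; the actual prompts are not public.

```python
from typing import Callable, Optional

# Hypothetical prompt ladder, from generic to specific (not the real prompts).
STAGES = [
    "Is this file likely to contain security bugs? Answer yes or no.",
    "Which class of bug is most likely present? Answer with a class or 'none'.",
    "Confirm whether the suspected bug exists and, if so, propose an exploit or patch.",
]


def staged_scan(source: str, ask: Callable[[str, str], str]) -> Optional[str]:
    """Run the prompts in order and stop as soon as the model answers negatively."""
    answer = ""
    for prompt in STAGES:
        answer = ask(prompt, source).strip()
        if answer.lower() in {"no", "none", ""}:
            return None  # file dropped; no further (expensive) calls are made
    return answer  # output of the final confirmation stage


def toy_ask(prompt: str, source: str) -> str:
    """Stand-in for a real model call: flags only files that use strcpy."""
    if "strcpy" not in source:
        return "no"
    if prompt == STAGES[0]:
        return "yes"
    if prompt == STAGES[1]:
        return "buffer overflow"
    return "confirmed: unbounded strcpy into buf"


suspicious = staged_scan("strcpy(buf, input);", toy_ask)
clean = staged_scan("return 0;", toy_ask)
```

The early exit is the point of the design: most files are rejected by the cheap generic stage, so the specific confirmation prompt runs only on the few that survive.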

So Mythos must also be pointed to an error. Using it naively will not provide any results like those reported.

There is no doubt that both Mythos and GPT 5.5 are superior to older models, because you can use a single model and hope to have adequate bug coverage. But the difference between them and older models has been exaggerated. If you run older models on your own hardware, you can afford to run many models many times on each file. A serious bug search with Mythos or GPT 5.5 is likely to be very expensive, while likely providing the same results in most cases.

by adrian_b

4/24/2026 at 1:54:43 AM

I casually asked Gemini and Codex (200 USD subs) to find and verify bugs for weeks. They wrote tests, injected mutations, verified fixes. Just prompts.

Also, I had to proxy the remote mainnet through localhost to force them to do penetration and DoS testing.

Mythos is nothing new.

by dlahoda

4/23/2026 at 10:38:15 PM

Do you have a source for this? Not doubting it, but I would like to have something concrete the next time the Mythos horse manure is cited.

by nardons

4/23/2026 at 11:59:03 PM

Probably this: https://aisle.com/blog/ai-cybersecurity-after-mythos-the-jag...

by skirmish

4/24/2026 at 6:03:33 AM

Discussion:

https://news.ycombinator.com/item?id=47732020

“Small models also found the vulnerabilities that Mythos found” (aisle.com)

1,283 points | 12 days ago | 360 comments

by WalterGR

4/23/2026 at 10:34:58 PM

Why does this read like an OpenAI ad?

by mertcikla

4/24/2026 at 1:55:12 AM

> GPT-5.5 doesn’t just improve — it pulls away

I think it's also self-aggrandizing.

by kibibu