6/25/2026 at 7:17:33 AM
The giants knew this was coming, and soon 95% of AI tasks will be able to be done by open models (coding, research, cowork style work). So why pay a premium? Why use them at all? This leaves the labs with two options:
1) push the frontier in a way only massive scale can, and cash in on it (Mythos-level cyber security, recursive training, frontier science work). There’s big money for never-before-possible capabilities.
2) own the app layer with their edge in reputation and powered by their infrastructure. Be apple where everyone else is Linux. Do design, coding, research, SMBs, legal, finance, healthcare and more (they are doing all of this).
Will it be enough to justify a Google level valuation? We’ll see how fast they can push it.
by Jackobrien
6/25/2026 at 8:29:31 AM
3) Buy all the RAM, increasing the barrier to entry to push back the tide a bit, in time for a juicy IPO.
by fredley
6/25/2026 at 12:49:36 PM
4) Make it illegal to use anything but regulated models.
by clickety_clack
6/25/2026 at 8:51:44 PM
License the training corpus and encourage copyright suits against outputs from models trained on unlicensed corpora.
by rectang
6/26/2026 at 10:03:07 AM
This won't work if the courts decide that training is fair use, which certainly seems the direction they are going.
by amanaplanacanal
6/26/2026 at 11:46:45 AM
Output is a separate issue from training. Courts will never decide that an identical copy spit out by an LLM is non-infringing simply because it went through an LLM stage. Copyright laundering is wishful thinking by tech folks.
by rectang
6/26/2026 at 6:29:48 PM
I like to think of LLMs as seamless plagiarism machines.
by julosflb
6/25/2026 at 9:11:15 PM
Pretty much what Altman and Amodei mean when they say 'safety'.
by vlian2088
6/26/2026 at 9:00:06 AM
Then they will leave the huge advantage in cost to the competition, I mean their customers' competitors. Hard to fathom how US companies will not want to use the cheaper option when EU and Asian companies can.
by dada216
6/26/2026 at 2:53:39 PM
Why illegal? Just pass this 3000-page FAA-level certification, export controls and KYC. We're a free country, after all!
by stackedinserter
6/25/2026 at 7:26:37 PM
a: If making it illegal fails, make it a Federal procurement requirement to use regulated models. Come up with an audit standard that only fits regulated models. Watch the preference trickle down.
by forshaper
6/27/2026 at 3:04:44 AM
[dead]
by darig
6/25/2026 at 3:11:48 PM
Buying all the RAM can't work forever. Scarcity increases prices, and high prices increase supply, improve RAM R&D budgets, and force users to find ways to economize around low RAM availability.
by samuelknight
6/25/2026 at 5:01:14 PM
It doesn't need to work forever. You just need to delay your competitors long enough that you can IPO to great fanfare, and then leave retail investors holding the bag. Founders and big investors get to cash out, everyone else gets screwed.
by OkayPhysicist
6/25/2026 at 6:21:13 PM
I doubt that works today. Look at SpaceX: the fanfare lasted 3 days before most of the insiders could offload to the retail bag holders. That AI company had the benefit of being attached to the largest technical moat.
The existing AI companies can't even prevent their moat from being distilled by the Chinese token reselling industry.
by thrwaway55
6/26/2026 at 3:48:34 PM
This is what it feels like they went with.
by picofarad
6/25/2026 at 5:54:17 PM
#1 isn't going to happen because we're actually data limited, not compute limited. You can throw all the compute in the world at bad data and it won't make a difference, but an undertrained model with perfect training data will absolutely slay.
#2 isn't going to happen, because these labs have shown they have limited app/design sense, and they also lack the industry connections and domain wisdom to execute.
The way things are actually going to go is that these labs will set up partnerships with huge biotech/engineering/etc firms, and do custom training/inference on specific tasks that promise to be wildly profitable with them, then take royalties on the creation in perpetuity. Why sell inference when you can partner with Pfizer to make a version of Ozempic that also makes people freaky jacked, or partner with Bechtel to make a radically safer, more efficient nuclear power plant?
by CuriouslyC
6/25/2026 at 7:55:52 PM
I don't think "data limited" is true anymore outside of very specialized cases (for instance: https://arxiv.org/abs/2510.01631). As weird as it sounds, training improves a lot with synthetic data.
You do need business development to create those relationships. Saying they "have limited ___" mostly means they "haven't yet hired people who are good at ___". That's been changing already; the Claude app is steadily improving and handling more use cases simply through understanding which tools to use, Anthropic is building more relationships to create more tools, and all the frontier model companies are building relationships with companies that have specialized data and want specialized solutions.
I think we're also seeing the frontier model companies offer partners their own ability to run RL on their own data, and then retrain new models on the same data. That's going to make those relationships VERY sticky in ways that won't be obvious from the outside.
by Schiendelman
6/25/2026 at 9:02:18 PM
Can you point me to the parts in that paper that meet those claims? I am reading something different and want to know what I am missing.
This study seems to show that there are places where synthetic data is useful, especially relative to Common Crawl.
> Pure synthetic data remains non-advantageous over CC; notably, models trained on pure rephrased synthetic data will underperform those trained on CC at larger models.
But the tradeoffs seem to be different at large scale.
> Overall, these model scaling results suggest synthetic data appears comparably less favorable for pre-training larger LMs relative to its utility in data scaling scenarios. Despite outperforming training on CC, larger models are not as tolerant to a higher ratio synthetic data as larger data budgets. This observation aligns with practices where synthetic data is effective for smaller LMs or specific pre-training phases, but less predominantly used for the largest models.
How I am reading it is there are places where it is useful:
> Notably, any mixture involving synthetic data, or pure synthetic data (except pure QA), is projected to achieve a lower irreducible loss than training only on CommonCrawl.
But it also seems that on textbook scale synthetic data, they did show model collapse vs rephrased data.
> These results contribute mixed evidence on “model collapse" during large-scale single-round (n=1) model training on synthetic data–training on rephrased synthetic data shows no degradation in performance in foreseeable scales whereas training on mixtures of textbook-style pure-generated synthetic data shows patterns predicted by “model collapse".
IMHO there are some very specific areas where we aren't "data limited", like math, but as your reference states "Our work demystifies synthetic data in pre-training, validates its conditional benefits, and offers practical guidance."
Note the cost of 30% of the total dataset being synthetic, where the model starts amplifying the generator's biases, leading to a permanent degradation in downstream zero-shot capability on unseen out-of-domain natural tasks.
My takeaway is there is nuance where synthetic data is an amplifier and where it is a problem, and in my mind that paper demonstrates it will not solve the data problem in general.
by nyrikki
6/25/2026 at 8:09:21 PM
> we're actually data limited
Correction: public text data limited.
There's a ridiculous amount of proprietary text and non-text data out there that much of society is run on.
by nomel
6/25/2026 at 5:59:42 PM
What is 'bad data' and 'perfect data' according to you?
by dominotw
6/25/2026 at 6:04:04 PM
Worst possible bad data is where the data is orthogonal to the task, so increasing the data never provides information on the task. Perfect data is where the data exactly encapsulates the task being trained.
by CuriouslyC
6/25/2026 at 10:19:10 AM
> Be apple where everyone else is Linux.
Apple and Linux barely even compete in the same markets. Linux runs on the servers and embedded devices, Apple on the smartphones. Android is technically Linux but not in the "is a good analogy for open weight models" sense because Android is so deeply under the thumb of Google. The main place Linux and Apple actually compete is for PCs and laptops, and that's the market where the thing with 65% market share is Microsoft.
by AnthonyMouse
6/25/2026 at 2:27:15 PM
Apple tried to make servers (they were awesome btw) but lost to Linux.
Linux is on more phones than iOS.
by Gud
6/25/2026 at 8:55:05 PM
You're missing the point entirely and opted to entertain your own framework.
by pseudosaid
6/25/2026 at 9:38:51 PM
It's meaningless to suggest doing what Apple does when faced with Linux when the vast majority of Apple's business isn't competing with Linux. The majority of Apple's revenue is from hardware when Linux is software -- that can run on Apple's hardware.
by AnthonyMouse
6/25/2026 at 10:40:26 AM
You forgot
3. Try to get the government to "certify models" to cause regulatory capture, which is what both Anthropic and OpenAI have been pushing. No certification, no use in business.
by christkv
6/25/2026 at 8:26:30 AM
> own the app layer with their edge in reputation and powered by their infrastructure. Be apple where everyone else is Linux. Do design, coding, research, SMBs, legal, finance, healthcare and more (they are doing all of this).
The problem with this is that there are incumbents in all those spaces doing their own AI agents / platforms, and they're the ones choosing the models they use internally and sell to their own customers. The margins and the possibility to fine-tune using open weight models, as well as the guarantee they'll keep running at predictable costs (no US orders yanking access), make them a very appealing option.
And if you're a company that needs an AI powered legal software, would you buy it from OpenAI/Anthropic, or from someone who you've already bought legal software from before and has the domain knowledge?
by sofixa
6/25/2026 at 8:45:51 AM
Google already owns the app layer, and hardware, and they are a frontier-level AI research firm.
I don't see how Anthropic or OpenAI survives being eaten by DeepSeek et al from the bottom of the stack and Google from the top.
by ForHackernews
6/25/2026 at 9:04:15 AM
The only reason people use Google apps is because they are cheap and reliable. The user experience is awful. Have you ever tried to find a document you had open yesterday in Drive?
by dubbie99
6/25/2026 at 2:35:25 PM
You just go to https://drive.google.com/drive/u/0/recent
by nickthegreek
6/25/2026 at 11:25:34 AM
I used their enterprise chat the other week coz one of the clients used it.
It is truly amazing how bad it is. Made me miss using MS Teams. No software should make anyone miss using MS Teams
by PunchyHamster
6/25/2026 at 9:59:31 AM
Uh? Recently and frequently opened documents always show up on the first screen as soon as I open the app or website.
by hobo_mark
6/25/2026 at 5:17:28 PM
Anthropic is at least renting their datacenters, not owning, so all the capital accounting bullshit is getting laundered by someone else, who will wind up holding that bag.
And Anthropic is currently cornering the enterprise coding market, and they were smart to avoid video. Under current economic conditions they're a lot closer to being profitable than anyone else, and they can take advantage of crashing prices for compute if we hit a datacenter-buildout-glut.
by dualvariable
6/25/2026 at 8:05:55 AM
Won’t they just need to say “best in class, latest models, fastest” and wine and dine a few execs, and those enterprise deals will be signed?
In this case the people tasked with using the product won’t actually mind.
by ed_elliott_asc
6/25/2026 at 8:20:36 AM
Yes, exactly that. Be Azure and Office 365 and Sharepoint and AWS where everyone else is Debian Stable on a USB thumbdrive.
by actionfromafar
6/25/2026 at 8:37:11 AM
Office 365? Ew, Google docs, please.
by fragmede
6/25/2026 at 8:22:49 AM
No one is getting fired for using SotA.
by NitpickLawyer
6/25/2026 at 8:29:35 AM
If the price difference is 2x? Sure.
If the price difference is 50x? No way.
by spwa4
6/25/2026 at 9:00:01 AM
Tell that to Oracle
by RobotToaster
6/25/2026 at 9:56:14 AM
Accenture says "yeah totally CEOs will pay a lot for literal nothing"
by watwut
6/25/2026 at 8:43:38 AM
So long as the benefit:cost ratio is still sufficiently high, I don't think anyone gets fired for not scrimping. Better to encourage positive EV behaviour by your employees than to scare them away by firing them for not being perfectly optimal.
by brainwad
6/25/2026 at 9:34:44 AM
The CEO won't get in trouble, but the employee who can't justify a bad result/prompt?
by ThunderSizzle
6/25/2026 at 5:11:12 PM
Laughs in 2005-era VMWare and EMC...
by dualvariable
6/25/2026 at 4:21:38 PM
Well, getting laid off during the bankruptcy spiral is a form of firing.
But that is months away, so not my problem?
by saltcured
6/25/2026 at 9:50:17 AM
Mythos was outperformed by small, specific local models in multiple OSS projects.
by orwin
6/25/2026 at 10:09:44 AM
I'd love to hear about this! Do you have examples?
by RugnirViking
6/25/2026 at 4:37:53 PM
It might be kind of overlooked when people read about the big scary results from Mythos; the real breakthrough was probably just as much the application of the (very decent) model through a well-engineered wrapper (harness). Other models including codex or glm produce significant findings as well.
Harness example: https://github.com/evilsocket/audit
by kyleomalley
6/25/2026 at 11:29:55 AM
https://aisle.com/blog/aisle-discovers-6-new-cves-in-curl-in...
by orwin
6/27/2026 at 7:58:34 AM
I see "LLM discovers vulnerability in curl" and I get skeptical, given how Daniel Stenberg has talked about the flood of claimed vulnerabilities that weren't real issues once he looked into them (as most HN readers already know, I'm sure). But it looks like these 6 were real issues, that curl patched once they received the reports. Five ended up rated low and one medium, but given the amount of attention curl gets, I'd honestly be surprised if there were any high-severity issues; in fact, having even one medium-severity issue remaining is slightly surprising to me.
by rmunn
6/28/2026 at 4:45:26 PM
To be fair (and for people who didn't click the link), I think most of the vulns were in libcurl, not in curl itself.
by orwin