6/2/2026 at 7:55:52 PM
> Second, clean data. MAI-Thinking-1 was trained on clean and appropriately licensed data, with AI-generated content excluded from pre-training. This matters for quality, provenance, and control. If we cannot account for what shaped a model, we cannot fully understand its behavior or credibly improve it.
Shots fired?
It would be interesting to see how far "clean data" can go on the scaling laws.
by keeda
6/2/2026 at 9:20:54 PM
I would really like to see what "appropriately licensed data" means. I cannot imagine they didn't copy all open repos on GitHub, and I can't imagine they asked for permission, or are now reproducing the license texts from these repos. It sounds hand-wavy.
P.S. A fairly basic website otherwise, but it unfortunately seems to be hijacking scrolling for no good reason.
by foresterre
6/3/2026 at 1:27:12 AM
Presumably their position remains that training on public repos is fair use and doesn't require a license. If it doesn't require a license, it's still "appropriately licensed".
by ralph84
6/2/2026 at 9:24:50 PM
I assume they took the actual repos' license info into account. I don't understand why they should ask for permission when the license would already allow for it.
by stingraycharles
6/2/2026 at 10:07:28 PM
Almost all licenses attach requirements to redistributing copies of the work, or derivatives thereof. Even permissive licenses do. It's very little to ask when open source devs have provided thousands of hours of free work.
For example, the Apache 2.0 license requires in just section 4(c):
> You must retain, in the Source form of any Derivative Works that You distribute, all copyright, patent, trademark, and attribution notices from the Source form of the Work, excluding those notices that do not pertain to any part of the Derivative Works;
Just because they're tokenized and transformed into a probabilistic mapping doesn't suddenly mean that they weren't copied.
I find it unethical that they (likely) just ingest the IP of all open source repos without asking, and, importantly, without any attribution.
Let me also note that I'm not against LLMs in general. But I do think training on open source must be opt-in, and I look forward to a world with actually ethical and traceable models (i.e. with a record of what they were trained on, like a bill of materials (BOM)).
by foresterre
6/2/2026 at 9:30:18 PM
Which licenses allow usage for training? MIT, BSD, etc. likely do. But I would expect it gets weird for all the various copyleft licences.
by rocqua
6/2/2026 at 9:56:47 PM
Why would it get weird for those?
by cortesoft
6/2/2026 at 10:02:40 PM
Theoretically, copyleft mandates that derivative works use the same license, but it's unclear if that applies to LLM outputs.
by rzmmm
6/2/2026 at 10:34:08 PM
Recently, GitHub has changed their terms of service to use all user data for AI training unless users explicitly opt out. This is probably the way Microsoft has obtained "appropriately licensed data".
by VortexLain
6/2/2026 at 10:53:27 PM
This is almost certainly too recent to have been used for training data, no? Unless they optimistically included most repos somehow?
by mattnewton
6/2/2026 at 9:33:47 PM
It's interesting because their last model series (Phi) was based around the thesis that high-quality synthetic data is better than a large pre-training corpus.
by supermdguy
6/3/2026 at 3:20:45 AM
[dead]
by inquirerGeneral
6/2/2026 at 8:21:56 PM
I doubt any lab would say otherwise; they all _claim_ to use licensed data.
by vdfs
6/2/2026 at 8:41:18 PM
Maybe, but Microsoft, through their partnership with OpenAI, is already involved in major copyright lawsuits. That is probably a driving force for this move, actually... I doubt they would want to tempt fate while those lawsuits are ongoing.
by keeda
6/2/2026 at 10:05:49 PM
All the labs "clean" their pretraining data, and you can keep your pretraining data minimally AI-generated while still spamming synthetic post-training data.
by vanuatu
6/2/2026 at 9:45:25 PM
I'd assume it's not up to par with Qwen-3.5 then, which has been distilling Claude, and the quality of the model is probably a direct result of that.
by swalsh
6/2/2026 at 7:59:07 PM
I'm interested in how much "Clean Data" is synthetic data from "unclean" models...
by onlyrealcuzzo
6/2/2026 at 8:53:58 PM
So, laundered data?
by bicx
6/2/2026 at 8:19:14 PM
> with AI-generated content excluded from pre-training.
> without distillation from third-party models
sounds like zero unless they are lying.
by ertgbnm
6/2/2026 at 8:31:12 PM
> with AI-generated content excluded from pre-training.
Though this is largely impossible these days, unless they pre-trained on pre-AI era data.
by zamalek
6/2/2026 at 10:16:35 PM
That could be. Just use pre-training for language understanding and let the post-training on synthetic data do the heavy lifting.
by stymaar
6/2/2026 at 9:20:19 PM
"how many of those shapes are rectangles?" "sounds like zero unless they are squares"
Adding "unless" to a statement makes it vacuous if the latter clause is weaker than the first clause. I find it hard to believe that a company willing to violate licenses would have scruples about lying about it.
by saghm
6/2/2026 at 9:35:05 PM
Not vacuous, but tautological. Which is different, because tautologies can actually be quite directly informative. Whereas vacuous truths tend to be oblique.
Also, “Microsoft is lying” is not a logically stronger statement, because they might be lying about something other than whether they distilled or trained on AI output.
by rocqua
6/2/2026 at 9:28:53 PM
> Adding "unless" to a statement makes it vacuous if the latter clause is weaker than the first clause
I think that's the point. "How do I say they're lying without outright saying they're lying?"
It's a common rhetorical trick.
by chongli
6/3/2026 at 8:45:18 AM
Or the speaker is just not in the mood to argue with someone whose response will be, "you trust anything Microsoft say?"
by Leynos
6/2/2026 at 8:01:47 PM
“We trained it from the ground up on enterprise grade, clean and commercially licensed data, without distillation from third-party models.”
by xavriley
6/2/2026 at 8:13:04 PM
aka all of GitHub OSS
by azinman2
6/3/2026 at 4:47:18 AM
Not OSS only, likely also the enterprise private repos, with a lot of business secrets.
by rurban
6/2/2026 at 9:17:51 PM
Yeah this is exactly what I was thinking.
by ChicagoDave
6/2/2026 at 10:02:25 PM
Interesting. Wasn't their previous attempt (Phi) trained mostly on synthetic data?
by andai