GLM-5.2 is the new leading open weights model on Artificial Analysis

6/17/2026 at 10:29:44 AM

It seems to really be a nice step-up and is getting quite close to the frontier. I wish they'd start focusing on the reasoning efficiency now, though. I have a simple (relatively) test task to evaluate LLMs: writing a simple math evaluator library in Nim (it's about 400-600 lines total max), and GLM 5.2 (xhigh which maps to max effort) spent over 15 minutes (!) reasoning, spending about 45k tokens, before it finally wrote the first file.

I know it's hard to improve on that, but now that their models are good enough at raw intelligence, I think this should become a higher priority task.

Currently on https://artificialanalysis.ai/#output-tokens GPT 5.5 xhigh spends 16k tokens total on average, GPT 5.5 high is 10k, Fable 5 33k, Opus 4.8 41k, GLM 5.2 is 42k. GPT 5.5 is extremely reasoning efficient.

Of course if you convert those values to actual request cost, GLM 5.2 will probably beat GPT 5.5/Opus 4.8, but speed matters for a lot of people, I think.

by Tiberium

6/17/2026 at 11:27:13 AM

GLM 5.2 Max = Opus 4.8 Max in thinking behavior. The thinking chain is so similar, and so is the amount of token usage on the output.

If you want reasonable token usage, you need to run it GLM 5.2 at High. There is little drop in quality from Max to High (for most tasks). And it cuts token usage by 2 a 2.5x. GLM 5.2, Max is really something you only need for complex tasks.

In essence, GLM 5.2 is Opus 4.8 its little brother, at a way, WAY cheaper price.

There has been really no training on Opus models going on, really, none i tell you! /sarcasm

by benjiro29

6/17/2026 at 7:41:50 PM

> GLM 5.2 Max = Opus 4.8 Max in thinking behavior

This is insane! I can't wait until technology progresses to the point we can run these things on consumer hardware!

by matheusmoreira

6/17/2026 at 9:13:54 PM

Are there any indications that this will be possible? Consumer hardware will continue getting better but I can't see 512GB RAM in a MacBook Pro any time soon. I'm hoping linear attention techniques plus MoE will make breakthroughs in size/compression and throughput.

by chartpath

6/18/2026 at 7:43:01 PM

> but I can't see 512GB RAM in a MacBook Pro any time soon

Could totally see this being a comment from a forum in like 1994 but swap out GB for MB and MacBook Pro to whatever the popular consumer pc was at the time

by carter2099

6/19/2026 at 3:15:38 PM

Yeah but the price of RAM wasn't increasing at that point.

by r-w

6/17/2026 at 11:17:24 PM

Well, we're probably not going to be running frontier models anytime soon, but I think the general assumption is smaller models will continue to improve until they're sufficiently good frontier models aren't needed.

There's potentially also augmentation through tools, harnesses and RAG to help boost how well they work without tons of parameters.

by nijave

6/17/2026 at 11:46:02 PM

There will be a 1024GB unified memory MacBook Pro.

by deadbabe

6/17/2026 at 9:35:42 PM

Certainly not any time soon, but I have faith it'll happen one day.

by matheusmoreira

6/18/2026 at 4:35:13 AM

In the last ten years laptop memory footprints have, what, doubled at the low end? Smallest MacBook Pro in 2016 was 8GB, smallest is 16GB today? Max I think has gone up 8x meanwhile, 16 to 128?

I wonder if there's a bit of a chicken-and-egg issue where there wasn't much that demanded 10x the RAM, so there wasn't much pressure to develop more or increase production to support it at consumer prices.

There's wayyyyyyy more demand for memory generally now, so assuming it's not a demand bubble that pops rapidly, I'd expect the new normal to end up at a much higher baseline. 512GB would be 4x greater than today's max, so even with the relatively slow last 10 years development pace, give it five years max?

by majormajor

6/18/2026 at 8:14:19 AM

The problem is that the situation in the RAM market might just... not go away. It's locked in for the next couple of years unless the AI market goes pop. Which it might! But if it doesn't, there's no particular reason to think that the incentives for cornering the market like OpenAI have would go away.

We might see that new normal in five years or so. We will see a new normal sooner than that if there's a run on AI because of the sudden availability of DRR fab capacity, but also we'll probably see the level of local models freeze at whatever state they've got to at that point. But an equally likely outcome is that any new DDR capacity that comes online is just immediately absorbed by frontier AI, and consumer devices stay at "just good enough" for a decade.

by regularfry

6/18/2026 at 5:51:35 AM

The new Macbook Neo is 8GB. I think that if we are lucky, the huge RAM demand right now means new factory buildouts which eventually means more supply and prices go back down, and capacity begins to go up. This level of demand was just not anticipated by anyone.

by mikestorrent

6/17/2026 at 11:19:51 PM

you need 8 x 96GB Blackwell or equivalent

so around US$150k which is Small/Medium-Enterprise territory already, but who knows when it will hit "reasonable" home consumer territory

I think there's hope future generations of unified memory machines may get this sort of memory availability when new fabs open in then next couple of years and then ramp up production for a few years afterwards - that makes ~2030s credible at this point, but nobody can really predict the market that far ahead

by muyuu

6/17/2026 at 11:39:00 PM

> I think there's hope future generations of unified memory machines may get this sort of memory availability

I hope you're right. This is a very exciting idea. The weights are out there. The demand is astronomical. The manufacturers just need to make it happen.

by matheusmoreira

6/17/2026 at 11:51:30 PM

there are cheaper ways to do it. not like, consumer-cheap, but I'm setting up a rig for 80% cheaper than that.

I'm a tad worried about triggering a run on the particular hardware I'm buying though so I'll leave it vague here, but hit me up on Discord if you're curious.

by sterlind

6/18/2026 at 4:49:40 AM

Hey, very intrigued about how it can be done for cheaper. Sent a friend request to sterlind on Discord, interested if you do a write up

by sankalpmukim

6/18/2026 at 5:22:47 AM

But at what kind of speed? We're aiming at some speed that would negate the point of even using an off-site provider.

by muyuu

6/18/2026 at 7:09:43 AM

This is quite evident for personal AI but general intelligence with current scaling laws and how model keep getting better with more number of parameters, certainly the path does not converge. Personal AI is more deprived of context today than quality of token. Having a on-system knowledge base paired with Gemma works well to large extend.

by harshit119

6/17/2026 at 4:19:30 PM

With such ridiculously long thinking traces I'm surprised max outperforms high. After all, performance falls off a hill after a certain amount of context, and long thinking traces can fill that up really quickly.

by FooBarWidget

6/17/2026 at 3:36:25 PM

looking at the score this is rather a gemini 3.5 flash competitor, yes, for cheaper, but distance to opus and fable is as big as their price diff.

by maxdo

6/17/2026 at 11:48:54 AM

distillation of thinking models is not particularly effective - both "Open"AI and Misanthropic don't show you the real chain of thought, only its severely downscaled version. both do everything in their power to combat such outrageous copyright infringement, so the bulk of unethically scrapped data the Chinese have is from several generations ago.

by vitalyan123

6/17/2026 at 6:59:14 PM

It is quite likely that the intermediate tokens don’t have ‘semantic import’[0]

There are methods like Habitual Reasoning Distillation or Inverted Reasoning Traces [1] that can help.

While there are reasons to hide the intermediate tokens from a IP protection stand point, there is also a need to hide more effective and efficient generating that doesn’t fit the R1 claims of an aha moment that has been debunked, but is a consumer expectation.

While hidden intermediate tokens do increase the difficulty, it is not a from barrier in itself, especially as they are billed, given information about their length.

[0] https://arxiv.org/abs/2504.09762v4

[1] https://arxiv.org/abs/2603.07267

by nyrikki

6/17/2026 at 8:41:53 PM

Chinese distillation attacks are about as unethical as Robin Hood stealing from the rich to give to the poor. The real unethical scraping was done by Anthropic to train Claude.

To be clear, if Anthropic was using totally licensed data, I'd be sympathetic to these claims. But if you're going to pirate the world's creativity you'd better be willing to gimme dat shit for free[0].

[0] As said by Hungry Santa.

by kmeisthax

6/17/2026 at 11:51:24 AM

>such outrageous copyright infringement

Sarcasm, considering the source of their own training data?

by duskdozer

6/17/2026 at 2:56:20 PM

Considering they called the company "Misanthropic", sarcasm is a safe bet.

by margalabargala

6/18/2026 at 8:06:51 AM

Somehow, I completely overlooked that.

by duskdozer

6/17/2026 at 12:09:18 PM

Narrator: it was sarcasm, indeed.

by orphea

6/17/2026 at 2:06:19 PM

IP for me, not thee.

by baron3dl

6/17/2026 at 3:21:18 PM

FYI: model outputs are not protected by copyright.

by overfeed

6/17/2026 at 7:11:40 PM

For Claude models at least, you can tell to just manually think in the output and it works fine. I do it reguralrly because for creative writing and summarization, they seem to believe they don't need to think at all, and get way worse results.

by Bolwin

6/17/2026 at 7:57:49 PM

this helps so much. i do it too. with some of the newer frontier models its unclear if you can even turn it off in the first party chat apps. havent compared api semantics yet.

by carterschonwald

6/17/2026 at 1:43:48 PM

The companies that did copyright infringement and unethically scrapped data think that copyright infringement and unethically scrapping data is wrong and needs to be stopped.

Though only in particular situations, like when it’s done to them and not when they do it. Cause they have the power and are morally right and know better than you. And if you question this at all, well you’re a threat to American values and a supporter of the Chinese and leading to the break down of Democracy.

This isn’t a type of reasoning argument or manipulation tactic used by the rich throughout history to trick the naive and gullible masses or anything like that. Trust me, I’m rich and I’m morally right. /sarcasm

by mannanj

6/18/2026 at 4:28:58 AM

It’s been amazing to see the arc of tech people going from “evil Disney, copyright is an abomination, information wants to be free” to “OMG copyright is inviolable and AI is taking money out of Plato’s descendants’ pockets!”

by brookst

6/18/2026 at 5:05:17 AM

> taking money out of Plato’s descendants’ pockets

Yeah, remind me - is it Plato's descendants that people are concerned about here, or is it every single author who had any work in Anna's Archive, any work published online, any work published on github, etc?

I think that people are probably upset about the harm to living people who had their work stolen by Meta and other LLM companies - regardless of license, terms of use, or any other attempted protection.

by solid_fuel

6/18/2026 at 1:53:49 PM

Sure, that’s the motte / bailey. Easy to point to living, starving writers who suffer grevious harm, in defense of perpetual copyright. Disney and others use literally this exact argument year after year.

I’m not even disagreeing. I’m just saying the shift in attitude about copyright in the tech space has been sudden, dramatic, and really funny. Remember “you wouldn’t steal a car”? Today’s anti-AI tech contingent are enthusiastically embracing that false equivalence that we all laughed at 20 years ago.

by brookst

6/18/2026 at 3:32:55 PM

Having a static, immovable belief system about something like copyright that is unaffected by seismic shifts in the real world also doesn't seem very logical.

If like, Disney did a 180 overnight and bought rights from Google to scan every writer's saved work in Docs with some flimsy legal argument then a person saying "wait doesn't copyright actually protect that" would make sense. Even if you were previously upset about them suing schools for using 80 year art.

by toraway

6/19/2026 at 4:06:15 AM

Sure. So you’re saying MPAA was right and you’ve come around?

Creative works have always been accretive. There had never been a creative work made out of whole cloth, with no debt to any previous work.

The fact your opinions about creative works change based on who’s profiting does not change that.

by brookst

6/17/2026 at 10:43:13 PM

Reasoning models can coaxed to reason like they do in dedicated reasoning blocks, outside of those blocks: in normal parts of the response.

But Anthropic at least has openly admitted they try to detect that and interfere

by BoorishBears

6/17/2026 at 2:17:16 PM

Supposedly there are “jailbreaks” that expose considerably more of the thinking traces.

by ComputerGuru

6/18/2026 at 6:53:26 AM

Simple trick: Use an agentic tool like Pi or OpenCode that allows you to switch models. First do some chats with DeepSeek or GLM who shows full thinking traces, then switch to Claude or GPT and it's more likely to show full thinking traces.

by woctordho

6/17/2026 at 7:51:06 PM

I don’t understand why there isn’t public dataset for reasoning that can be improved by humans/llms like Wikipedia (ie with auto judging contributions etc).

by mirekrusin

6/18/2026 at 6:47:18 AM

There is already a lot of effort to collect agent traces including reasonings, e.g. see the recent discussion: https://old.reddit.com/r/LocalLLaMA/comments/1u795pb/donate_...

We've been developing DataClaw for this: https://github.com/peteromallet/dataclaw

by woctordho

6/19/2026 at 9:22:11 AM

Did I get it wrong or the first link has dataset with 30 entries only?

by mirekrusin

6/17/2026 at 8:21:48 PM

For reasoning a manually-curated dataset is too small; you need to be able to automatically generate vast volumes of synthetic reasoning data with provably correct answers. That's presumably why Claude and GPT are so good at using Lean (the theorem prover), because they get fed a bunch of synthetic, verifiably correct training data.

by logicchains

6/18/2026 at 1:30:49 PM

Wikipedia is a lot of data as well but we manage to do it, no?

by mirekrusin

6/17/2026 at 8:20:50 PM

You can trivially leak the CoT of any current model, it's not a problem.

>outrageous copyright infringement

>unethically scrapped data

Hahahahaha

by orbital-decay

6/17/2026 at 5:37:37 PM

> It seems to really be a nice step-up and is getting quite close to the frontier.

IMHO it's already surpassed them. I vastly prefer my personal GLM and OpenCode setup to the Claude Code and Opus one that I have to use at work. The former makes way fewer StackOverflow brogrammer-tier mistakes and is considerably better at following instructions. The harness UX is also vastly superior as it doesn't ignore, randomly change, or incorrectly report settings.

Maybe it's the harness and I'd have even greater success with OpenCode and Anthropic, but I think it safe to say that Anthropic's moat is evaporating.

by alexjplant

6/18/2026 at 7:46:43 PM

You would be surprised at how much of an impact the harness has. I switched to Pi and chinese open source models, and models that _I know_ are less capable than sonnet outperform my sonnet + claude code stack at work.

by carter2099

6/17/2026 at 11:00:32 AM

This is a problem I find with opus is will spend so long thinking then going “but wait what if”

To point where I stop it and simple tell it to “start writing code you can work it out as you go along”

Seems writers block also effects LLM

by vorticalbox

6/17/2026 at 2:11:35 PM

https://arxiv.org/abs/2606.00206

In this paper they nerf an LLMs ability to emit waffling thinking tokens like "wait", "but", "alternatively", and the models (they're old, small models in the paper) terminate reasoning faster and perform better. I bet Anthropic is tuning this on their backend.

by robertkarl

6/18/2026 at 2:33:16 AM

Didn't they originally introduce those tokens to make the models smarter by second guessing their "thoughts"?

by addandsubtract

6/17/2026 at 5:24:16 PM

This is super cool. Do you know if any of the inference backends (llama.cpp, vllm, etc) support this technique?

by meatmanek

6/17/2026 at 10:12:25 PM

vLLM supports "banning" certain tokens but I don't know if it can dynamically reduce them.

To my knowledge you can also "ban" with llama.cpp but it is passed in the API call rather than to the server at initialization.

by iaw

6/17/2026 at 10:11:50 PM

I imagine Anthropic would rather train a small control model instead of resorting to sampling hacks

by orbital-decay

6/17/2026 at 12:26:00 PM

I usually have Claude build a plan first, then I put it into an XML file it updates with phases, usually we talk about some of those tasks, and then once its good and I like it, I have Claude implement the plan.

Another thing I tell Claude to do is to not guess, but look at documentation, it messes up a lot less, might use some tokens reading docs, but at least it has a higher success rate code wise.

by giancarlostoro

6/17/2026 at 12:56:15 PM

XML??

by xstas1

6/17/2026 at 1:08:10 PM

Apparently because of how Claude is trained, even the system level prompts go through as XML, it works better with XML "prompting" so I figured I could have it write plans in XML. I need to update my ticketing tool to output XML maybe by default.

https://www.reddit.com/r/ClaudeAI/comments/1psxuv7/anthropic...

by giancarlostoro

6/17/2026 at 1:48:49 PM

Comments later in thread say markdown works just as fine and that it’s more important to organize your plan into sections.

Also just think about it, why would a model trained on the world’s corpus of text (that isnt formatted in xml) perform better with XML? It would be a better study if that post tested markdown, org, xml, json, etc. 10 times to see if their is a difference

by saltsucker

6/17/2026 at 3:53:09 PM

Anthropic’s best practices still include the use of XML: https://platform.claude.com/docs/en/build-with-claude/prompt...

by swingboy

6/17/2026 at 3:08:56 PM

A year or so ago XML worked more reliably for long-lived prompt instructions. Now it is cargo culting.

by adastra22

6/17/2026 at 10:15:03 PM

XML consistently performed better than markdown and JSON in all evals I've ever seen on any model, except for a couple very specific ones.

by orbital-decay

6/17/2026 at 8:59:57 PM

One reason to use XML-like formatting is that it makes the beginning and end of sections explicit. This is less of an issue when the model is generating text but can still be helpful when using templated prompts.

by aesthesia

6/17/2026 at 1:33:47 PM

XML stands for Xtra ML....

by root-parent

6/17/2026 at 3:23:07 PM

I'd like to switch to a sales career--can you give me any pointers?

by noworriesnate

6/17/2026 at 11:51:57 AM

Seriously. Whenever I read the thinking output I get mad and turn down effort to medium or low.

Just output the code and we’ll work through it!

I feel similarly about having codex review claude’s plans. I don’t think I’ve ever seen it catch a major issue. It just points out things that would have inevitably been addressed during implementation anyway.

by mikeocool

6/17/2026 at 4:18:59 PM

A lot of times this is how humans work. Just start 'putting words on paper', 'think by doing', etc. sometimes it's more efficient to see why something won't work after writing a bit of it, and sometimes you get lucky and it works right off the bat

by SubiculumCode

6/17/2026 at 11:14:23 AM

Fable was 20 times worse on that.

It's clear it was the vibe coding model, as like no other model before, fully turned you into his assistant instead of the other way around.

by epolanski

6/17/2026 at 11:56:54 AM

Could it be possible, these firms are optimizing for two things: a) Better performance. b) Gathering data from you to further improve performance later. I've also found the huge amount of planning rather than iteration frustrating. I've felt like I'm teaching a junior!

by RyanHamilton

6/17/2026 at 12:03:09 PM

I think they simply optimize around E2E benchmarks, none of those benchmarks is designed as multi turn assistance to the user, but going from a prompt straight to the final solution.

by epolanski

6/17/2026 at 6:11:26 PM

Exactly. How can "we" develop and encourage benchmarks for multi-turn user assistance? That is what I want. I feel like the models and harnesses push much too hard against this workflow -- that they push you towards letting go and vibe coding, with only your discipline (and desire for a quality and maintainable product) holding it back.

by celrod

6/17/2026 at 1:54:51 PM

more thinking == more tokens === more money LOLL

by happyPersonR

6/17/2026 at 3:53:49 PM

Os there a cost benchmark out there? I wonder how frontier models are doing over time for cost per problem solved.

by overfeed

6/17/2026 at 4:20:08 PM

I think they are optimizing for one-shot performance because that will drive usage. They can’t afford to look bad in the benchmarks. And if that means consuming an order of magnitude more tokens, well, that’s good for business, too.

by drob518

6/17/2026 at 4:16:51 PM

Qwen is notorious for this, too. It’ll sometimes spin in a long loop of “But wait…” paragraphs.

by drob518

6/17/2026 at 11:56:57 AM

I've been having success with Opus but you REALLY have to tame it. Long prompts that list what files to look at, relationships between entities, etc... I went from regularly hitting my daily limit to almost never hitting it. Oh, and also I was being lazy with small changes and stopping that helped a lot too. As you said, it gets in these loops where it's just churning and if you don't stop it it can go on for way too long.

by thinkingtoilet

6/17/2026 at 12:57:04 PM

Hopefully the recent work Moonshot did with Kimi K2.7 Code trickles in to the other open-model labs.

Per AA, while K2.7 Code is roughly on par w/ K2.6 in terms of intelligence, it uses half the output tokens to get there.

by h14h

6/19/2026 at 3:51:25 AM

I've been doing some testing with GLM 5.2 on Fireworks and it looks like the "High" reasoning level uses fewer tokens than even K2.7 Code by a considerable margin (roughly half).

Don't have any evals indicating how it compares on upper-bound quality, but for a well-defined task it seems like GLM 5.2 on "High" is remarkably token efficient. Looking forward to seeing where it lands on the AA index.

by h14h

6/17/2026 at 10:37:55 AM

This is GLM 5.2 Max. GLM 5.2 High which use less than half[1] the tokens.

[1] https://z.ai/blog/glm-5.2

by bertili

6/17/2026 at 10:39:35 AM

Yes, but the Artificial Analysis result is also from GLM 5.2 (max), not high.

by Tiberium

6/17/2026 at 10:57:22 AM

They have this with a lot of models, measuring only the max setting, while the one you'd actually want to use for most tasks is much lower.

by andai

6/17/2026 at 11:15:26 AM

For the brief period with had Fable, I never had to use it above medium.

Low nailed the overwhelming majority of mundane tasks on it's own, medium was good for more complex stuff.

by epolanski

6/17/2026 at 11:58:03 AM

> Of course if you convert those values to actual request cost, GLM 5.2 will probably beat GPT 5.5/Opus 4.8, but speed matters for a lot of people, I think.

GLM5.2 ends up being far more expensive than I thought it would be when I tried it on openrouter. I ground through $5 USD worth of tokens quite quickly.

And this was high, not max.

by cmrdporcupine

6/17/2026 at 9:39:06 PM

Using these open models really makes you realize how subsidized Anthropic and OpenAi's subscription plans are.

by guelo

6/17/2026 at 11:26:11 PM

Absolutely. You can also run codeburn or ccusage and they'll scan the session files and tell you how much you burnt in API token pricing equivalent.

by nijave

6/18/2026 at 4:12:19 PM

Agreed that models should get better at working with rare programming languages like Nim! Using them tends to confuse agents a lot in general. We're working on a paper right now where we compare how token-efficient models are when trying to implement the exact same program in different programming languages, and that's one of the trends we're seeing.

by abgruszecki

6/17/2026 at 2:56:46 PM

I agree. I've noticed that it is quite smart but it has a tendency to doubt itself and overthink. I monitor its internal dialogue and prod it when it does this. They need to optimize the chain of thought early stopping.

by esafak

6/17/2026 at 12:40:02 PM

That's interesting. I gave nearly the same task to Gemma4 31b as a test yesterday. Write a symbolic math engine in Typescript that can perform evaluation and simple expression reductions over +-/*(). It performed the task correctly with minimal reasoning - much fewer reasoning tokens than output tokens.

by robmccoll

6/17/2026 at 3:06:11 PM

Tbh, so what? I googled "symbolic math engine in Typescript that can perform evaluation and simple expression reductions over +-/*()" and got what looks to be viable answers without using any AI model at all. Reciting well established things from memory isn't terribly interesting. Show it a novel codebase and have it implement something within it.

by gbingles

6/17/2026 at 4:28:51 PM

TBH, while your point is a fair one, your attitude is off-putting and needlessly condescending.

by SubiculumCode

6/17/2026 at 4:22:50 PM

So, a natural question would be why a model would ever get it wrong?

by drob518

6/18/2026 at 12:02:40 AM

Reminiscent of https://en.wikipedia.org/wiki/Portia_(spider)

by xyzsparetimexyz

6/17/2026 at 1:26:34 PM

As per stats in other comments, it is frontier, not close to frontier.

by rdsubhas

6/18/2026 at 4:49:50 AM

I thought you could not compare tokens across models because their cost and speed was so different between models.

by HWR_14

6/17/2026 at 11:54:36 PM

You asked for maximum effort, you got maximum effort

by nurumaik

6/18/2026 at 6:28:42 AM

[flagged]

by gateonai

6/17/2026 at 12:07:33 PM

I have a script that ranks these based on codingindex from Artificial Analysis.

All it does is pull a json from their main table page and parses it with the fields I care about (coding).

There used to be a mailing list associated with it but eh ... there wasn't much interest. I use the script every day though.

Current partial output

  score  age  size name
  47.1   58  large Kimi K2.6
  47.5   54  large DeepSeek V4 Pro (Reasoning, Max Effort)
  47.5   70    -   Muse Spark
  47.6   132   -   Claude Opus 4.6 (Non-reasoning, High Effort)
  47.8   205   -   Claude Opus 4.5 (Reasoning)
  48.1   132   -   Claude Opus 4.6 (Adaptive Reasoning, Max Effort)
  48.6   55    -   GPT-5.5 (Non-reasoning)
  48.7   188   -   GPT-5.2 (xhigh)
  50.1   29    -   Qwen3.7 Max
  50.7   1   large GLM-5.2 (max)
  50.9   120   -   Claude Sonnet 4.6 (Adaptive Reasoning, Max Effort)
  51.5   92    -   GPT-5.4 mini (xhigh)
  52.1   55    -   GPT-5.5 (low)
  52.5   62    -   Claude Opus 4.7 (Adaptive Reasoning, Max Effort)
  53.1   132   -   GPT-5.3 Codex (xhigh)
  53.1   62    -   Claude Opus 4.7 (Non-reasoning, High Effort)
  55.5   118   -   Gemini 3.1 Pro Preview
  56.2   55    -   GPT-5.5 (medium)
  56.7   20    -   Claude Opus 4.8 (Adaptive Reasoning, Max Effort)
  57.2   104   -   GPT-5.4 (xhigh)
  58.5   55    -   GPT-5.5 (high)
  59.1   55    -   GPT-5.5 (xhigh)
  62     8     -   Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback)

To see everything, run it like so

  $ curl day50.dev/art-analysis.sh | bash

The repo: https://github.com/day50-dev/aa-eval-email

some key takeaways:

* open models are on about a 4-7 month lag right now depending on how you want to measure it

* if this keeps up, you might see an open-weights model doing claude fable 5 level work before the new year.

if people sign up for the free mailing list (that just does this) I'll go and put it back on ... emails when new model evals drop - it was pretty useful.

by kristopolous

6/17/2026 at 12:42:23 PM

  score  age  size   name
  62.0   8    -      Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback)
  59.1   55   -      GPT-5.5 (xhigh)
  58.5   55   -      GPT-5.5 (high)
  57.2   104  -      GPT-5.4 (xhigh)
  56.7   20   -      Claude Opus 4.8 (Adaptive Reasoning, Max Effort)
  56.2   55   -      GPT-5.5 (medium)
  55.5   118  -      Gemini 3.1 Pro Preview
  53.1   132  -      GPT-5.3 Codex (xhigh)
  53.1   62   -      Claude Opus 4.7 (Non-reasoning, High Effort)
  52.5   62   -      Claude Opus 4.7 (Adaptive Reasoning, Max Effort)
  52.1   55   -      GPT-5.5 (low)
  51.5   92   -      GPT-5.4 mini (xhigh)
  50.9   120  -      Claude Sonnet 4.6 (Adaptive Reasoning, Max Effort)
  50.7   1    large  GLM-5.2 (max)
  50.1   29   -      Qwen3.7 Max
  48.7   188  -      GPT-5.2 (xhigh)
  48.6   55   -      GPT-5.5 (Non-reasoning)
  48.1   132  -      Claude Opus 4.6 (Adaptive Reasoning, Max Effort)
  47.8   205  -      Claude Opus 4.5 (Reasoning)

by papersail

6/17/2026 at 12:54:03 PM

Short comments...

- GPT 5.5 consistently the best, an opinion who gets me constant downvotes here by the Anthropic Marketeer strike force...

- China is going to eat the US lunch on AI

- What have European universities and companies been doing? Its like if, on a parallel past/future, Nikola Tesla and Edison would have created flying Cyberpunk machines, while Europeans researchers, would be getting together to request EU funds, for investigation on how to breed faster horses.

- If Zuckerberg could be fired, after spending a total of $235 billion on AI and having NOTHING to show for...should he be fired?

by tcp_handshaker

6/17/2026 at 1:08:07 PM

None of these models come from universities, European or otherwise.

Mistral is clearly currently not competing for Frontier Model. Whether this is due to a lack of VC Funds or a lack of technical ability or the former arising from the latter would be interesting to know.

The top models are from startups. Among the FAANG only Google managed to get a Frontier model, and they litterally invented the architecture and have more money than they can possibly spend to throw at the problem. Facebook shows that even ungodly amounts of money don't get you there though.

So why did no EU based Startups succeed while two US start ups succeeded? I agree that that's a very important question the EU should ask. The Internet revolution was driven by US companies, and now AI will be as well, with Chinese Open Weights mixed in. The EU consistently can not turn its considerable economic output into fast moving tech firms.

by Certhas

6/17/2026 at 2:13:25 PM

Mistral have moved to actually trying to make money, and been relatively successful; at least if we lived in a normal world.

They've got a heap of contractors working to help industry adopt LLMs. It is just classic consulting work, and they'd look like a really great company if we weren't comparing them to literal $2T+ companies losing money hand-over-fist...

by Quarrel

6/17/2026 at 1:28:15 PM

Apertus was built by universities in Switzerland. Although not frontier it is fully open.

[1] https://apertvs.ai/pages/about/

by sschueller

6/17/2026 at 1:17:47 PM

I'm actually more curious about IBM. Their granite series appears to be nowhere close to competitive.

They had Watson, remember, it won on jeopardy like 15 years ago? They've been at this for a long time

Maybe it's good at something else?

by kristopolous

6/17/2026 at 1:39:56 PM

IBM doesn't do technology they do contracts. Any "technology" is marketing stunts. They hire a bunch of "fellows" outside contractors to make a thing they can be first at or whatever, do the stunt, then get a bunch of 5-10 year contracts with customers off the stunt. They then fuck it up for that length of time but still get paid due to those contracts. After that space of time the folks theyve burned have moved on, rinse repeat. Pretty easy to look back at the timeline of "firsts" they have and see the pattern.

by tekchip

6/17/2026 at 3:43:26 PM

Don’t forget the marketing for the new $1B “initiative” (fill in: mobile, cloud, blockchain, AI,…)

Upon closer inspection the $1B is (a) over 10 years, (b) mostly internal cross-billing between departments.

by JSR_FDED

6/17/2026 at 4:47:43 PM

Yes, but the key point is that nobody got fired for buying it from IBM.

by drob518

6/19/2026 at 2:03:34 AM

"HAL, I want you to train a frontier-level large language model for me."

"I'm sorry Dave, I can't do that"

by tanseydavid

6/17/2026 at 1:27:42 PM

Agree that IBM has no excuse. Specially for how long they have been trying to do AI. Although Watson was a completely different technology.

They had to start from scratch, but dont seem to have the management to be smart enough, to stop doing it in house. They could have just acquired a startup that could build a frontier model.

What is also very ironic since their whole bussiness for the last 15 years, has been buying companies a la CA Associates...

Their previous Watson branding and collapse of Watson expectations cost them one CEO, but the current CEO was part of the same team. They just dont learn....

by root-parent

6/17/2026 at 3:09:49 PM

I view Watson in the same light as Deep Blue, one-offs that brought more prestige and potential share value to IBM than necessarily "moving the needle" in the respective technology.

by vunderba

6/17/2026 at 1:25:48 PM

Granite is OK for speech to text (ASR)

by greenavocado

6/17/2026 at 1:41:13 PM

To be honest, living in Switzerland and speaking with peers, we're just exhausted by the constant AI hype. For a lot of us, the fact that Europe isn't frantically trying to scrape the entire internet and every book in existence for the next massive model isn't a bad thing. The big players are doing their thing, like with the nuclear arms race. We regulate a lot, too much a lot of the time, but sometimes that trickles down to other places too. A lot was done right, imo.

ETH Zurich and EPFL universities recently put out an open model called Apertus (was on the HN front page a few months back), it's not a frontier model, but they built it properly regarding copyright and data transparency.

It might look a bit slow or old-fashioned, but focusing on doing things ethically and legally feels like a much better path than just joining the race to scrape everything.

by marcus_cemes

6/17/2026 at 1:54:48 PM

Sir, I would suggest that if Europe fails to be economically competitive, the downstream implications on European society will produce much worse outcomes than (for instance) data transparency…

Doing things with ethical intentions does not necessarily produce outcomes that are beneficial for society at large.

by dr_dshiv

6/17/2026 at 2:40:01 PM

I'm inclined to agree with you, but you could make the same argument for exploiting natural resources and the environment. I don't think it's being done right at the moment, and it does not seem to be benefiting people as much as certain companies.

by marcus_cemes

6/17/2026 at 7:03:11 PM

Well, is this mad dash for AI producing "outcomes that are beneficial for society at large" yet? So far it looks like its mostly producing a ton of negative externalities and wealth transfer to corrupt elites.

Also, no, abandoning ethics is not an option, what a ridiculous suggestion.

by muvlon

6/17/2026 at 7:12:11 PM

Data transparency and copyright does not constitute “ethics.”

by dr_dshiv

6/18/2026 at 9:33:28 AM

also living in Swizerland and I disagree. Hard.

it's horrible that Europe is so backwards in AI. too much regulation and nothing to show for it. we should be way faster.

there is no money. the culture in both Europe and Switzerland is that you don't fail, while in the US it's perfectly fine to be on your 4th startup because the first 3 failed.

it's not that it LOOKS slow and old fashioned, it IS slow and old fashioned. it's horrible.

by _zoltan_

6/17/2026 at 9:29:14 PM

If these models ever reach the point where they are as good a programmer as a human is (and thus can self-improve completely independently), then there won't be an independent Switzerland much longer. AI race is a race for first place.

> like with the nuclear arms race

MacArthur was about to nuke the Chinese in the Korean war. China knows that nuclear weapons, AI and robotics are a matter of survival and not a nice-to-have.

by tsss

6/17/2026 at 3:00:26 PM

[flagged]

by tw1984

6/17/2026 at 4:12:11 PM

You seem to be confusing Hacker News with 4chan.

by OKRainbowKid

6/17/2026 at 6:45:38 PM

> - If Zuckerberg could be fired, after spending a total of $235 billion on AI and having NOTHING to show for...should he be fired?

Yes, if the premise was true but it’s not.

https://opper.ai/ai-roundtable/questions/bbf5a4e9-204

by wunderlotus

6/17/2026 at 10:17:44 PM

Interesting...but this shows how dumb these AI are.

And they misunderstood nothing to show for as...literally nothing to show for. Yes not factually but he has nothing effectively not much that is competitive to show for so its literally true.

And had they been give this clarification then would have suddenly said: "Oh yes of course, you are absolutely right, you are correct on challenging me on that...."

by tcp_handshaker

6/17/2026 at 1:41:31 PM

Well Europe is famously a laggard when it comes to new tech - in parts of Switzerland, two horses were required be mounted in front to carry cars up until 1925. UK required a person to walk in front of a car and wave a red flag.

by ricardobayes

6/17/2026 at 1:06:40 PM

They did muse spark ... it's not garbage.

Also what are they building it for? I'd think it's to serve ads better or something like that. Maybe Muse Spark fits facebook's needs perfectly...

by kristopolous

6/17/2026 at 1:29:35 PM

Mo Bitar said something like "Meta's LLM is the one you use if you accidentially hit the wrong button in WhatsApp. Its user base is fat-finger phone users."

by jansan

6/17/2026 at 1:49:53 PM

Understood - they're just doing other things. Maybe custom ad rewriting for a target audience or some kind of deep analytics insight into user behavior or translations that optimizes for maximizing purchasing habits over literary accuracy ... I'm just saying their incentives are elsewhere and maybe Muse is serving them well.

I mean that is the smart move here. Focus the model on optimizing the core business. For Meta, that's not coding tools.

by kristopolous

6/18/2026 at 8:07:00 PM

As comparison the WHOLE NASA budget is 24 billion. Meta burned 10x that on AI...

by tcp_handshaker

6/17/2026 at 1:46:29 PM

> China is going to eat the US lunch on AI

They will forever have superior weights?

by applicative

6/17/2026 at 2:12:39 PM

I would imagine it will be a fundamental breakthrough, not weights alone, that are going to usher in the next generation of AI. Perhaps China will in fact make that breakthrough. They certainly seem to have a lot of eyeballs in the field right now.

by JKCalhoun

6/17/2026 at 3:14:02 PM

I think they are already massively winning on efficiency... which is about to matter a lot as the frontier models jack up their prices in order to some day see a profit (and no, Anthropic getting massively subsidized by Elon out of spite doesn't count for long term profits).

by rapind

6/17/2026 at 5:18:51 PM

There has really been one break-through, the actual construction of giant LLMs from the available titanic corpus of text. Even that barely involved much conceptual breakthrough, a few things maybe e.g. transformer. Basically it was a question of the accessibility of a) giant internet corpus of actual people actually saying stuff and b) adequate computing power. The witty surface training, the scaffolding for a chatbot is what made a universal stir. With this, though, we are done with revolutionary breakthroughs. Training for coding involves actual alteration of weights - and as it improves the general utility of the corresponding models will fail. In the end it will be a domain of specialized models. The improvement of this aspect via RLVR etc is what caused a general mania in the programmer milieu.

There is a lot of money in pretending that we are seeing unending revolutions.

by applicative

6/17/2026 at 2:10:16 PM

"…Anthropic Marketeer strike force…"

Might also just be the result of "good will" (that the company has deftly fostered). Other companies might learn from Anthropic in that regard.

by JKCalhoun

6/17/2026 at 3:45:57 PM

“Good will” is easier if OpenAI is your yardstick

by JSR_FDED

6/17/2026 at 4:20:32 PM

As evil as Google is as a company these days [cough disclaimer, used to work here, so biased] I can't help but think that if Gemini didn't... suck, and if they had a coding model at the same quality as GPT 5.5 or Opus 4.8 they'd be completely cleaning up purely on the basis of relative reputations of the companies.

That Google is dropping the ball so badly, or just disinterested in the coding side of things... is either a sign of incompetence, or a lack of interest in losing money in that space. I wish I knew which.

by cmrdporcupine

6/17/2026 at 1:38:12 PM

I downvoted you for your complaining about downvotes fwiw.

And Zuck hasn't spent that much on AI yet. Half of that is projected spending for 2026.

As to whether it's all for nothing, Q1 2026 revenue was up 33% over Q1 last year, driven largely by...better AI-driven ad targeting. So the spending doesn't seem that crazy to me.

by senordevnyc

6/17/2026 at 3:30:26 PM

I also get the downvotes for the GPT thing, and agree with you about 5.5's quality, but TBH I don't think it's Anthropic marketing as just two other things:

1. SamA and his company has a well-deserved bad reputation and Anthropic got some early good PR for basically not being SamA.

2. Claude Code got early head space, Boris and crew basically "invented" this kind of agent, and so has first mover advantage despite its known reliability and cost issues.

3. Most people I talk to haven't even tried Codex for some reason

Also it's uncool to complain about downvotes.

by cmrdporcupine

6/17/2026 at 3:41:37 PM

Lol thank you for sorting.

Are the scores here normalized such that each point difference is equidistant?

by christoff12

6/17/2026 at 3:21:48 PM

  rank  score  age  size   name
  1     62.0   8    -      Claude Fable 5 (Adaptive Reasoning, Max Effort, Opus 4.8 Fallback)
  2     59.1   55   -      GPT-5.5 (xhigh)
  3     58.5   55   -      GPT-5.5 (high)
  4     57.2   104  -      GPT-5.4 (xhigh)
  5     56.7   20   -      Claude Opus 4.8 (Adaptive Reasoning, Max Effort)
  6     55.5   118  -      Gemini 3.1 Pro Preview
  7     53.1   62   -      Claude Opus 4.7 (Non-reasoning, High Effort)
  8     53.1   132  -      GPT-5.3 Codex (xhigh)
  9     52.5   62   -      Claude Opus 4.7 (Adaptive Reasoning, Max Effort)
  10    51.5   92   -      GPT-5.4 mini (xhigh)
  11    50.9   120  -      Claude Sonnet 4.6 (Adaptive Reasoning, Max Effort)
  12    50.7   1    large  GLM-5.2 (max)
  13    50.1   29   -      Qwen3.7 Max
  14    48.7   188  -      GPT-5.2 (xhigh)
  15    48.1   132  -      Claude Opus 4.6 (Adaptive Reasoning, Max Effort)
  16    47.8   205  -      Claude Opus 4.5 (Reasoning)
  17    47.6   132  -      Claude Opus 4.6 (Non-reasoning, High Effort)
  18    47.5   70   -      Muse Spark
  19    47.5   54   large  DeepSeek V4 Pro (Reasoning, Max Effort)
  20    47.1   58   large  Kimi K2.6
  21    47.1   29   -      Gemini 3.5 Flash (minimal)
  22    46.7   449  -      Gemini 2.5 Pro Preview (Mar' 25)
  23    46.5   211  -      Gemini 3 Pro Preview (high)
  24    46.5   16   -      Qwen3.7 Plus
  25    46.4   120  -      Claude Sonnet 4.6 (Non-reasoning, High Effort)
  26    45.6   5    large  Kimi K2.7 Code
  27    45.6   104  -      GPT-5.4 (low)
  28    45.5   56   large  MiMo-V2.5-Pro
  29    45.1   43   -      GPT-5.5 Instant (May 2026)
  30    45.0   29   -      Gemini 3.5 Flash (high)
  31    44.9   58   -      Qwen3.6 Max Preview
  32    44.7   216  -      GPT-5.1 (high)
  33    44.2   188  -      GPT-5.2 (medium)
  34    44.2   126  large  GLM-5 (Reasoning)
  35    43.9   92   -      GPT-5.4 nano (xhigh)
  36    43.4   71   large  GLM-5.1 (Reasoning)
  37    43.4   16   large  MiniMax-M3
  38    43.2   54   large  DeepSeek V4 Pro (Reasoning, High Effort)
  39    43.0   188  -      GPT-5.2 Codex (xhigh)
  40    42.9   76   -      Qwen3.6 Plus
  41    42.9   205  -      Claude Opus 4.5 (Non-reasoning)
  42    42.6   182  -      Gemini 3 Flash Preview (Reasoning)
  43    42.2   99   -      Grok 4.20 0309 (Reasoning)
  44    42.1   56   large  MiMo-V2.5
  45    41.9   91   large  MiniMax-M2.7
  46    41.4   91   -      MiMo-V2-Pro
  47    41.3   121  large  Qwen3.5 397B A17B (Reasoning)
  48    41.0   48   -      Grok 4.3 (high)
  49    40.5   71   -      Grok 4.20 0309 v2 (Reasoning)
  50    40.5   342  -      Grok 4
  51    39.8   54   large  DeepSeek V4 Flash (Reasoning, High Effort)

A longer curated list based on kristopolous’ list, with more models included. For each model, I kept only the two highest-scoring entries. I used DeepSeek V4 Flash as the cutoff, since I consider it the lowest acceptable model that is still locally deployable.

by papersail

6/17/2026 at 3:37:25 PM

My observations:

Surprised to see MiniMax M3 so low on that list, not really my experience, I found it smarter than Gemini for a lot of things, that's for sure.

Also surprised to see Gemini 3.1 ranked that high there. It remains IMHO blatantly incompetent for tool use even in their own harnesses, so I can only assume this benchmark isn't ranking workflow things very high. Gemini can write code just fine. It just can't work well as an agent.

GLM 5.2 and Qwen3.7 max were from my experience fairly expensive to use on a per token price and hard to argue in favour of when the SOTA coding plans have a fixed price that makes them potentially more cost effective. (Yes I know z.ai has a coding plan but I've heard reliability nightmare stories, and it's not very cheap)

DeepSeek is clearly the best value for $$. With the right harness and prompting.

by cmrdporcupine

6/17/2026 at 7:54:55 PM

These results are amazing! I can't believe an open weight model rivals Opus 4.6, my most used model!

by matheusmoreira

6/18/2026 at 4:57:35 AM

That list also places Sonnet 4.6 above Opus 4.6, which doesn't match my experience.

by celrod

6/17/2026 at 1:18:28 PM

you left some models out like DeepSeek and Kimi, for example.

by bel8

6/17/2026 at 1:42:58 PM

It was a truncated output from the script to demonstrate what it does ...

If you really want to see all of them:

https://day50.dev/output.txt

Or run the script

by kristopolous

6/17/2026 at 1:41:14 PM

Because it's not in the top 20 in their benchmark, it's at #23

by ashenke

6/17/2026 at 12:26:03 PM

Consider using decrementing score order (best on top)

by alecco

6/17/2026 at 12:48:05 PM

then I'd have to scroll up over 500 lines after running it every time to see what I care about.

But if that's your thing, here you go: https://github.com/day50-dev/aa-eval-email/commit/1853be6461...

add an argument (any argument) and it will be sorted as your specified. It just works as a toggle flipping the order ... so literally any string will do.

The original link has been updated accordingly with the new code.

by kristopolous

6/17/2026 at 12:58:35 PM

Have it print paginated or just top 10?

by datadrivenangel

6/17/2026 at 1:11:07 PM

only the small ones:

  $ ./art-analysis.sh | grep small

or maybe just the qwen

  $ ./art-analysis.sh | grep Qwen

only the ones in the past 30 days

  $ ./art-analysis.sh | awk '$2 < 31'

I use it in pipes like this.

by kristopolous

6/17/2026 at 12:50:33 PM

[dead]

by spwa4

6/17/2026 at 6:16:27 PM

Note that AA's coding index is only made up of two benchmarks: Terminal-Bench Hard and SciCode. I'm skeptical that it makes a good coding index. It ranks Gemma 4 31B above Deepseek V4 Flash. Having used both of those models for a broad variety of coding tasks I would choose Deepseek every day.

by sosodev

6/17/2026 at 1:17:43 PM

Cool project! Side note: Kind of a bad practice imo to ask people to blindly execute bash from an unknown source.

by bodhi_mind

6/17/2026 at 12:28:04 PM

Thanks for sharing. I'm curious: why didn't you sort with the score descending?

by slig

6/17/2026 at 12:41:17 PM

Because it's currently 511 lines. Why would I want to scroll up to see the stuff I care about? Don't you want the relevant stuff to be right there in front of you?

by kristopolous

6/17/2026 at 1:01:23 PM

I do and that's why I pipe the output to `head -n 20` or use `LIMIT 20` in SQL.

That aside, this is a good script you're running. Thanks.

by duckmysick

6/17/2026 at 1:12:31 PM

But maybe you decide you want to see more. It makes perfect sense for a cli tool to output the most interesting piece of info last: then you can decide on the fly whether you want to scroll up or not.

by tasuki

6/17/2026 at 12:40:23 PM

Not OP but if you run this from the CLI it does make the ordering make a little more sense

by fridder

6/17/2026 at 1:07:13 PM

Because programmers can’t figure out how to have a CLI that prints in a normal order, with the newest stuff on top instead of on the bottom.

Setup a fresh new large monitor. Open CLI. Run command. Watch output at the bottom of your screen. Keep watching the bottom of your screen for the rest of the day.

Sure you can tile windows and it helps but come on. Just have the command/input section in the bottom and the “output” on top. Keep the command bit on the bottom.

by snsnbsne

6/17/2026 at 8:36:35 PM

Seems legit. My experiments with GLM-5.2 so far have resulted in strange hallucinations in the tiniest of places. Like a wrong variable name.

It seems like it's up for the task of complex code, but those little paper-cuts are scary to me. I wouldn't trust this model for anything remotely serious.

by jarjoura

6/17/2026 at 2:40:54 PM

Would be interesting to see where gpt 5.5 pro extended is.

by scrollop

6/17/2026 at 4:24:58 PM

Maybe your script could sort based on score.

by drob518

6/17/2026 at 7:45:21 PM

[dead]

by OkGoDoIt

6/17/2026 at 10:36:20 AM

Why aren't more people talking about this? It's literally Opus 4.7 quality stupid prices. I know providers who are offering this at unlimited tokens for $50 a month. Some are even offering API rates at 3x lower than the official ZAI api rates which are already like 10x cheaper than Opus. (Crof and Umans btw)

This is a huge blow to Anthropic/OpenAI/Google and a massive win for the rest of the world. The official API prices and speeds mean nothing for open source models.

by unrvl22

6/17/2026 at 10:41:26 AM

Be careful about unofficial providers, a lot of them misconfigure models or stealth quantize them. For a while the difference between Kimi on the official API and most third party providers was 20-40%.

by CuriouslyC

6/17/2026 at 12:36:35 PM

Kimi K2 had a vendor verifier: https://github.com/MoonshotAI/K2-Vendor-Verifier

(there's a table which shows comparison between vendors)

Also, it seems there's a general one as well (for all kimi models?): https://github.com/MoonshotAI/Kimi-Vendor-Verifier

by thehamkercat

6/17/2026 at 10:59:06 AM

OpenRouter should be penalising or banning for this.

by cedws

6/17/2026 at 11:56:19 AM

This is my biggest complaint about OpenRouter and I'm a fan. Might be pretty tough at scale?

by kilroy123

6/17/2026 at 1:28:46 PM

They have an "exacto" category with providers they supposedly verified

by orbital-decay

6/17/2026 at 2:23:22 PM

That’s only for tool use.

by ComputerGuru

6/19/2026 at 9:06:05 PM

That's just one part of it. According to https://openrouter.ai/docs/guides/routing/model-variants/exa...

  We use three classes of signals:
   * Tool-calling success and reliability from real traffic
   * Provider performance metrics such as throughput and latency
   * Benchmark and evaluation data as it becomes available

by rsanek

6/17/2026 at 12:27:59 PM

Would that align with their VC-backed incentives?

by alecco

6/17/2026 at 6:34:44 PM

If your users can't trust your product then I'd say that'd be a pretty strong incentive?

by mrngld

6/17/2026 at 10:43:32 AM

the 2 I mentioned both have a fairly large following, who run benchmarks and absolutely will spot issues.

by unrvl22

6/18/2026 at 2:24:35 AM

[flagged]

by pranavj

6/17/2026 at 10:58:47 AM

> Some are even offering API rates at 3x lower than the official ZAI api rates

Looking at openrouter [1], some of the cheaper offerings are for quantized models. Not sure how much intelligence is lost in quantization. And they are not 3 times cheaper. Where did you find 3x lower prices for APIs? I am considering skipping open router and using them directly for that price.

edit:

I see, croft [2] 8bit for $0.50/$0.08/$2.20

[1]: https://openrouter.ai/z-ai/glm-5.2

[2]: https://ai.nahcrof.com/pricing

by stanac

6/17/2026 at 11:37:01 AM

Neuralwatt ... When you reverse calculate the actual energy usage / price on a token basis, the gap is large.

I do not have GLM 5.2 numbers because the whole default max setting is overkill. But GLM 5.1 numbers had it at 12x cheaper then API rates. And about 2.5x more tokens vs zai their own subscription service.

Yes, its FP8 but lets be honest, do we know for sure that even zai runs at FP16? I learned a long time ago with Claude and Codex how much cheating happens on model levels, even from the big boys.

by benjiro29

6/17/2026 at 4:08:48 PM

Please correct me if you have contradicting data but: Neuralwatt's price per token vs price for energy comparison doesn't seem to take into account the cost savings from cache hits that other providers offer on pure token rates. The comparison seems to assume every input token is a cache miss.

On top of that, the cloud offering doesn't seem that well-run, they randomly blocked a colleague's API key for a couple days without any heads up, had a weird rate limiting bug and they have been deprecating models without redirects with very short notice, all while taking weeks to onboard new models. I assume some of these problems would be addressed if we had an SLA/enterprise contract.

It's a promising idea though. They offer a $5 trial credit (with an aggressive rate limit) though so no harm in trying it out.

by spelk

6/18/2026 at 12:32:06 AM

> doesn't seem to take into account the cost savings from cache hits

Absolute false information.

From my usage panel for this month:

* Total Tokens 1.1B * Cached Tokens 1.0B 97% of prompt tokens * Cost energy pricing $26.58

The energy pricing is higher then what i actually pay because its a mix of token billing and partial subscription (60% extra "power").

From the $50 subscription, i have about 3/4 left (4.21 of 16.0 kWh used this billing cycle). Used $5.5 in token billing.

That was running 82.0% GLM 5.1, and 18% GLM 5.2. Yes, i have been busy ;)

My actual usage if we look in dollar value was ~ $18.

For your information, that is cheaper the MiMo v2.5 Pro from Xiaomi as there i was doing around 450.000t per cent. And they have the same 75% cheaper prices like DeepSeek. MiMo has a issue with cache retention between session prompts what hurts them vs DeepSeek. Yes, DeepSeek v4 Pro is 2.5x cheaper but nowhere near GLM 5.1, and especially not GLM 5.2.

In case your wondering, zai subscription light is about 80m token / week limit. So on a token/cent price, neutralwatt is about 3x cheaper (and not 5h, week limits to maximize/frustrate).

> all while taking weeks to onboard new models.

Took them 1 day to include GLM 5.2 ... Yes, the remove old models fast because they do not have the server capacity to keep old models around.

> I assume some of these problems would be addressed if we had an SLA/enterprise contract.

Its a small team, not a big huge company. From my experience so far, seen a 2 timeouts, and sometimes slow speeds as servers get overloaded. For what i am paying for GLM ~5.1~ 5.2 ...

by benjiro29

6/18/2026 at 3:07:50 AM

Your reply doesn't seem to be in good faith. Please provide your formula for calculating effective per token cost.

I am not sure why the small team argument is relevant. This is a crowded market, there are dozens if hundreds of third party inference providers in the world right now. I'm glad that's a good excuse that works on you but I'm not sure why the average user should care.

by greyb

6/18/2026 at 12:34:12 PM

The formula is very easy. Go to the website of neuralwatt, and read ... 5$ = 1Kwh in power for non-subscription usage. For subscription usage you get ~50% more.

Then you actually use the service and see how much tokens you use on average. You calculate the token use vs what you pay. And this gives you a stable number to compare different services and model with, if you want the token cost. This is basic school level reasoning and calculation.

> I am not sure why the small team argument is relevant.

This is relevant to the previous poster his question regarding support and SLA/enterprise support.

> Your reply doesn't seem to be in good faith.... I'm glad that's a good excuse that works on you ...

Question: Do you have a issue with communicating with other people in real life?

by benjiro29

6/18/2026 at 2:52:45 PM

The irony of questioning someone's communication skills immediately after this exchange is hard to miss.

by greyb

6/18/2026 at 4:59:44 PM

Just asking because it seems there is a issue given your tone and responses. This is out of concern...

by benjiro29

6/17/2026 at 12:09:50 PM

IME, unquantised -> FP8 is pretty much lossless. What matters more is having an unquantized KV cache - using an FP8 KV cache can result in a significant drop in quality.

by scrlk

6/17/2026 at 3:28:37 PM

>unquantised -> FP8 is pretty much lossless

Claude Shannon is rolling in his grave.

by johnnyApplePRNG

6/17/2026 at 7:18:17 PM

I don't know, sounds quite similar to his rate distortion theorem (analyzing minimum number of bits/symbol you need to stay under some fixed amount of distortion). I.e. lossy compression with a maximum amount of loss. I.e. "pretty much lossless" compression.

https://en.wikipedia.org/wiki/Rate%E2%80%93distortion_theory

by gpm

6/18/2026 at 8:25:21 AM

"Pretty much" doing a lot of work. But it's kinda analogous to 99% JPEG compression: yes you can detect loss, but you get meaningful compression ratios out of it and the subjective appearance is nigh-on perfect.

Shannon would be pointing out that if you can throw away half the model without apparent degradation, we're nowhere near packing in all the information we could in training. There must be a better arrangement than we've currently got.

by regularfry

6/17/2026 at 2:23:03 PM

Do infra providers reveal that level of implementation detail?

by ComputerGuru

6/17/2026 at 2:46:02 PM

I've seen a few articles from providers talking about KV cache quantisation, but it's not something they explicitly point out like they do with weights.

So you could end up paying more for unquantised weights, only to get silently hit with a quantised KV cache...

by scrlk

6/17/2026 at 11:25:36 PM

The official API is FP8, which should imply that it's lossless.

by osti

6/17/2026 at 10:48:46 AM

To answer the question in your first sentence - because it's VERY computationally (ha) expensive as a human being to keep up with all the options. It's also very hard to figure out how to run a model like this. There's no installer. If you really really care, which 99% of people do not, you have to google a guide, and then find out it's out of date...

I've tried a number of these, and the learning curve is very steep compared to "install Claude Code and pay $100/mo". There is no way saving me $50/month matters compared to figuring that out.

by Schiendelman

6/17/2026 at 11:01:38 AM

But it just works with Claude Code? They have a guide on their website.

https://docs.z.ai/devpack/tool/claude

Here's my setup. I add this to my .bashrc

export ZAI_API_KEY="your_key_here"

alias claudez='ANTHROPIC_AUTH_TOKEN="$ZAI_API_KEY" ANTHROPIC_BASE_URL="https://api.z.ai/api/anthropic" ANTHROPIC_DEFAULT_OPUS_MODEL="glm-5.2[1m]" ANTHROPIC_DEFAULT_SONNET_MODEL="glm-4.7" ANTHROPIC_DEFAULT_HAIKU_MODEL="glm-4.7" claude'

Then I just run claudez

pro tip the same thing works with deepseek https://api-docs.deepseek.com/guides/anthropic_api

Even more pro tip: Claude Code can set this up for you haha

by andai

6/17/2026 at 11:07:08 AM

Sure, I'm not saying I, a software engineer, cannot do this. I'm saying it's significant onboarding friction.

Unless this were a massive differentiator, people aren't going to be "talking about it" the way GP suggests!

by Schiendelman

6/17/2026 at 11:30:09 AM

You're seriously suggesting that setting up opencode or tweaking your claude code config or etc is too much trouble to be worth saving $50 /mo? That's absurd. Doubly so when the audience in question is already using LLMs so ... just ask your existing LLM for help if it seems daunting.

by fc417fc802

6/17/2026 at 11:35:59 AM

I'm not just suggesting that, I'm trying to be crystal clear: it's a gap that probably cuts TAM by 95% or more. Most LLM users are not software engineers. Even those that are don't care enough to muck with their settings to try out a model. Keep in mind I'm not answering the question "Is this hard to install?" - I'm answering the question "Why aren't people talking about this?"

by Schiendelman

6/17/2026 at 11:48:41 AM

Doesn't pass the sniff test. Casuals messing around already go to far more trouble to set up openclaw or comfyui or what have you.

by fc417fc802

6/17/2026 at 12:06:34 PM

What percentage of "casuals"? ;)

by Schiendelman

6/17/2026 at 12:39:06 PM

"Casuals" just use the web interface from the provider, which Z.ai also has

by neonstatic

6/17/2026 at 1:01:44 PM

I would broadly agree with this (based on years of dealing directly with user-facing UX and setup steps). Small hurdles, even easy ones, create larger barriers to adoption then you’d think.

by donohoe

6/17/2026 at 2:17:42 PM

Thats not absurd. Do you know what software engineers make? Do you know what a Starbucks coffee costs? 50 bucks is nothing for someone in that life.

by ramraj07

6/17/2026 at 2:53:01 PM

> it's significant onboarding friction.

It's crazy that apparently writing software without knowing how to edit a single config file is normal now.

by cromka

6/17/2026 at 5:41:58 PM

For me it's about tolerance. When I was 13, I could and would customize everything, so much that the computer repair shop told my father that their son "likely is a hacker or something".

At 40, I could easily configure claude code to use another model, even if there weren't any official guides with a bit of MITM fun, but I don't want to invest my attention / heavily use something that will most likely break in the near future.

by egeozcan

6/18/2026 at 8:19:35 AM

You can ask the agent to change its own configs. Pi does that by default.

by scotty79

6/17/2026 at 3:36:38 PM

The real question is: should the file be edited in emacs or vim?

by bityard

6/17/2026 at 3:30:59 PM

It's crazy that apparently doing math without knowing how to do long division by hand is normal now.

by johnnyApplePRNG

6/17/2026 at 4:37:47 PM

Absolutely ludicrous comparison

by phainopepla2

6/17/2026 at 8:21:32 PM

Is it, though?

I'd argue math is significantly more complex than language.

by johnnyApplePRNG

6/17/2026 at 5:15:51 PM

Not really, you can literally have Claude set it up for you.

by computerex

6/17/2026 at 12:54:35 PM

The friction is near 0 when you can ask another LLM to set it up for you.

by skeledrew

6/17/2026 at 1:25:33 PM

Here are a few frictions I see that reduce reach, in order:

1) You haven't even heard of it.

2) You have to know to look for both GLM and Z.ai. These are usually in the same article when reporting about GLM is written, at least.

3) You have to understand there could be a benefit in trying it; you have to want to try it for some reason. Their own blog post puts it below Opus 4.8 in each of the three benchmarks they used.

4) You have to figure out the pricing, which isn't obviously in the blog post...

5) When I first went to Z.ai, I got an error popup (not logged in): "You do not have permission to access this resource. Please contact your administrator for assistance." I am using a personal computer...

6) When I typed something in the resultant field and pressed enter, I got "Clear Current Chat? To start a new chat, your current conversation will be discarded. Sign in to save chats"

I think today's article helped with 1 and 2, which helps their top of funnel. But they're fighting a big uphill battle.

by Schiendelman

6/17/2026 at 11:37:58 AM

[flagged]

by chen66996

6/17/2026 at 1:31:00 PM

> There's no installer.

There's ZCode (https://zcode.z.ai). Which is like the Codex App.

That's as "easy" as it is for non-devs that you're complaining about.

by re-thc

6/17/2026 at 2:27:47 PM

I'm not complaining about anything. I'm answering a question.

by Schiendelman

6/17/2026 at 3:30:55 PM

How does it compare to OpenCode? I already have too many LLM CLIs installed :(

by qingcharles

6/17/2026 at 7:39:18 PM

It's also very hard to figure out how to run a model like this. There's no installer.

Yes, there is. It's called Claude Code. Point it at the HuggingFace URL and say "Download these weights and build whatever is needed to run them, then test the model."

by CamperBob2

6/17/2026 at 7:53:07 PM

I really miss the time when people thought that the idea of someone telling an un-sandboxed AI "do whatever is needed to X" was unrealistically stupid.

by PoignardAzur

6/17/2026 at 8:02:05 PM

Skill issue

(In all seriousness, I agree this is a problem. That capability is too powerful not to take advantage of, though. Nobody needs to struggle with this sort of thing anymore, but yes, obviously, it should happen in a VM or at least a container.)

by CamperBob2

6/17/2026 at 1:59:52 PM

install opencode, then either pay $10 for their plan, or add an openrouter api key.

by chillfox

6/17/2026 at 1:51:01 PM

I agree with this.

I'd pay for an out of the box solution. i.e. an Installer with updates

by gerryf2

6/17/2026 at 10:58:37 AM

In my org everyone is extremely Claude-pilled to the point you’d think it’s the only LLM that exists, purely because it caters to non-engineers within enterprises.

by cedws

6/17/2026 at 10:37:08 AM

I cancelled my claude sub after realizing I can burn 300m tokens a day of this quality, for $50 a month.

by unrvl22

6/17/2026 at 4:16:36 PM

Which coding plan are you using? How are you finding it?

by spelk

6/17/2026 at 10:45:24 AM

> Why aren't more people talking about this?

Wasn't this released like 2 days ago? Everyone is still evaluating and playing around with it, things like the submission is just starting to come out. Give it some days at least before jumping to conclusions, ideally weeks.

by embedding-shape

6/18/2026 at 12:30:29 AM

I've tried Chinese open models few times before. They were fine, but they didn't come close to the benchmarks they were claiming.

Now, maybe GLM 5.2 is close to Opus 4.7, but I don't wanna keep checking them and keep finding that they're still benchmaxing and aren't at GPT (my choice) or Opus level. The boy who cried wolf, I guess.

by sinatra

6/18/2026 at 3:10:54 AM

Yes, my experience has been the same as yours. I find that the performance of open models is quite acceptable, even good, at one-off questions or small tasks. But they are quite unreliable at long horizon goals.

by enraged_camel

6/18/2026 at 1:51:54 AM

Which of those providers are:

1. Keeping your data private on in the US

2. Not training on it

3. Not quantizing the model

4. Offer reasonable latency adds rate limits

by shostack

6/18/2026 at 8:41:58 AM

OpenRouter has a list of providers, looks like NovitaAI would meet those criteria. Though not for $50/mth for 80/M tokens, which I assume is the Z.ai subscription pricing.

https://openrouter.ai/z-ai/glm-5.2

https://novita.ai/models/model-detail/zai-org-glm-5.2

by SyneRyder

6/17/2026 at 12:50:39 PM

Isn't it closer to sonnet?

by knollimar

6/17/2026 at 8:14:22 PM

The Chinese open weight models have been ahead of Sonnet (at least for coding) for a couple months now. I tend to take benchmarks with a huge grain of salt, but in my own experience, the latest versions of Kimi, MiMo, and GLM (pre-5.2) had already surpassed Sonnet in terms of output quality for a fraction of the price.

With that said, I'm excited to try GLM 5.2 because I still end up reaching for Opus and GPT 5.5 for many tasks because the open models tend to get stuck more often on complex problems.

by RussianCow

6/17/2026 at 10:38:18 PM

I found sonnet preferable to k2.6 but 2.7 code for kimi seems better anecdotally

by knollimar

6/17/2026 at 12:54:46 PM

Definitely opus level for coding.

by redox99

6/17/2026 at 1:03:34 PM

Do you have benchmarks or at least anecdotes to back that up? I'm not arguing with you; I would just love to see some proof that open models are getting as good as Anthropic's models.

by smith7018

6/17/2026 at 1:30:49 PM

I've been running some test prompts comparing frontier models for webdev, particularly pretty visualizations, physics / orbital simulations, etc.

Do note that GLM is not multi modal, which can be a deal breaker. And these open models are not good outside coding.

by redox99

6/17/2026 at 1:29:13 PM

look at benchmarks, use the model yourself. Im usually first to call BS on every chinese model that says they are as good as Opus. this is finally the first one that actually is. It is a massive jump from every other previous chinese model.

by unrvl22

6/17/2026 at 2:01:49 PM

"use the model yourself"

I wish I had the time to set it up and work on side projects but unfortunately life and work have been crazy (as I'm sure many here feel). That's why I asked for anecdotes about it.

by smith7018

6/17/2026 at 5:01:52 PM

Oic I misremembered OAI scores, I thought Sonnet had 51

by knollimar

6/18/2026 at 2:24:14 AM

[flagged]

by pranavj

6/17/2026 at 10:39:22 AM

I’m not that interested in models that I can’t run on my desktop for ~0€, which is my AI budget.

by Hamuko

6/17/2026 at 11:09:12 AM

Electricity cost seems to be about $30/month for a 32B model on a GPU. It's probably better on Apple hardware.

https://github.com/QuantiusBenignus/Zshelf/discussions/2

Not accounting for hardware, of course :)

by andai

6/17/2026 at 11:20:54 AM

My Mac Studio uses about 60–80 watts whenever I’m running a model (as measured by the system metrics), so it’s less than 2 kWh/day at full blast. Electricity is like 0.125 €/kWh, so that 24-hour period would be <0.25 €.

Not accounting hardware in my costs, since I didn’t buy my hardware for running models. Running models is just something it can do in addition to what I got it for.

by Hamuko

6/17/2026 at 12:19:26 PM

The price, processed tokens, and output can be anything, it just depends on what GPU it is.

Nvidia GPUs are much more efficient than Apple hardware for inference(and training).

by NorwegianDude

6/17/2026 at 10:46:07 AM

Cool beans. You're not the target audience then.

by igravious

6/17/2026 at 10:55:05 AM

Did I claim I was? I just said why I and people like me are not talking about it.

by Hamuko

6/17/2026 at 11:09:26 AM

and he said its cool

by simianwords

6/17/2026 at 11:01:46 AM

> unlimited tokens for $50 a month

link?

> Why

imho everything but opus produces unusable code (fable was even better...), eg gpt5.5 seems to write the absolute worst code that still technically solves the problem; tbh I'd be totally willing to trade "raw intelligence" for "code taste"

more labs need to figure out whatever anthropic did to destroy everybody else on frontiercode bench

by anuramat

6/17/2026 at 1:28:32 PM

Opus has the nickname "Slopus" in a lot of circles for a reason. It can write nice code in isolation, but the way it organizes that code and its rigor in addressing edge cases/making sure things are robust leave a lot to be desired. Opus is particularly famous for having a real problem reinventing stuff that already existed in the codebase because it wanted to get to work before exploring sufficiently.

by CuriouslyC

6/17/2026 at 5:06:18 PM

what you're describing doesn't sound like such a big deal -- it's (A) obvious during review, (B) easy to fix in a single prompt, (C) simple enough to fix manually, (D) can be mitigated with tokenmaxxing (agent review passes, prompting, subagents, etc)

regarding edge cases -- less is more in my experience, as removing is harder than adding

by anuramat

6/17/2026 at 12:19:35 PM

I was surprised that GLM 5.1/5.2 are not vision models - they are text input only.

That's actually pretty uncommon these days. All of the OpenAI/Anthropic/Gemini models accept images, and so do the other leading open weight families - Gemma 4, Qwen 3.6, Kimi 2.x.

In GLM's case image input would be useful because it's a model that scores very highly for tasks like web design, but without image input it can't take a screenshot and output HTML+CSS.

Don't get me wrong, GLM is a phenomenal model, but the image thing is a bit of a gap.

by simonw

6/17/2026 at 1:44:03 PM

Configure a subagent in your coding harness to spin up a new sub-session with any vision model for those tasks and feed the result back to the main model. No need for "one model that does everything"

by 0xbadcafebee

6/17/2026 at 8:00:45 PM

That doesn’t work well in a lot of scenarios. The text LLM doesn’t know what to look for in an image before it sees a description, you might need multiple rounds of back and forth.

by ricardobeat

6/17/2026 at 8:30:48 PM

Vision decoding outside of the latent space of the model is lossy, but claude opus's vision isn't that great outside of UI screenshots. I mean it works in a pinch. At least in my testing, if you're looking at non UI images, there are better image to text models that can turn into a very precise documents that any LLM can easily parse.

by jarjoura

6/17/2026 at 4:19:07 PM

Are you suggesting it should summarize the image in text or generate it in HTML or something else?

by WASDx

6/17/2026 at 12:28:31 PM

I don't see this being such a big gap. There are some use-cases for sure but apart from UX/UI work it is not really needed. Besides, none of the frontier models can replicate actual images - the can approximate at least in my own experience.

by _pdp_

6/17/2026 at 12:52:09 PM

One of my tests for a new model is dumping in a screenshot of a web page and seeing if it can recreate it from scratch in HTML and CSS.

Even the local models I run on my Mac are getting surprisingly good at that now.

by simonw

6/18/2026 at 12:33:51 AM

a pretty fun and quick tests i do with vision models is to screenshot the hackernews homepage and ask the model to return a json representation of the screenshot - qwen 3.5 0.8b did surprisingly well at this.

by kamranjon

6/17/2026 at 1:36:31 PM

Using llms to generate docx. Being able to rasterize and review is an important part of the process.

by tiahura

6/17/2026 at 6:16:23 PM

I've been using Google ai studio as a free vision bridge. Gemma 31B is dummy capable at vision and at 1500 rpd its basically unlimited.

by x3cca

6/18/2026 at 3:02:30 AM

Agreed, that's actually one step that will make people adopt it widely for customer facing AI Agent!

by abby3010

6/17/2026 at 1:31:49 PM

I had the same reaction with Deepseek V4 ! It would be more useful as a vision model

by ashenke

6/17/2026 at 11:42:34 AM

Artificial Analysis coding benchmark shows GLM5.1 on high pretty close to GPT5.5 xhigh in cost to run, with GPT5.5 on medium significantly less expensive. Compared to GPT5.5 medium GLM5.1xhigh is twice the cost and half the intelligence. They don't have GLM5.2 on there yet, but that'd a big gap to bridge.

https://artificialanalysis.ai/agents/coding-agents?coding-ag...

I thought I was "holding it wrong" until DeepSWE came along -- personally it seems to match my own experiences pretty well. Really makes me wonder how legitimate some of the internet noise is about open models. There's surely some use cases for them, not everything needs the absolute frontier (GPT5.5 on low is awesome), but if you want to be near the frontier everyone needs to be honest about the fact that we're only talking about Opus, Fable, GPT5.5.

by mrngld

6/17/2026 at 2:02:18 PM

It got 46.2 on DeepSWE in Z.ai's own run[1]. That would put it between Opus 4.7 xhigh and Opus 4.8 medium.

[1] https://z.ai/blog/glm-5.2

by undecidabot

6/17/2026 at 7:05:49 PM

If that ends up being true, GPT5.5 at 70 (and presumably Fable a bit ahead of that) is still in a different league, which was partly my point. To listen to online chatter, GLM5.2 is a tectonic shift in the landscape. In reality, it's just interesting. Probably safe to bet once the DeepSWE benches all get fully updated it won't even be on the pareto frontier.

I'm not accusing anyone specifically, but I've noticed Chinese bots swamping certain YouTube channels that, for example, cover US defense industry news. They'll downplay any and all technical advances, play up China's dominance, US cowardice, etc. All very transparent. I suspect some of the online conversation about open Chinese models is driven by that. How often do you see people talking about Mistral or Trinity? Never. Because they don't play that game.

by mrngld

6/17/2026 at 7:41:12 PM

There are definitely some Chinese bots + actual people (imagine that!) who like to talk up Chinese models, I'm one of them but I like to find out how good these models really are before saying anything.

GLM definitely isn't opus level yet but it's for sure good. I think it lacks some knowledge (when coding) that the frontier models possess, which is expected given that the model is probably quite small when compared to the frontier.

But people don't say much about Mistral, probably because they are nowhere as good.. And they don't have large population behind them to actually use them.

by osti

6/19/2026 at 1:29:50 PM

Someone hasn't been following Le chaton fat news, agi is coming

by desterothx

6/17/2026 at 1:29:25 PM

with open models you can get a subscription with privacy, at the same cost as codex.

openai, google and anthropic subscriptions are not available with privacy.

looking at the link there it's interesting that going from cursor cli to codex cli take gpt 5.5 from 7th to 3rd. but they didn't do open model in codex.

so, hard to say it's for sure a model benchmark. maybe open models are just shit at swe agent harness...it's not the most parsimonious explanation though.

by lukewarm707

6/17/2026 at 2:11:24 PM

> with open models you can get a subscription with privacy

Unless you're running it locally, aren't you just trusting some other entity?

by vadansky

6/17/2026 at 6:10:32 PM

While true - there are laws about saying you are doing the things you are doing, especially in certain regulated environments. If you are in the same country as the entity you are trusting, you have recourse if they are not living up to your trust usually in some form or another.

by conception

6/17/2026 at 2:50:47 PM

correct, you are trusting another entity.

however the legal terms are different, openai reads your data. they store it for 30 days, but of course once it hits the disk you can keep as long as you like in a civil case like nyt v openai.

the same for google and anthropic. so, it's not always nice if someone is paid to read your data for safety. people upload sensitive matters, personal videos and so on.

i wouldn't prioritise it myself but you can also know that the data will all come out in discovery if you are in a legal issue. maybe that's not important, but people thought it did matter to give some protections to patient records, legal advice and therapy. you upload that to gpt and it goes into discovery.

by lukewarm707

6/17/2026 at 7:14:22 PM

right, and on prem being an option is a god send, however you manage to do it

it's not a recommendation, its an option. if you don't have capital then it doesn't apply to you and move on. it wasn't an option for even people with capital.

come back in a few years when its more accessible

additionally I like that there are providers with faster special purpose processors for faster tokens/sec, all at different pricing strategies

so just pick something that matches your personal risk tolerance

by yieldcrv

6/17/2026 at 11:54:59 AM

I gave GLM 5.2 a spin on openrouter yesterday and it was mostly fine but it racked up $5 in token use in 30 minutes of (relatively slow) work.

It's easily 4x the cost of DeepSeek V4 but I didn't actually feel the results were that much better. I had GPT 5.5 in Codex review it after it was done and there was plenty of slop to go around.

Having better luck with MiniMax M3, from a cost/benefit ratio.

by cmrdporcupine

6/17/2026 at 12:58:37 PM

I really like DeepSeek V4 Pro. It's pretty smart and I get so much usage out of it on a $20 Ollama cloud plan.

With a good harness, that's my favorite model for any personal project. I use Opus 4.8 at work because i don't have to pay for it and of course I love it, but DeepSeek is like 80% there for one tenth of the price.

by pjerem

6/17/2026 at 12:09:21 PM

Try MiMo-2.5, I'm having astonishing success with it in opencode for cents per day. Not even the pro model.

by zooming

6/17/2026 at 4:01:15 PM

I've found MiMo-2.5 is fun for front-end design since you can use its multimodal capabilities to drop in whatever it produced and correct it for you.

by spelk

6/17/2026 at 1:27:38 PM

> I had GPT 5.5 in Codex review it after it was done and there was plenty of slop to go around.

GPT can find fault in everything and anything including its own work.

by re-thc

6/17/2026 at 3:17:11 PM

AI review generally will find fault in anything. Any non-trivial code has multiple solutions with different tradeoffs. Any code can be over-engineered for theoretical edge cases and future use cases you don't need. No matter which solution you pick you can always at a minimum say that some alternative just looks and reads better.

Code is somewhat artistic. If you don't have well defined standards and priorities, the AI review cycle can spiral infinitely figuratively debating what makes art good, and your code will be no better for it.

by gbingles

6/17/2026 at 3:22:34 PM

This is correct, but I'd say there's something beyond that that's more specific about Codex + GPT models though. They've done some sort of training that makes it far more diligent about seeking out data races, unhandled errors / negative cases, and missing test coverage than the other models I've played with. It also seems more prone to testing its hypothesis.

This makes it slower to work with for prototyping, and it will, if not properly disciplined, litter your code with "legacy adapters" and "bridge code" and temporary incremental refactoring steps [arguably not terrible for work in real commercial software projects]. And it will create too many unit & integration tests, if you're not careful.

But it does, in my opinion, tend to produce more reliable software and I trust it far more than I did when I was working in Claude.

When I could afford it, I had both plans running, Claude to produce new features, and then Codex to brutally critique it battle test it, sharpen the edges, and produce better tests, and this flow went extremely well.

Now I just work with Codex and various open models.

by cmrdporcupine

6/17/2026 at 2:59:47 PM

That's what I love about it, and I wish I could find an open model that was as diligent.

Somehow it's just way more careful than the others, and also much better at empirical verification of its hypothesis, writing tests, etc. I am assuming a lot of RL done on that kind of flow, and on seeking out negative cases, failure points, race conditions.

by cmrdporcupine

6/17/2026 at 1:27:12 PM

DeepSWE “feels” like the right benchmark in comparison to Artificial Analysis indices and other coding benchmarks. And by their metrics, GPT-5.5 is still king in token efficiency, speed, and overall intelligence per dollar.

https://deepswe.datacurve.ai/

Fable 5 is cool and all, but we have not yet seen GPT-5.6.

by ttul

6/18/2026 at 7:30:39 AM

GLM5.2 isn't even on this benchmark

by slagfart

6/19/2026 at 5:36:15 PM

True. Z.AI ran that bench themselves and report 46.2, which is lower than GPT-5.5 and Opus 4.8, but crushing the other open weights models.

https://z.ai/blog/glm-5.2

by ttul

6/17/2026 at 10:37:37 AM

I've been playing with this model a fair amount over the last 24 hours, and I can confirm it's quite capable, while being a little bit verbose (I've seen it reconsider things 3-4 times in thinking traces before deciding on a path forward), and not being quite as good as GPT5.5 at working through complex abstract requirements.

Honestly it's good enough that I feel comfortable recommending a Z.AI sub + a $20/mo OpenAI sub for all but the most AI pilled multi-orchestrators, or the die hard Claude fans. GLM writing + GPT reviewing/debugging feels pretty unlimited and minimally worse than just doing everything in GPT with the $200/mo plan.

by CuriouslyC

6/17/2026 at 11:45:16 AM

> while being a little bit verbose

Discovered today that they set reasoning effort to max by default. So that’s probably why

by Havoc

6/17/2026 at 1:43:29 PM

> GLM writing

This is honestly what I care bout the most now, which is how well they can write. I think we have reached a point now, if you know how to program, you can provide enough information for the models to pretty much do what you need.

What they still struggle immensely with is the writing which has too many nuances but they are truly getting better.

by sdesol

6/17/2026 at 10:54:56 AM

After having got a taste of Fable 5 for me Opus 4.8 doesn't cut it any more -- and I don't know how to put this, I don't know if it's just me, but it's rhetorical flourishes are starting to really grate on me, never mind that it is at times deliberately weasel-wordy and economical with the truth until pressed. Opus 4.8 is definitely a stronger coding agent than DeepSeek 4.0 or Kimi 2.7 succeeding where they flounder and fail but its way of expressing itself conversationally is making me reconsider my subscription …

by igravious

6/17/2026 at 11:08:01 AM

You are not alone. How about GPT 5.5? Does it come close to Fable 5?

by elwebmaster

6/17/2026 at 11:22:25 AM

GPT 5.5 xhigh is smarter than Fable but Fable like Opus 4.8 as well is faster and seems more “agentic”. It’s easy to test this. Build a fairly complex software with Claude(opus or Fable).

Review the commits with both Claude and GPT 5.5 Xhigh. You can see that Fable is still sloppy(er) compared to GPT. You can test it the other way around as well(drive the dev with GPT and review with GPT and Claude). You get the same result Claude has an edge though and that’s on building more beautiful user interfaces.

by theplumber

6/17/2026 at 9:56:33 PM

I somehow take the opposite on almost everything here.

4.8 xhigh or max has a slight edge on 5.5 xhigh, for very complex logic perhaps it loses but it's just better in almost every other way, especially code quality. GPT is a slop machine outputs way too much and over-abstracts, plus its communication is so bad in comparison.

Fable was for sure a step above GPT, I tried them both against a few of the same hard tasks and it was not a small difference.

by nwienert

6/17/2026 at 11:12:52 AM

5.5 is pretty good. It's no Fable though. It is definitely better than opus tho.

by fragmede

6/17/2026 at 10:58:51 AM

This is my workflow. And then once a day I copy paste the code into the free Claude Sonnet so it comes out actually readable.

by andai

6/17/2026 at 11:26:42 AM

Knowing very little about how to run these, how close are we to medium or larger businesses starting to buy hardware to run models like this to keep the models local?

It’s expensive, and not as capable as the frontier models, but would have some pretty big benefits around privacy and agency.

by CubsFan1060

6/17/2026 at 11:54:00 AM

I know of multiple businesses in Europe that have been doing that for a while with 70B models, and are upgrading hardware to run the new crop of 700B-1T models (really started around Kimi K2, but buying and hosting that kind of hardware takes time)

Not everyone is willing (or even legally able) to send their trade secrets to OpenAI or Anthropic

by wongarsu

6/17/2026 at 2:06:20 PM

While certainly there are such cases with trade secrets, it's worth noting that even large banks typically have a provider like Azure or AWS onboarded.

There they can deploy these models while using the existing legal frameworks.

by user43928

6/17/2026 at 12:09:43 PM

What kind of hardware/price does it take to run those?

by CubsFan1060

6/17/2026 at 12:35:37 PM

Nvidia will sell you an entire server rack ready for inference. Or maybe you can roll out your own Blackwell based system.

We’re approaching a world where running a primer frontier model is possible on a workstation, probably will have something under $30k that looks like a desktop for Nvidia’s next generation. It sounds expensive, until you look at your Anthropic bill.

It’s similar unit economics as could computing for the open models. You can save a ton on the expenses by buying the hardware, but it requires a lot of in-house expertise, and you get the most value if you keep the system operating around the clock. The big kink is open models are usually 2 quarters behind frontier, and your competitors are probably trying to get access to mythos.

by bitmasher9

6/17/2026 at 3:01:44 PM

"approaching" is doing some work there. $30K today will get you 90-144GB usable VRAM with solid system RAM and disk and CPU. A single B200 chip at 180GB is $40K. Unfortunately that is nowhere close to being able to run a 750B param model. For something like that, we're getting closer to 1TB VRAM (8+ H200/B200), and then 1M context KV cache is many more GBs on top of that.

That's a $500K-$1M+ rig as of now. That's a lot of $200 subscriptions to break even, but reasonable if you are paying Anthropic $25/M tokens. Then of course there's the power, cooling, and maintenance to consider...

But yeah, I can see if the prices come down 10x in a few years, or crater after the bubble, $30-40k might get you a decent machine.

by program_whiz

6/17/2026 at 4:29:22 PM

> Unfortunately that is nowhere close to being able to run a 750B param model. For something like that, we're getting closer to 1TB VRAM

You don't have to run a model from VRAM, or even from a sizeable amount of RAM. These choices only ever make sense when serving the model at scale, to hundreds of simultaneous users or more.

by zozbot234

6/17/2026 at 5:45:35 PM

For workstation inference a unified memory architecture would be a good cost/performance balance, while keeping COGs reasonable.

512GB unified memory macs are available, with the ram upgrade costing a few grand.

by bitmasher9

6/17/2026 at 12:23:46 PM

For an 8-bit quant (what people call "near lossless") you are looking at something like 4xMI350X, which comes out to about $150k after adding the rest of the server. More if you go with Nvidia instead of AMD. More if you want more than maybe 8x concurrency

But prices are changing rapidly, and not for the better

by wongarsu

6/17/2026 at 11:32:34 AM

So far there seems to be one major use-case for complete privacy, and that is legal work. You don't need top of the line models to search vast amounts of text in discovery and it needs to be completely confidential. There's quite a few lawyers over on r/localllama showing off their multi-GPU builds. Coincidentally they also have the vast funding required for it.

by moffkalast

6/17/2026 at 11:58:28 AM

This is not a new situation. This was happening also when good vision models like alexa net were coming through, especially for OCR. Companies had choice between cloud or self hosting with GPUs. But turns out, problem is usage patterns.

Your usage will peak during certain timezone work hours(even if you are a huge multinational company most of your engineers/users tend to be from only a few locations), so then you have a bunch of gpus doing nothing the rest of the day. especially with latency sensitive stuff, this is a decades old tradeoff problem, its not unique to llms

by MikhailTal

6/17/2026 at 11:31:40 AM

Unless you have genuine national security concerns, you’d be better off just negotiating a commercial agreement with privacy protections with a couple of existing vendors.

by petesergeant

6/17/2026 at 11:50:18 AM

I think that's true until it isn't, which may end up being the problem. Fable/Mythos doesn't fall under the ZDR agreements with Anthropic. And I'm curious if others will follow suit.

by CubsFan1060

6/17/2026 at 11:54:37 AM

if you can afford the investment you get stable low costs for years with better security (at least if your cyber team is good). its even better in regulated industries where some vendors might add a premium for hipaa/soc/pci dss compliance to the point its a lot cheaper to self host. for a smaller business its not worth it and you should just use a hosted open model.

by tancop

6/17/2026 at 1:06:10 PM

> to the point its a lot cheaper to self host

I'm pretty skeptical, especially given typical utilization patterns. Do you have numbers, or this is just vibes?

by petesergeant

6/17/2026 at 11:44:16 AM

It’s a ~750B model so still a hell of a lot of vram

Would need to be a pretty determined medium biz

by Havoc

6/17/2026 at 11:28:46 AM

> how close are we to medium or larger businesses starting to buy hardware to run models like this to keep the models local?

Years.

Even Microsoft said they don't have enough for Github and need to call Amazon.

Getting a few even at decent prices is hard. Unless the shortages goes down...

by re-thc

6/17/2026 at 10:56:57 AM

> On the Intelligence vs. Cost per Task Pareto Frontier: GLM-5.2 is on the Pareto frontier of the Intelligence vs Cost per Task chart, with the lowest cost per task among models at its intelligence level. GLM-5.2 costs ~$0.46 per task, compared to GLM-5.1 ($0.25), Kimi K2.6 ($0.31), MiniMax-M3 ($0.18) and DeepSeek V4 Pro (max, $0.05)

am i missing something?

by tensegrist

6/17/2026 at 11:23:22 AM

I think they’ve just picked poor peer examples. Instead of choosing other models near 5.2 on the intelligence scale, they’ve picked some open models from further down the scale.

by OtherShrezzing

6/17/2026 at 8:42:43 PM

pareto frontier does not mean cheapest.

by acchow

6/17/2026 at 11:06:13 AM

Some models are heavily subsidized. Total params & active params are better measurement of inference cost.

by xiaoyu2006

6/17/2026 at 11:07:55 AM

No models are subsidised -- there are lots of third party hosting services that will still run at breakeven/profit. (except Deepseek after discount)

by simianwords

6/17/2026 at 12:32:10 PM

> No models are subsidised

We have no proof in either direction, it's not like we had access to their financial numbers in details.

And the pricing itself muddies the water, as input tokens that are already in the KV cache are practically free for the provider, whereas other tokens are expensive. So they could still make money overall thanks to people having multi-turn conversation (and as such, paying multiple times for the same token), but lose money on actual compute done.

> there are lots of third party hosting services that will still run at breakeven/profit.

How can you be sure that they are making profit directly from token price, and are not billing at marginal cost (i.e. electricity price, without counting the cost of the GPUs) and aiming to make a profit later on from the valuable training data that they are collecting in the process?

by stymaar

6/17/2026 at 1:43:19 PM

> How can you be sure

You are free to believe that they are doing all this. Or you can simply believe the intuition that models are getting cheaper by the day. I can run Gemma 4 31B from my laptop today.

by simianwords

6/17/2026 at 4:04:49 PM

> Or you can simply believe the intuition

Sure, you can believe you intuition as much as you want, but telling strangers over the internet that they are wrong because “I trust my intuition” is… awkward.

by stymaar

6/17/2026 at 4:27:50 PM

At some point it does come to intuition. Even if the companies IPO and share their financials, you can always argue that they might be lying.

by simianwords

6/17/2026 at 4:38:40 PM

Again, there's a difference between relying on intuition in your life (which we all do, lacking perfect information that would allow us to avoid relying on it), and telling someone they are wrong because your intuition says so.

by stymaar

6/17/2026 at 4:57:48 PM

> as input tokens that are already in the KV cache are practically free for the provider,

not at today's RAM prices.

by jgalt212

6/18/2026 at 7:08:16 AM

RAM price don't change anything. You can't fit an infinity of tokens in the KV cache, but the ones that are in there when a user request them are still practically free.

by stymaar

6/19/2026 at 1:14:36 AM

so the utility of the KV cache has nothing to do with available RAM?

by jgalt212

6/19/2026 at 4:08:34 PM

This proposition has no logical link with any of the above. Read again.

by stymaar

6/17/2026 at 11:49:15 AM

It's also third best overall on "AA-Omniscience Non-Hallucination Rate", far higher than DeepSeek, GPT 5.5 or Fable.

That's the one benchmark that allows LLMs to answer "I don't know" and punishes them for trying to bullshit their way through the questions

by wongarsu

6/17/2026 at 4:43:44 PM

GLM 5.2 is the first model we've tested that is unambiguously on par with, or better than Opus 4.6 (although as usual, we have GLM 5.2 and most other Chinese models a bit below most other benchmarks with more vulnerable test methodologies).

Data at https://gertlabs.com/rankings

by gertlabs

6/18/2026 at 1:40:19 AM

I really have to take your score with a grain of salt because Opus 4.5 does better than Opus 4.6

by nsoonhui

6/18/2026 at 4:43:31 AM

They're within confidence intervals of each other, but remember how much discussion there was that Opus 4.6 had been nerfed in March. We averaged samples over the entire lifetime of Opus 4.6, which likely served many different underlying checkpoints. Even the best version of Opus 4.6 was hardly an upgrade.

We find a lot of interesting anomalies with our benchmark that hold up under large sample sizes.

by gertlabs

6/17/2026 at 7:31:07 PM

[dead]

by minraws

6/17/2026 at 6:32:43 PM

I added it to my benchmark based on Mythos-reported bugs, and it's better than GLM 5.1, but still behind several other models, maybe most directly comparable to Qwen 3.7 Max. But, several other open models, including small self-hostable ones (Gemma 4 and Qwen 3.6), found the same number of bugs, 3 of 9. Though it also gets partial credit for reporting one bug in the right spot, but kinda misunderstanding the bug. I also added Kimi K2.7-code in the same run, and it did poorly, consistent with 2.6 performance. Anyway, there are better, cheaper, models on this particular benchmark.

https://swelljoe.com/post/will-it-mythos/

(This small benchmark doesn't prove anything. It's a limited data set and each model only gets one shot at each file in the corpus. But, I find it useful for quickly sussing out if a model can reason about pretty complicated problems in code.)

by SwellJoe

6/17/2026 at 8:51:57 PM

[dead]

by be7a

6/17/2026 at 10:45:09 AM

According to many benchmarks this model is straight up frontier level and Zai seriously cooked. Some of these numbers are incredible.

Excited to see if this turns out to be a Open Weight Opus 4.5 or better.

by kingstnap

6/17/2026 at 11:13:17 AM

The only benchmarks that matters is your actual task.

I've had models that benched poorly but performed great. And I constantly see models at near the top of AA, which are terrible.

There doesn't necessarily seem to be a lot of overlap between benchmarks and real world usage. (Let alone common sense!)

As far as they go, though, these harder benchmarks match my experience more closely:

https://deepswe.datacurve.ai/

and https://cognition.ai/blog/frontier-code

Where we see "top" models drop way down in score when given longer tasks.

That being said, I've had a reasonably pleasant time with GLM-5.2 so far. (And have had an OK time with DeepSeek as well.)

By the time I'm done testing all the Chinese models, they'll be obsolete :)

by andai

6/17/2026 at 3:25:15 PM

According to reports in this thread it is somewhere between Opus 4.7 and 4.8. This is effectively frontier.

by adastra22

6/17/2026 at 11:21:27 AM

In my tests[0] GLM-5.2 is not much better than GLM-5, and overall DeepSeek V4 Flash seems to be the better/more cost-effective choice:

[0]: https://aibenchy.com/compare/deepseek-deepseek-v4-flash-high...

by XCSme

6/17/2026 at 11:48:59 AM

I think the problem is, as can also be seen on other benchmarks, is that most models nowadays are focused more and more purely on tool calling and coding.

This means, that models are losing more and more general and domain-specific knowledge.

Look at those graphs on ARtificialAnalysis, GLM-5.1 still performs similarly or better:

AA-Omnisicence Accuracy: https://i.snipboard.io/5DYmpx.jpg

IFBench: https://i.snipboard.io/74kg0R.jpg

I still feel like models are not getting any smarter for a few months already, they just changed their training to be focused more on some areas than others, so shifting the intelligence from one place to another, not necessarily increasing the overall intelligence or "AGI" score.

by XCSme

6/17/2026 at 10:57:46 PM

Well, in that example it still seems the big players are increasing overall "intelligence" as Fable tops the list.

OpenAI has big incentives to improve general interligence as a large percentage of users use ChatGPT for support, finances, questions, etc. Not just coding.

by HDBaseT

6/17/2026 at 11:55:49 AM

man, i love dsv4-flash but i found its weaknesses in complex projects with multiple moving parts. tried kimi 2.6 and it understood and could work on the task. bigger is better..

by sourcecodeplz

6/17/2026 at 11:04:39 AM

This open source model is quite near SOTA with only 700B/40B MoE. Truly efficient.

by xiaoyu2006

6/17/2026 at 11:23:51 AM

So this basically means we will have a near opus level model able to be run locally in the next couple of months right?

QWEN 3.6 27b is already pretty good, but it should be possible to get a better option now that runs in the same hardware, right?

by Pragmata

6/18/2026 at 2:40:38 AM

So much depends on the thinking effort, it's almost meaningless to compare these models without specifying it. GLM 5.2 needs to run with max thinking effort to be competitive with the leading-edge models from OpenAI and Anthropic. That slows it down quite a bit in my experience. Meanwhile, those models have thinking-effort knobs of their own that make a big difference, especially in GPT 5.5's case.

I have been messing with an early NV4FP quant of GLM 5.2 and so far, that model in its Max setting outperforms GPT 5.5 on its default setting. But GPT 5.5 still pulls ahead once I crank up its own reasoning effort. I imagine the same is true of Opus 4.x but haven't pitted them against each other yet.

by CamperBob2

6/17/2026 at 11:38:52 AM

Which Opus?

GLM-5.2 is already close to Opus-4.7 level:

https://aibenchy.com/compare/anthropic-claude-opus-4-7-mediu...

by XCSme

6/17/2026 at 11:39:24 AM

Oh, or you meant a smaller model than GLM-5.2 with similar capabilities?

by XCSme

6/17/2026 at 1:11:12 PM

Probably not. Qwen3.(5|6)-27B seems like an "accidental freak". I'm not even sure they know what they did to create that. A decent amount of the team members left after that, so unfortunately, we might not be seeing another small model that packs such a punch for a while. Hopefully the team is studying their entire training recipe for that and is able to replicate. If they are, then a 50-70B dense model might give us such capabilities...

by segmondy

6/17/2026 at 7:02:33 PM

Gemma 4 is competitive with Qwen 3.6. I had vague feelings that Qwen was better at coding tasks, based on anecdotes and public benchmarks, but I've been doing some benchmarking lately, and Gemma 4 31b is consistently beating Qwen 3.6 at the really hard stuff (finding hard security bugs, vision tasks for fixing UI layout or categorizing assets, in particular..and for vision, nothing self-hostable beats Gemma 4 12b, including 31b).

I'm still hoping for a bigger Gemma 4 version, but I think they may be worried about competing with their own hosted models, since Gemma 4 is already better than a lot of Google's proprietary models that are still available in AI Studio.

But, it is a shame that Qwen probably won't be doing more open models going forward. It is really strong for its size.

by SwellJoe

6/17/2026 at 11:51:16 AM

Yep! I'm running things locally on a RTX5080 + RTX1060 + 64GB DDR5 ram, and would love to get a more capable model if possible!

QWEN3.6 27b is pretty good, but i can still notice some spots where it's not as good as the frontier models.

by Pragmata

6/17/2026 at 1:07:47 PM

Why wait for the next few months? There are plenty of better models that you can run today locally. Qwen3.5-397B beats Qwen3.6-27B. MiniMax2.7 is a longrun horizon monster. (I haven't given 3 much of a try yet). KimiK2.6/2.7, MiMoV2.5/MiMoV2.5-Pro and GLM5.1 will wreck Qwen3.6-27B any day on any task.

by segmondy

6/17/2026 at 1:16:24 PM

[dead]

by Pragmata

6/17/2026 at 1:29:17 PM

Just ran and scored 63 3d model generations (via code) across high and no reasoning. 3D Modeling benchmark quickly shows spatial, logic and code performance of the model so I think it's a very good indicator of the quality.

Here are the results compared to Gemini 3.5 Flash:

    Model + config          CodeErr/gen   Cost/gen   Median time   Quality
    gemini-3.5-flash, low      0.71        $0.18        68s       baseline
    GLM 5.2, reasoning high    0.61        $0.18       289s         -6.0%
    GLM 5.2, reasoning off     1.52        $0.10       126s        -13.6%

Although it is cheaper, it is significantly slower, and results are worse overall. Surprisingly - high reasoning produces less code errors than gemini 3.5 flash, but when I actually look at the models they are worse.

Edit: I recently ran evals with Kimi 2.7 and MiniMax-M3 and this is clearly open source SOTA model, by far.

by ponyous

6/17/2026 at 1:46:18 PM

Very interested in this! Can you share more about the modelling method (eg, three js?), the task list, and outputs here?

I think there's probably some good juice to squeeze in terms of spacial awareness by doing a benchmark something like

- give 3d modelling task

- render and snapshot from a variety of angles

- feed to third-party vision model for a "what is this" type query

- grade on end-to-end accuracy

Bonus points for asking the vision model something like "how beautiful is this 1-10".

by NiloCK

6/17/2026 at 2:26:36 PM

I don't have the eval results live yet, so I cannot share them yet.

I was benchmarking using a soon to be released new version of my AI CAD modeling software[0]. It's basically an agent that has access to tools that can execute build123d scripts, get sculpted models, blender to combine sculpts + parametric models, tools to inspect the model (visually and with code), search datasheets, ...

I tried what you recommend a while ago (asking an AI to evaluate using different angles) and the AI evaluations were extremely bad - barely any correlation to what I scored. Things have gotten better, but I don't trust it enough yet.

Here is how I score adherence (and how AI did as well, but I tried methods where it would just give back a boolean "pass" or not):

    <0.2 → Poor – Misses core intent; largely irrelevant or incorrect.
    <0.4 → Weak – Partially relevant; significant omissions or errors.
    <0.6 → Fair – Covers main points but lacks completeness or precision.
    <0.8 → Good – Mostly accurate; minor gaps or deviations.
    <=1.0 → Excellent – Fully aligned; precise, comprehensive, and faithful to intent.

Here is the scenario list (prompts are much more detailed):

    dragon-bottle-stopper
    editing-param-mid-conv
    editing-parametric-enclosure
    editing-swap-material-param
    editing-text-edit-cube
    multi-turn-bird-house
    multi-turn-dice-tower
    multi-turn-modular-planter
    multi-turn-phone-stand
    multi-turn-shelf
    one-shot-bookend
    one-shot-cable-clip
    one-shot-chess-queen
    one-shot-coaster
    one-shot-coffee-cup
    one-shot-dog-tag
    one-shot-dragon-figurine
    one-shot-hex-bracket
    one-shot-keychain-fob
    one-shot-low-poly-tree
    one-shot-pegboard-hook
    one-shot-pi4-case
    one-shot-threaded-jar

[0]: https://grandpacad.com

by ponyous

6/17/2026 at 4:01:11 PM

Very cool project. Thanks for sharing!

by NiloCK

6/17/2026 at 2:25:47 PM

Would you be able to run it against Gemini Flash (not Lite) 3.0, high thinking?

by ComputerGuru

6/17/2026 at 2:28:17 PM

Absolutely. Running it now, will update this comment in about 30 mins.

Edit: Surprisingly very good results with 3.0 flash with high thinking.

Cost: $0.06

Duration: 3.22 min

Code Errors: 1.3 per attempts (meaning on average it had to retry 1.3 times)

Adherence was on par with 3.5 flash Low thinking

by ponyous

6/17/2026 at 2:32:13 PM

Thanks! I’ve still been using 3.0 a lot, the price-to-performance ratio absolutely kills compared to Google’s other and newer offerings.

by ComputerGuru

6/17/2026 at 10:57:28 AM

Correct me if I'm wrong, but neither DeepSeek nor GLM have image input modality. This makes them less useful when looking at UIs, photos, screenshots, etc. doesn't it? Or do they have alternate ways of doing so?

by rahidz

6/17/2026 at 1:24:41 PM

DeepSeekv4+ will have image capability, they said so in their paper. GLM whenever they decide to. Both companies have they tech and for whatever reason haven't decide to prioritize it. Both of their OCR are SOTA among all OCR models closed or open. GLM demonstrated they know how to do this, with GLM-4.6V.

by segmondy

6/17/2026 at 11:26:22 AM

Yes, you are right (as far as I'm aware). For things where you need the LLM to look at screenshots, photos or other images you can use Kimi-K2.6/K2.7 - comparable pricing, somewhat comparable performance and quality. You can even probably combine two models (e.g Kimi and GLM) in one agent, using Kimi for multimodal inputs and GLM for everything else, although 1) I'm not sure if this will not cause some kind of context poisoning with low-quality patterns for better performing model (e.g. in some cases Kimi may be worse than GLM, but GLM, when following up, may adopt the same reasoning patterns as Kimi, undermining it's own performance), and 2) I'm not quite sure if it's possible with the tools currently available (I'm not really into agentic or chatbots stuff to be honest).

by dryarzeg

6/17/2026 at 11:16:30 AM

They do not and it sucks for certain tasks.

It also means that if they actually trained with vision, they'd be on par with Anthropic models as vision seems to improve model performance across the board even for non-vision tasks.

by mordae

6/17/2026 at 11:46:22 AM

Many other open source models have vision but they don't compare to GLM in terms of coding quality. So I don't think it's because of vision that the frontier models are better, it's more that they are probably just much bigger models.

by osti

6/17/2026 at 2:34:39 PM

it helps giving them a cli vision tool (curl to openrouter vision model for example)

by freigeist79

6/17/2026 at 11:16:45 AM

That's right, but there are other recent open weights and relatively big LLMs that are multimodal, e.g. MiniMax-M3.

With open weights LLMs, it is affordable to use many different models, each for whatever it is better.

Moreover, for analyzing "UIs, photos, screenshots, etc." there are small models that can be run locally on smartphones or laptops, e.g. IBM granite-vision-4.1-4B, certain Google Gemma 4 variants and certain Qwen variants, whose output you can use as input for a big LLM, in order to accomplish some more complex task.

by adrian_b

6/17/2026 at 1:48:45 PM

Configure a subagent in your coding harness for vision, add a prompt about the vision use, configure a vision model for it, modify your main agent's prompt to use the vision subagent for vision tasks. Now your non-vision model has vision support.

by 0xbadcafebee

6/17/2026 at 11:46:35 AM

They have a separate VL model but never tried it

by Havoc

6/17/2026 at 3:05:24 PM

Fun fact: Zhipu aka Z.ai, Knowledge Atlas etc., the company that made GLM, is listed on Hong Kong stock exchange, is up over 10x since the IPO at the beginning of this year.

by osti

6/17/2026 at 10:48:48 AM

I like their models, super cheap - I'm a Lite plan subscriber, and subjective performance seems to be same as lower Anthropic models, useful for lots of grunt work. The problem is that Ziphu really __really__ struggle with capacity - everyone is complaining of timeouts or very slow speeds. I can't get direct access to the model though I see it is in OpenRouter so I may play. But the capacity issues means DeepSeek is my main provider these days

by davidwritesbugs

6/17/2026 at 11:11:40 AM

I am helpful.

DeepSeek V4 has been quite amazing in our workloads and it operates at a fraction of the cost. I have not tried GLM 5.2 but it seems that it hits a sweet spot.

by _pdp_

6/17/2026 at 1:36:05 PM

Your system prompt is showing.

by LUmBULtERA

6/17/2026 at 2:37:51 PM

Maybe he meant "hopeful"...

by kreddor

6/18/2026 at 10:42:13 AM

This is exactly what I meant - sorry honest typo.

by _pdp_

6/18/2026 at 1:47:03 PM

haha fair enough. Sorry!

by LUmBULtERA

6/17/2026 at 2:46:41 PM

GLM 5.2 feels like Opus 4.6 level. I actually think 4.6 and GLM work better in practice than opus 4.7 or 4.8 as I find both of those more erratic and seem to randomly have a super dumb turn. That random bad turn I see doesn't seem to be hitting the benchmark scores but they make 4.7 and 4.8 very hard to use for me. GLM is more stable like opus 4.6

by leemoore

6/17/2026 at 11:08:15 AM

I've made a comment before that 5.1 will sometimes get stuck looping over a simple decision or statement. It will basically contradict and then not realize that one option is the definite option. Sometimes it's two statements that aren't even exclusive. Nonetheless, a lot of tokens that get wasted from this.

I haven't extensively used 5.2 yet, but it seems a lot better.

by ramon156

6/17/2026 at 1:09:53 PM

For anyone who's interested, I've put together a simple site for sharing ratings/opinions on models at a task-specific granularity. https://model.reviews/

The idea is that benchmark score comparisons are useful for a large cross-product comparison across models + their settings, but less useful if you're looking for the best model for <your-specific-task>. So I thought having a place to review and comment could be beneficial to people.

I'm not sure how best to get the corpus bootstrapped (i.e. people will likely only visit/post on the site if there's already activity), so posting it here for anyone who'd like to contribute.

by m-dot-reviews

6/18/2026 at 12:11:48 AM

I get a 500 when clicking “Explore the Models”

by swingboy

6/19/2026 at 2:34:59 AM

Oops, thanks for telling me that. I think the issue should be fixed now.

by m-dot-reviews

6/17/2026 at 11:56:52 AM

I have been trying out GLM 5.2 and I am really impressed by it for the most part.

To all people on Hackernews, I am curious as to what agent harness are you using it with.

Previously I was using opencode and then I switched to using Opencode + obra/superpowers and creating custom skill.md themselves for it. I found things to take more time and intervene more but the result of it has been that I have found it to work better.

Now I have also started using oh-my-pi as well and I found it to be faster compared to Opencode.

I am unsure how much of there is a difference to it and how much of things are placebo but what is your opinion regarding the best Agent harness for GLM 5.2?

by Imustaskforhelp

6/17/2026 at 1:50:27 PM

I just used CC with GLM, I was satisfied.

by Alifatisk

6/18/2026 at 3:41:17 AM

I code daily with AI - real programming tasks, professional, real work, read customers, I use below 3:

- codex 5.5 medium - best results less hand holding medium speed

- opus 4.8 max - mediocre with hand holding medium speed

- glm 5.2 max - mediocre with hand holding and super slow

- composer 2.5 - mediocre with hand holding and super fast

I use all, since i run mulitple coding in parallel. disclosure - I use rexide which we created for all these agents to run in parallel with good visibility and feedback.

by tomerbd

6/17/2026 at 8:30:06 PM

Launch announcement from four days ago: https://news.ycombinator.com/item?id=48518684

The requirements to run this model locally: https://www.reddit.com/r/LocalLLaMA/comments/1u8ai2a/glm52_i...

by redbell

6/18/2026 at 9:10:29 AM

Z-ai/GLM’s KV caching technology is truly impressive; the implicit cache hit rate of its official API exceeds 95%, far surpassing other APIs that support implicit caching, such as Gemini and Qwen. I’ve been pondering the architectural design behind this, though I haven't yet formed a fully coherent theory.

by bizer

6/17/2026 at 12:34:29 PM

FYI.. This is coming with 3mil GLM 5.2 tokens right now. (Needs login. Google SSO fine) https://zcode.z.ai/en

by dizhn

6/17/2026 at 2:08:21 PM

Where can I read more about the coming 3mil GLM 5.2?

by Alifatisk

6/18/2026 at 12:05:40 AM

I meant the credits are included in the application you download from there. Install, log in (via google) and You'll get glm 5.2 + turbo. Mine actually either got replenished or they are not checking by login because they are full after installing at a second desktop.

by dizhn

6/18/2026 at 6:18:47 AM

Oh, cool

by Alifatisk

6/17/2026 at 8:06:52 PM

Seems really good at frontend work, and as a result on remotion programmatic videos. Not the best yet, thats still Gemini 3.1 pro(trained on actual videos) or Fable, but often better than what Opus can come up with

https://mesmer.tools/benchmarks/ai-video-generation

by mesmertech

6/17/2026 at 12:31:08 PM

The problem with these benchmarks is that the Chinese models tend to be incredible on paper, and absolutely terrible in practice :/

by JustSkyfall

6/17/2026 at 1:34:30 PM

This was a problem with older Qwen/MiMo/Kimi models mostly. GLM has always been on the more robust side, and newer iterations from all those labs have improved as well. The only lab I've seen regressing this way is DeepSeek, 3.2 was fairly robust but 4.0 feels more benchmaxxed.

by CuriouslyC

6/17/2026 at 12:51:06 PM

I have used GLM since version 4.8 I think and do enjoy using them. More then other models like Kimi or Deepseek. Though only tested them on smaller private projects.

by Mashimo

6/17/2026 at 2:05:55 PM

> I have used GLM since version 4.8 I think

You probably refer to GLM-4.7

by Alifatisk

6/17/2026 at 1:31:30 PM

I beg to differ. I replaced a $40/mo GitHub Copilot subscription where I used Opus 4.6 and GPT 5.5 with a $10/mo opencode Go plan where I use mostly DeepSeek V4 Flash and testing MiMo 2.5.

I work on mid-sized projects currently (200k to 1kk lines of code).

by bel8

6/17/2026 at 2:03:27 PM

> 1kk lines of code

Isn't that a million?

by Alifatisk

6/17/2026 at 2:23:09 PM

Yep. I consider up to a million lines of code as mid-sized.

When I worked in banking, the codebases were often larger than a million.

by bel8

6/17/2026 at 1:19:56 PM

You are obviously lying because it shows you have no experience with. GLM since 4.5 have been crushing it. all their models since then haven't skipped a beat. 4.5/4.5-air, 4.6, 4.7, 4.8, 5, 5.1. That aside, MiMoV2.5, MiniMax from 2.0, DeepSeek from V3, Kimi since V2, Qwen since 3, Hy3 have all been amazing models. All from China, we need to get over it. China is not losing yet as far as the AI race is concerned.

by segmondy

6/17/2026 at 2:04:55 PM

Is there a GLM-4.8 model?

by Alifatisk

6/17/2026 at 1:04:16 PM

[flagged]

by jingpostmedia

6/17/2026 at 8:03:04 PM

Also so wild that it's relatively compact. 753B-40A is so reasonable, shows incredible scaling in what the model can do, without just throwing heaps of new parameters in.

This is silly but I dig how 753 is very close to 745, which is the watts in a HP. 1bHP parameter model. Silly, but I enjoy it.

by jauntywundrkind

6/17/2026 at 1:39:54 PM

These open source models need better multi-turn capabilities. They are always lacklustre in "agent mode". Whether it's just less RL, whatever, it's a worse "product". Whereas it feels like the frontier labs have been all-in on "agentic" multi-turn reasoning for a long time now.

by alansaber

6/18/2026 at 7:03:00 AM

They've come along pretty far now.

I remember when there was hype around GLM 5 reaching great heights on benchmarks but eventually failing on practical coding and reasoning tasks. I guess this time the hype is real.

by gauravvij137

6/17/2026 at 5:11:22 PM

It's probably a good model but they used GLM 5.1 to code their infra.

I signed up to their max plan yesterday, did some light coding work, and i'm at 180M tokens used and 40% weekly quota gone.

Even when tokenmaxxing on the Claude Max or GPT $200 plan, i couldn't get more than 20% quota gone per day.

by guybedo

6/17/2026 at 6:03:38 PM

Are you using it for long context windows? I burn through my 5hr quota with GLM almost instantly on 200k+ contexts, but if I reset every ~100k or so it's much more manageable.

by bigyabai

6/18/2026 at 4:24:21 AM

Before you go and sign up to the max plan like I did, they are obviously struggling for capacity. I'm getting API rate limited and 429'd on a simple "hello"

by aunty_helen

6/17/2026 at 1:51:08 PM

what is that moodboard and chart of hypertension in the middle of the article that isn't explained?

This is a great step up in open models however the pricing to support z.ai is not far cheaper than Claude / OpenAI subscription

by robertwt7

6/17/2026 at 5:21:03 PM

I'm curious what harness everyone is using for these? I want to start to test some of these open models but don't know what tools people use to get these working "agenticaly"

by daniban

6/17/2026 at 9:43:51 PM

I am using OpenCode with the DeepSeek API with some pretty good results.

by gorbypark

6/17/2026 at 6:05:09 PM

pi.dev and ask ai to add features you miss from claude or codex. i configure keyboard shortcuts and swap models easily

by zackify

6/17/2026 at 2:20:26 PM

I have a question, as it happens: Do you think the benchmarks and models were trained on benchmark datasets to skew the results, even though in real-world applications we realize they're not that great?

by RDTvlokip

6/17/2026 at 5:09:22 PM

Recent incident with the Rio 3.5 model clearly shows that many coding models are specifically trained/fine tuned for the benchmarks.

by sinuhe69

6/18/2026 at 3:01:05 PM

That's what I thought

by RDTvlokip

6/17/2026 at 3:06:22 PM

Hmmm... GLM insists it's Gemini.

https://github.com/zai-org/GLM-5/issues/79

by hereme888

6/17/2026 at 6:15:31 PM

Claude Sonnet 4.6 identified itself as DeepSeek repeatedly: https://www.reddit.com/r/DeepSeek/comments/1rd5jw7/claude_so...

I tested this myself a few months ago, and confirmed that it was really happening.

LLMs don't know who they are unless the system prompt tells them, and as all of them are trained on model responses that exist on the web that end up being scraped, the weights may predict a certain incorrect response. LLMs have no ability to introspect, and do not know anything about themselves, so they will hallucinate in response to that question unless they are carefully trained on that exact, pointless question.

by coder543

6/17/2026 at 9:37:34 PM

[flagged]

by killix

6/17/2026 at 3:46:14 PM

It's a surprisingly common misconception that models contain any metadata at all about themselves in their weights. If you ask them, "What model are you?" they either retrieve the answer from the system prompt, or they hallucinate an answer. Same goes for questions about knowledge cut-off, how many parameters they have, the source of their training data, etc.

by bityard

6/17/2026 at 4:21:04 PM

Huh. That kinda makes sense. So you think it's hallucinating it's model name?

by hereme888

6/17/2026 at 8:27:23 PM

Yes, it definitely is.

by bityard

6/17/2026 at 3:19:09 PM

Then why does it score better than any Gemini model?

by adastra22

6/17/2026 at 3:43:58 PM

As I understand, some people tend to "distill" LLM models. Google hasn't released a new Pro version in a while. I'm not an expert in LLMs.

by hereme888

6/17/2026 at 11:03:35 AM

It's a real step forward, getting closer to SOTA. It seems to be very epistemically cautious in its reasoning. I hope Deepseek and the other open-weights labs stay in the game and catch up too.

by creamyhorror

6/17/2026 at 2:43:50 PM

I asked z.ai what z.ai is, and it said "It seems you might be referring to xAI, as "z.ai" isn't a widely known or major AI company or platform at this time."

by jayess

6/17/2026 at 1:26:56 PM

This is really held back by one bench (omniscience accuracy) where it's really very far behind otherwise i think it's got at least a couple of points higher.

by KaoruAoiShiho

6/17/2026 at 11:54:28 AM

Ok, it is nice to see another great open source model. Not sure what to think of all these benchmarks but GLM was already quite strong before so an update is very welcome.

by hit8run

6/17/2026 at 1:34:18 PM

Regrettably I haven’t tried 5.2 yet but 5.1 I did not see as anything special. In practice I found it to be ~70% as good as Claude sonnet.

by Computer0

6/17/2026 at 9:05:48 PM

I'm a bit shocked that GLM 5.2 is not multimodal. Like, how should I use it? I use images all the time.

by PetrBrzyBrzek

6/17/2026 at 1:38:11 PM

DeepSeek v4 pro is still 10x cheaper than GLM-5.2 and the quality is still enough for 95% of coding tasks.

by piterrro

6/17/2026 at 1:44:07 PM

People always say stuff like this, but it is misleading. The reason it's misleading is because that remaining 5% makes a huge difference, and is where most of the value of using AI agents lies.

I'm not interested in using AI to write code that would have taken me 5-10 minutes to write myself. I use AI to debug complex bugs and develop large features that span multiple domains - stuff that normally takes hours, if not days/weeks. A model that is "enough for 95%" does not cut it for that, because the failures compound during long-horizon tasks and the thing becomes a mess.

by enraged_camel

6/17/2026 at 5:13:45 PM

I get what you mean. But for many people, AI coding is not about solving complex problems. No, they do it mostly themselves. AI coding for many is a productivity tool, where it helps you with mundane, but laborious tasks.

In my setup, I use a daily workhorse for such things. They should be fast, cheap and reasonably working well. I don’t expect it to be smart, but need it to follow instructions perfectly and handle tool calling well.

For architectural work or debugging help, I use the top models instead.

That works reasonably well for me with a low cost.

by sinuhe69

6/17/2026 at 1:46:59 PM

....so use DeepSeek v4 Pro for 95% of your coding tasks, and GLM 5.2 for the other 5%? You don't need to stick to one model.

by 0xbadcafebee

6/17/2026 at 10:33:45 AM

It’s pretty good. More talkative than 5.1. Reminds me of deepseek 4

Their servers are melting though - getting more timeouts etc

by Havoc

6/17/2026 at 11:54:39 AM

Sure, but whatever you do, don't buy their (Z.ai) lite plan.

I feel like i threw 15 dollars in the sea. I'm getting rate limited after 3-4 prompts. You get way less value than just paying 25 dollars for Claude or OpenAI models.

by eckelhesten

6/17/2026 at 12:30:08 PM

How are you using it? I have the lite plan and I've only ever maxed my weekly usage a few hours before reset. I will concede that I'm not a super heavy LLM user but it's been really good for me.

My workflow is usually:

- read file. I want to achieve X, how do? Do not implement anything.

- I would do a, b and c

- sketch a brief implementation of your suggestion

- <code> (not writing files yet)

- instead of your approach x, wouldn't it make sense to instead do z? What would that look like?

- <code>

- nice, implement this

- starts writing files, run tests, etc.

by granra

6/17/2026 at 1:08:01 PM

Try pointing it to a small codebase, or even ask it to conjure information found online.

You'll see that it quickly gives up. Thing is, they seem to count cached hits as if they were the non-cached tokens.

I wont be subscribing again thats for sure. I am not paying iPhone money for a Xiaomi.

by eckelhesten

6/17/2026 at 4:35:37 PM

That's what I've been doing. I use crush normally. While the codebase are by no means huge, they're not tiny either.

by granra

6/17/2026 at 5:19:19 PM

Are you using it in an agentic workflow? Just reading the codebase will consume a lot of cached tokens, but seemingly, z.ai counts these as normal input tokens the way they're rate limiting.

by eckelhesten

6/17/2026 at 5:36:55 PM

I'm not entirely sure what an agentic workflow could mean today but I think so. I use a coding agent (crush), prompt it to brainstorm an implementation with me (or sometimes I know exactly how I want to implement it but ask it to challenge it), correct any wrong assumptions or request the implementation to look differently than suggested if I don't like it. Then finally when I'm positive I've cleared the most important assumptions I ask it to actually write and edit files and run tests and such (this just ends up being a "implement this").

With any model I've tried I've found it to be a huge pain to have it fix things where it made a wrong assumption without the code becoming a mess and burning a lot of tokens. I'm aware that not everyone works like this but I'm still very opinionated on what the end result should look like so I can still work on it without an LLM.

by granra

6/17/2026 at 1:46:35 PM

Did you consider their peak hours and model usage multiplier? Read the green box https://docs.z.ai/devpack/overview#usage-instruction

I had the Lite plan, I NEVER maxed out the quota because I considered these things. If I, for example, switched over to GLM-5-Turbo, then I could've easily burned through quota.

by Alifatisk

6/17/2026 at 5:17:15 PM

I just read it and honestly it left an even worse taste in my mouth.

>GLM-5.2 and GLM-5-Turbo are advanced models designed to rival Claude Opus model. Its usage will be deducted at 3 × during peak hours and 2 × during off-peak hours.

Claude certainly does not punish me for using their best models. Why should this "up and coming" company do it?

I thought the up and coming ai companies was supposed to have some kind of leverage in terms of price/performance (see deepseeks insanely cheap V4 flash and pro).

by eckelhesten

6/17/2026 at 5:23:10 PM

With a claude code plan, can you generate as many tokens with Opus as you can with Haiku before filling your 5 hour window? The same is going on here.

by granra

6/17/2026 at 12:53:21 PM

Open-weight models are winning. The gap with closed models is now measured in months, not years.

by zftnb666

6/17/2026 at 11:43:17 AM

I tried it today through Openrouter and the API is atrocious. I got multiple rate limit and random errors every turn.

Somebody wrote [1]; "I am never touching Minimax or GLM again. Their APIs had constant outages and I had to restart my runs multiple times — after burning money on the runs that failed midway." and I 100% agree.

The model might be good, but if the API is so bad, it's effectively useless.

[1]: https://kasra.blog/blog/i-spent-1500-seeing-if-llms-could-ha...

by kissgyorgy

6/17/2026 at 1:02:55 PM

The entire point of this post is that it's open weights, you can run it yourself and don't have to deal with the API issues. You really do have that choice.

by segmondy

6/17/2026 at 2:17:01 PM

You could subscribe to Anthropic/OpenAI for the rest of your life for the cost it would take to host GLM5.2 locally - you need 1.5TB of VRAM just for the weights

by ac29

6/17/2026 at 2:29:49 PM

You don't need that much VRAM unless you're targeting a high-performance deployment that's intended to scale far beyond local use. For a lower-throughput case, you can keep the model weights on SSD at very low cost and stream them in for inference. This could actually scale reasonably well if you have something as simple as a previous-gen HEDT with a decent amount of PCIe lanes to host fast storage from.

by zozbot234

6/17/2026 at 11:47:52 AM

That’s what happens when you offer something decent at a fraction of the price of opus - more demand than you can serve

by Havoc

6/17/2026 at 2:28:40 PM

Give it a few days and additional provider will be up and available on OpenRouter. Then the game of figuring out who’s not nuking the weights and neutering the quantization begins.

by ComputerGuru

6/17/2026 at 11:49:26 AM

I indeed got a few timeouts yesterday using the official API, I imagine for the coding plan users it'll be even worse.

by osti

6/17/2026 at 10:36:20 AM

> GLM-5.2 sits off the most attractive quadrant on the Intelligence vs Output Tokens chart.

That is unfortunate...

by nh43215rgb

6/17/2026 at 8:48:42 PM

There's only one GLM in my heart: the one that includes vec3.hpp

by blt

6/17/2026 at 11:04:47 AM

Cerebras really needs to have this on their API list (if they even still exist).

by lousken

6/17/2026 at 11:09:57 AM

they went public a few weeks ago

by Marciplan

6/17/2026 at 11:40:35 AM

That's cool and all, but they are still on GLM 4.7

by lousken

6/17/2026 at 2:04:36 PM

Which is fine for their target market. Their latest model is Kimi K2.6, available to enterprise customers. But older models become more powerful when you have time to do more reasoning. Also many applications don't need advanced models. Cerebras is making bank from all the other use cases that SOTA providers left on the table by focusing on 0-shot intelligence over speed

by 0xbadcafebee

6/17/2026 at 12:06:48 PM

1m context btw.

by sourcecodeplz

6/17/2026 at 1:49:47 PM

And apparently, actual support for 1M context window, not just theoretical.

by Alifatisk

6/19/2026 at 5:24:26 AM

Mark my words, by the end of 2027, there will be an open weights model that is better than anything OpenAI and Anthropic are capable of making. They will lose at inference scaling too.

by casey2

6/17/2026 at 4:37:10 PM

why do not all open source LLM's have open weights like this model?

by adithyaharish

6/17/2026 at 6:43:33 PM

https://en.wikipedia.org/wiki/Artificial_scarcity

by bigyabai

6/17/2026 at 4:47:15 PM

"open source" means that the code itself (for LLMs - this is training code) is available to the general public. "open weights" means that the weights (trained over time) are available publicly, rather than locked behind a paywalled chat. I do not know of an open source LLM that is not also open weights (unless they never bothered training it). Models like Claude and Gemini are neither open source, nor are they open weights.

by Retro_Dev

6/18/2026 at 11:36:32 AM

Got it, thanks for the thought

by adithyaharish

6/17/2026 at 2:45:47 PM

It is a very useful model

by hyqzz8

6/18/2026 at 3:02:52 PM

Which American model did they distill this one from?

by catigula

6/17/2026 at 11:46:03 AM

looks like I need a GB300 workstation

by dsrtslnd23

6/17/2026 at 10:37:06 PM

[flagged]

by hottrends

6/17/2026 at 4:01:01 PM

[flagged]

by maxothex

6/17/2026 at 1:42:51 PM

[flagged]

by Asfand3099

6/17/2026 at 10:52:58 AM

[flagged]

by mohsen1

6/17/2026 at 11:41:10 AM

A yes, the stealth advertisement post ...

by benjiro29