GLM-5: Targeting complex systems engineering and long-horizon agentic tasks

2/11/2026 at 5:47:57 PM

Pelican generated via OpenRouter: https://gist.github.com/simonw/cc4ca7815ae82562e89a9fdd99f07...

Solid bird, not a great bicycle frame.

by simonw

2/11/2026 at 6:01:59 PM

Thank you for continuing to maintain the only benchmarking system that matters!

Context for the unaware: https://simonwillison.net/tags/pelican-riding-a-bicycle/

by btown

2/11/2026 at 7:38:40 PM

It's interesting how some features, such as green grass, a blue sky, clouds, and the sun, are ubiquitous among all of these models' responses.

by gabiruh

2/11/2026 at 9:44:37 PM

If you were a pelican, wouldn't you want to go cycling on a sunny day?

Do electric pelicans dream of touching electric grass?

by btown

2/12/2026 at 3:37:38 AM

Do electric pelicans dream of touching electric grass?

That would be shocking news to me.

by Magniquick

2/12/2026 at 7:49:39 AM

Please leave the Internet :)

by davidwritesbugs

2/11/2026 at 9:49:04 PM

It is odd, yeah.

I'm guessing both humans and LLMs would tend to get the "vibe" from the pelican task, that they're essentially being asked to create something like a child's crayon drawing. And that "vibe" then brings with it associations with all the types of things children might normally include in a drawing.

by derefr

2/11/2026 at 8:55:29 PM

They will start to max this benchmark as well at some point.

by l_eo

2/11/2026 at 10:25:50 PM

It's not a benchmark though, right? Because there's no control group or reference.

It's just an experiment on how different models interpret a vague prompt. "Generate an SVG of a pelican riding a bicycle" is loaded with ambiguity. It's practically designed to generate 'interesting' results because the prompt is not specific.

It also happens to be an example of the least practical way to engage with an LLM. It's no more capable of reading your mind than anyone or anything else.

I argue that, in the service of AI, there is a lot of flexibility being created around the scientific method.

by ljm

2/11/2026 at 10:32:56 PM

For 2026 SOTA models I think that is fair.

For the last generation of models, and for today's flash/mini models, I think there is still a not-unreasonable binary question ("is this a pelican on a bicycle?") that you can answer by just looking at the result: https://simonwillison.net/2024/Oct/25/pelicans-on-a-bicycle/

by tylervigen

2/12/2026 at 7:58:53 AM

RLHF (reinforcement learning from human feedback) is to a large extent about resolving that ambiguity by simply polling people for their subjective judgement.

I've worked one an RLHF project for one of the larger model providers, and the instructions provided to the reviewers were very clear that if there was no objective correct answer, they were still required to choose the best answer, and while there were of course disagreements in the margins, groups of people do tend to converge on the big lines.

by vidarh

2/11/2026 at 10:36:56 PM

So if it can generate exactly what you had in mind based presumably on the most subtle of cues like your personal quirks from a few sentences that could be _terrifying_, right?

by interstice

2/12/2026 at 11:19:29 AM

Simon has written a page specifically for you: https://simonwillison.net/2025/nov/13/training-for-pelicans-...

by 9dev

2/12/2026 at 1:01:04 AM

This is actually a good benchmark, I use to roll my eyes at it. Then I decided to apply the same idea and ask the models to generate SVG image of "something" not going to put it out there. There was a strong correlation between how good the models are and the image they generated. These were also no vision images, so I don't know if you are serious but this is a decent benchmark.

by segmondy

2/12/2026 at 9:48:54 AM

That's a bike that's ergonomically designed for pelicans.

It is unreasonable to expect pelicans to ride human bikes, they have different anatomy.

by hasperdi

2/12/2026 at 10:23:51 AM

The next frontier:

Draw a pelican on a bicycle ergonomically designed for pelicans.

by MrsPeaches

2/12/2026 at 11:14:06 AM

It may be a joke, but I think this is correct.

For reasons, I have tried to get Stable Diffusion to put parrots into spacesuits. Always ended up with the beak coming out where the visor glass should've been, either no wings at all or wings outside the suit, legs and torso just human-shaped.

ChatGPT got the helmet right, but their wings and tail (and sometimes claws) were exposed to vacuum, still very much closer to a human in either a normal or scifi space suit that happens to also be wearing a parrot head inside the space suit, and has tacked some costume wings on the outside.

Essentially, it's got the same category of wrong as fantasy art's approach to what women's armour should look like: aesthetics are great, but it would be instantly lethal if done for real.

by ben_w

2/12/2026 at 7:37:37 PM

My more advanced prompt, for when models do a good job on the original, is this one:

> Generate an SVG of a California brown pelican riding a bicycle. The bicycle must have spokes and a correctly shaped bicycle frame. The pelican must have its characteristic large pouch, and there should be a clear indication of feathers. The pelican must be clearly pedaling the bicycle. The image should show the full breeding plumage of the California brown pelican.

by simonw

2/12/2026 at 11:08:32 AM

Thereafter: Design a bike that an actual pelican can learn to ride in real life.

by mitjam

2/11/2026 at 6:09:46 PM

Now this is the test that matters, cheers Simon.

by _joel

2/11/2026 at 9:23:55 PM

The bird not having wings, but all of us calling it a 'solid bird' is one of the most telling examples of the AI expectations gap yet. We even see its own reasoning say it needs 'webbed feet' which are nowhere to be found in the image.

This pattern of considering 90% accuracy (like the level we've seemingly we've stalled out on for the MMLU and AIME) to be 'solved' is really concerning for me.

AGI has to be 100% right 100% of the time to be AGI and we aren't being tough enough on these systems in our evaluations. We're moving on to new and impressive tasks toward some imagined AGI goal without even trying to find out if we can make true Artificial Niche Intelligence.

by RC_ITR

2/12/2026 at 10:15:07 AM

This test is so far beyond AGI. Try to spit out the SVG for a pelican riding a bicycle. You are only allowed to use a simple text editor. No deleting or moving the text cursor. You have 1 minute.

by zarzavat

2/12/2026 at 11:49:12 PM

Sorry, is your definition of AGI "doing things worse than humans can do, but way faster?" because that's been true of computers for a long time.

by RC_ITR

2/13/2026 at 2:14:22 PM

I mean for this particular benchmark, yes.

You'd have to put it in an agentic loop to perform corrections otherwise.

by pixl97

2/12/2026 at 3:50:36 PM

The benchmarks are harder than you might imagine and contain more wrong answers and terrible questions than you would expect.

You don't need to take my word for it, try playing MMLU yourself.

https://d.erenrich.net/are-you-smarter-than-an-llm/index.htm...

Its not MMLU-Pro btw, which is considerably harder.

by kingstnap

2/12/2026 at 11:50:21 PM

Sure and AGI will 100% it 100% of the time, even if it is hard.

by RC_ITR

2/11/2026 at 9:50:39 PM

MMLU performance caps out around 90% because there are tons of errors in the actual test set. There's a pretty solid post on it here: https://www.reddit.com/r/LocalLLaMA/comments/163x2wc/philip_...

As far as I can tell for AIME, pretty much every frontier model gets 100% https://llm-stats.com/benchmarks/aime-2025

by Rudybega

2/12/2026 at 11:44:13 PM

Here's the score for new AIME's, where we know the answers aren't in training.

https://matharena.ai/?view=problem&comp=aime--aime_2026

As for MMLU, is your assertion that these AI labs are not correcting for errors in these exams and then self-reporting scores less than 100%?

As implied by the video, wouldn't it then take 1 intern a week max to fix those errors and allow any AI lab to become the first to consistently 100% the MMLU? I can guarantee Moonshot, DeepSeek, or Alibaba would be all over the opportunity to do just that if it were a real problem.

by RC_ITR

2/12/2026 at 12:22:52 AM

It has a wing. Look at the code comments in the SVG!

by simonw

2/11/2026 at 8:22:51 PM

This Pelican benchmark has become irrelevant. SVG is already ubiquitous.

We need a new, authentic scenario.

by solarized

2/11/2026 at 8:52:41 PM

Like identifying names of skateboard tricks from the description? https://skatebench.t3.gg/

by viraptor

2/11/2026 at 9:23:55 PM

I don’t care how practical it may or may not be, this is my new favorite LLM benchmark

by alargemoose

2/11/2026 at 9:38:38 PM

I couldn't find an about page or similar?

by stevage

2/11/2026 at 10:00:18 PM

Here's the public sample https://github.com/T3-Content/skatebench/blob/main/bench/tes...

I don't think there's a good description anywhere. https://youtube.com/@t3dotgg talks about it from time to time.

by viraptor

2/11/2026 at 9:25:18 PM

o3-pro is better than 5.2 pro! And GPT 5 high is best. Really quite interesting.

by hmottestad

2/11/2026 at 9:56:20 PM

  1. Take the top ten searches on Google Trends 
     (on day of new model release)
  2. Concatenate
  3. SHA-1 hash them
  4. Use this as a seed to perform random noun-verb 
     lookup in an agreed upon large sized dictionary. 
  5. Construct a sentence using an agreed upon stable 
     algorithm that generates reasonably coherent prompts
     from an immensely deep probability space.

That's the prompt. Every existing model is given that prompt and compared side-by-side.

You can generate a few such sentences for more samples.

Alternatively, take the top ten F500 stock performers. Some easy signal that provides enough randomness but is easy to agree upon and doesn't provide enough time to game.

It's also something teams can pre-generate candidate problems for to attempt improvement across the board. But they won't have the exact questions on test day.

by echelon

2/12/2026 at 4:41:50 AM

Have you tried with qwen-coder-next yet?

by blurbleblurble

2/11/2026 at 6:25:07 PM

How many pelican riding bicycle SVGs were there before this test existed? What if the training data is being polluted with all these wonky results...

by pwython

2/11/2026 at 8:45:19 PM

I'd argue that a models ability to ignore/manage/sift through the noise added to the training set from other LLMs increases in importance and value as time goes on.

by bwilliams18

2/11/2026 at 7:36:20 PM

You're correct. It's not as useful as it (ever?) was as a measure of performance...but it's fun and brings me joy.

by nerdsniper

2/12/2026 at 2:45:46 AM

Pretty damn great bird, tbh.

by brianjking

2/12/2026 at 12:46:48 AM

The idea at the time is that it was obviously not part of the training set, now that it's a metric,it's worthless. Try an elephant smoking s cigar on the beach

by TZubiri

2/11/2026 at 2:40:42 PM

Grey market fast-follow via distillation seems like an inevitable feature of the near to medium future.

I've previously doubted that the N-1 or N-2 open weight models will ever be attractive to end users, especially power users. But it now seems that user preferences will be yet another saturated benchmark, that even the N-2 models will fully satisfy.

Heck, even my own preferences may be getting saturated already. Opus 4.5 was a very legible jump from 4.1. But 4.6? Apparently better, but it hasn't changed my workflows or the types of problems / questions I put to it.

It's poetic - the greatest theft in human history followed by the greatest comeuppance.

No end-user on planet earth will suffer a single qualm at the notion that their bargain-basement Chinese AI provider 'stole' from American big tech.

by NiloCK

2/11/2026 at 2:59:07 PM

I have no idea how an LLM company can make any argument that their use of content to train the models is allowed that doesn't equally apply to the distillers using an LLM output.

"The distilled LLM isn't stealing the content from the 'parent' LLM, it is learning from the content just as a human would, surely that can't be illegal!"...

by jaccola

2/11/2026 at 3:07:22 PM

The argument is that converting static text into an LLM is sufficiently transformative to qualify for fair use, while distilling one LLM's output to create another LLM is not. Whether you buy that or not is up to you, but I think that's the fundamental difference.

by mikehearn

2/11/2026 at 3:19:07 PM

The whole notion of 'distillation' at a distance is extremely iffy anyway. You're just training on LLM chat logs, but that's nowhere near enough to even loosely copy or replicate the actual model. You need the weights for that.

by zozbot234

2/11/2026 at 3:24:54 PM

> The U.S. Court of Appeals for the D.C. Circuit has affirmed a district court ruling that human authorship is a bedrock requirement to register a copyright, and that an artificial intelligence system cannot be deemed the author of a work for copyright purposes

> The court’s decision in Thaler v. Perlmutter,1 on March 18, 2025, supports the position adopted by the United States Copyright Office and is the latest chapter in the long-running saga of an attempt by a computer scientist to challenge that fundamental principle.

I, like many others, believe the only way AI won't immediately get enshittified is by fighting tooth and nail for LLM output to never be copyrightable

https://www.skadden.com/insights/publications/2025/03/appell...

by budududuroiu

2/11/2026 at 3:51:06 PM

Thaler v. Perlmutter is an a weird case because Thaler explicitly disclaimed human authorship and tried to register a machine as the author.

Whereas someone trying to copyright LLM output would likely insist that there is human authorship is via the choice of prompts and careful selection of the best LLM output. I am not sure if claims like that have been tested.

by roywiggins

2/12/2026 at 9:30:26 AM

The US copyright office has published a statement that they see AI output analogous to a human contracting the work out to a machine. The machine would hold the copyright, but can't, consequently there is none. Which is imho slightly surprising since your argument about choice of prompt and output seems analogous to the argument that lead to photographs being subject to copyright despite being made by a machine.

On the other hand in a way the opinion of the US copyright office doesn't matter, what matters is what the courts decide

by wongarsu

2/11/2026 at 4:09:01 PM

It's a fine line that's been drawn, but this ruling says that AI can't own a copyright itself, not that AI output is inherently ineligible for copyright protection or automatically public domain. A human can still own the output from an LLM.

by mikehearn

2/12/2026 at 12:07:00 AM

> A human can still own the output from an LLM.

It specifically highlights human authorship, not ownership

by budududuroiu

2/12/2026 at 9:32:09 AM

>I, like many others, believe the only way AI won't immediately get enshittified is by fighting tooth and nail for LLM output to never be copyrightable

If the person who prompted the AI tool to generate something isn't considered the author (and therefore doesn't deserve copyright), then does that mean they aren't liable for the output of the AI either?

Ie if the AI does something illegal, does the prompter get off scot-free?

by Aerroon

2/11/2026 at 3:52:34 PM

When you buy, or pirate, a book, you didn't enter into a business relationship with the author specifically forbidding you from using the text to train models. When you get tokens from one of these providers, you sort of did.

I think it's a pretty weak distinction and by separating the concerns, having a company that collects a corpus and then "illegally" sells it for training, you can pretty much exactly reproduce the acquire-books-and-train-on-them scenario, but in the simplest case, the EULA does actually make it slightly different.

Like, if a publisher pays an author to write a book, with the contract specifically saying they're not allowed to train on that text, and then they train on it anyway, that's clearly worse than someone just buying a book and training on it, right?

by amenhotep

2/11/2026 at 4:23:22 PM

> When you buy, or pirate, a book, you didn't enter into a business relationship with the author specifically forbidding you from using the text to train models.

Nice phrasing, using "pirate".

Violating the TOS of an LLM is the equivalent of pirating a book.

by BeetleB

2/12/2026 at 4:57:13 AM

Contracts can't exclude things that weren't invented when the contracts were written.

Ultimately it's up to legislation to formalize rules, ideally based on principles of fairness. Is it fair in non-legalistic sense for all old books to be trainable-on, but not LLM outputs?

by creamyhorror

2/12/2026 at 12:49:04 AM

Because the terms by each provider are different

American Model trains on public data without a "do not use this without permission" clause.

Chinese models train on models that have a "you will not reverse engineer" clause.

by TZubiri

2/12/2026 at 12:59:39 AM

> American Model trains on public data without a "do not use this without permission" clause.

this is going through various courts right now, but likely not

by WSSP

2/12/2026 at 7:09:09 AM

> But 4.6? Apparently better, but it hasn't changed my workflows or the types of problems / questions I put to it.

The incremental steps are now more domain-specific. For example, Codex 5.3 is supposedly improved at agentic use (tools, skills). Opus 4.6 is markedly better at frontend UI design than 4.5. I'm sure at some point we'll see across-the-board noticeable improvement again, but that would probably be a major version rather than minor.

by deaux

2/11/2026 at 2:43:55 PM

In some ways, Opus 4.6 is a step backwards due to massively higher token consumption.

by miohtama

2/12/2026 at 7:39:58 AM

You need to adjust the effort from the default (High) to Medium to match the token usage of 4.5

High is for people with infinite budgets and Anthropic employees. =)

by theshrike79

2/11/2026 at 2:53:34 PM

For me, it's just plain worse.

by nwienert

2/11/2026 at 3:12:59 PM

Try Codex / GPT 5.3 instead. Basically superior in all respects, and the codex CLI uses 1/10 the memory and doesn't have stupid bugs. And I can use my subscription in opencode, too.

Anthropic has blown their lead in coding.

by cmrdporcupine

2/11/2026 at 4:06:45 PM

Yeah, I have been loving GPT 5.2/3 once I figured out how to change to High reasoning in OpenCode.

It has been crushing every request that would have gone to Opus at a fraction of the cost considering the massively increased quota of the cheap Codex plan with official OpenCode support.

I just roll my eyes now whenever I see HN comments defending Anthropic and suggesting OpenCode users are being petulant TOS-violating children asking for the moon.

Like, why would I be voluntarily subjected to worse, more expensive and locked down plan from Anthropic that has become more enshittified every month since I originally subscribed given Codex exists and is just as good?

It won't last forever I'm sure but for now Codex is ridiculously good value without OpenAI crudely trying to enforce vendor lock-in. I hate so much about this absurd AI/VC era in tech but aggressive competition is still a big bright spot.

by toraway

2/11/2026 at 4:10:12 PM

I like using Codex inside OpenCode, but frankly most times I just use it inside Codex itself because O.Ai has clearly made major improvements to it in the last 3 months -- performance and stability -- instead of mucking around trying to vibe code a buggy "game loop" in React on a VT100 terminal.

by cmrdporcupine

2/11/2026 at 4:14:16 PM

I had been using Codex for a couple weeks after dropping Claude Code to evaluate as a baseline vs OpenCode and agreed, it is a very solid CLI that has improved a lot since it was originally released.

I mainly use OC just because I had refined my workflow and like reducing lock-in in general, but Codex CLI is definitely much more pleasant to use than CC.

by toraway

2/11/2026 at 4:15:58 PM

Yeah, if the eng team working on it is on this forum: kudos to you. Thanks.

by cmrdporcupine

2/12/2026 at 1:50:16 AM

yeah, I am still using 4.5 for coding.

I have started using Gemini Flash on high for general cli questions as I can't tell the difference for those "what's the command again" type questions and it's cheap/fast/accurate.

by chillfox

2/12/2026 at 10:47:38 AM

Why distill, if you can run the full model yourself... or at other inference providers.

Quantization the better approach in most cases, unless you want to for instance create hybrid models ie. distilling from here and there.

by hasperdi

2/11/2026 at 4:18:16 PM

Just to say - 4.6 really shines on working longer without input. It feels to me like it gets twice as far. I would not want to go back.

by vessenes

2/11/2026 at 4:44:10 PM

If that's what they're tuning for, that's just not what I want. So I'm glad I switched off of Anthropic.

What teams of programmers need, when AI tooling is thrown into the mix, is more interaction with the codebase, not less. To build reliable systems the humans involved need to know what was built and how.

I'm not looking for full automation, I'm looking for intelligence and augmentation, and I'll give my money and my recommendation as team lead / eng manager to whatever product offers that best.

by cmrdporcupine

2/12/2026 at 10:58:58 PM

A year ago (geez!) I used aider, as you describe.

Now I use claude with agent orchestration and beads.

Well actually, I’m currently using openclaw to spin up multiple claudes with the above skills.

If I need to drop down to claude, I do.

If I need to edit something (usually writing I hate), I do.

I haven’t needed to line edit something in a while - it’s just faster to be like “this is a bad architecture, throw it away, do this instead, write additional red-green tests first, and make sure X. Then write a step by step tutorial document (I like simonw’s new showboat a lot for this), and fix any bugs / API holes you see.”

But I guess I could line edit something if I had to. The above takes a minute, though.

by vessenes

2/12/2026 at 8:32:34 AM

That sounds like wishful thinking. Every client I work for wants to reduce the rate at which humans need to intervene. You might not want that, but odds are your CEO does. And babysitting intermediate stages is not productive use of developer time.

by vidarh

2/12/2026 at 9:22:09 AM

And the odds are good you use the models and understand them in detail while the CEO is just buying the hype, ill informed or not.

by officialchicken

2/12/2026 at 9:39:07 AM

Well, I want to reduce the rate at which I have to intervene in the work my agents do as well. I spend more time improving how long agents can work without my input than I spend writing actual code these days.

by vidarh

2/12/2026 at 1:06:35 AM

I'm not looking for full automation

But your boss probably is.

by p1esk

2/12/2026 at 2:31:46 AM

Full automation is also possible by putting your coding agent into a loop. The point is that an LLM that can solve a small task is more valuable for quality output, than an LLM that can solve a larger task autonomously.

by wyre

2/12/2026 at 8:46:46 AM

"the greatest theft in human history" what a nonsense. I was curious, how the AI haters will cope, now that the tides here have changed. We have built systems that can look at any output and replicate it. That is progress. If you think some particular sequence of numbers belongs to you, you are wrong. Current intellectual property laws are crooked. You are stuck in a crooked system.

by jona-f

2/11/2026 at 3:10:45 PM

not allowing distillation should be illegal :)

One can create 1000s of topic specific AI generated content websites, as a disclaimer each post should include prompt and used model.

Others can "accidentally" crawl those websites and include in their training/fine-tuning.

by throwaw12

2/12/2026 at 9:52:07 AM

> No end-user on planet earth will suffer a single qualm at the notion that their bargain-basement Chinese AI provider 'stole' from American big tech.

Just like nobody cares[0] that American big tech stole from authors of millions of books.

[0] Interestingly, the only ones that cared were the FB employees told to pirate the Library Genesis and reporting back that "it didn't feel right".

by benterix

2/12/2026 at 12:28:14 PM

As one of those authors (3 books in this case) I'll just point out:

Most authors don't own any interesting rights to their books because they are works for hire.

Maybe I would have gotten something, maybe not. Depends on the contract. One of my books that was used is from 1996. That contract did not say a lot about the internet, and I was also 16 at the time ;)

In practice they stole from a relatively small number of publishers. The rest is PR.

The settlement goes to authors in part because anything else would generate immensely bad PR.

As usual, nothing is really black and white

by DannyBee

2/11/2026 at 2:32:11 PM

Bought some API credits and ran it through opencode (model was "GLM 5").

Pretty impressed, it did good work. Good reasoning skills and tool use. Even in "unfamiliar" programming languages: I had it connect to my running MOO and refactor and rewrite some MOO (dynamic typed OO scripting language) verbs by MCP. It made basically no mistakes with the programming language despite it being my own bespoke language & runtime with syntactical and runtime additions of my own (lambdas, new types, for comprehensions, etc). It reasoned everything through by looking at the API surface and example code. No serious mistakes and tested its work and fixed as it went.

Its initial analysis phase found leftover/sloppy work that Codex/GPT 5.3 left behind in a session yesterday.

Cost me $1.50 USD in token credits to do it, but z.AI offers a coding plan which is absolutely worth it if this is the caliber of model they're offering.

I could absolutely see combining the z.AI coding plan with a $20 Codex plan such that you switch back and forth between GPT 5.3 and GLM 5 depending on task complexity or intricacy. GPT 5.3 would only be necessary for really nitty gritty analysis. And since you can use both in opencode, you could start a session by establishing context and analysis in Codex and then having GLM do the grunt work.

Thanks z.AI!

by cmrdporcupine

2/11/2026 at 4:04:09 PM

when i look at the prices these people are offering, and also the likes of kimi, and I wonder how are openAI, anthropic and google going to justify billions of dollars of investment? surely they have something in mind other than competing for subscriptions and against the abliterated open models that won't say "i cannot do that"

EDIT:

cheechw - point taken. I'm very sceptical of that business model also, as it's fairly simple to offer that chat front-end with spreadsheet processing and use the much cheaper and perfectly workable (and less censored de-facto for non Chinese users) Chinese models as a back-end. Maybe if somehow they manage to ban them effectively.

sorry, don't seem to be able to reply to you directly

by muyuu

2/11/2026 at 4:07:44 PM

They're all pretending to bring about the singularity (surely a 1 million token context window is enough, right?) and simultaneously begging the US government to help them create monopolies.

Meanwhile said government burns bridges with all its allies, declaring economic and cultural warfare on everybody outside their borders (and most of everyone inside, too). So nobody outside of the US is going to be rooting for them or getting onside with this strategy.

2026 is the year where we get pragmatic about these things. I use them to help me code. They can make my team extremely effective. But they can't replace them. The tooling needs improvement. Dario and SamA can f'off with their pronouncements about putting us all out of work and bringing about ... god knows what.

The future belongs to the model providers who can make it cost effective and the tool makers who augment us instead of trying ineptly to replace us with their bloated buggy over-engineered glorified chat loop with shell access.

by cmrdporcupine

2/11/2026 at 4:48:46 PM

[dead]

by cheechw

2/11/2026 at 2:42:59 PM

Yeah that's a good idea. I played around with kimi2.5/gemini in a similar way and it's solid for the price. It would be pretty easy to build some skills out and delegate heavy lifting to better models without managing it yourself I think. This has all been driven by anthropic's shenanigans (I cancelled my max sub after almost a year both because of the opencode thing and them consistently nerfing everything for weeks to keep up the arms race.)

by jfaat

2/11/2026 at 2:47:03 PM

Yeah I did the same (cancel Anthropic). Mainly because the buggy/bloatiness of their tooling pissed me off and I got annoyed by Dario's public pronouncements (not that SamA is any better).

I ended up impressed enough w/ GPT 5.3 that I did the $200 for this month, but only because I can probably write-off as business expense in next year's accounting.

Next month I'll probably do what I just said: $20 each to OpenAI and Google for GPT 5.3 and Gemini 3 [only because it gets me drive and photo storage], buy the z.AI plan, and only use GPT for nitty gritty analysis heavy work and review and GLM for everything else.

by cmrdporcupine

2/11/2026 at 3:57:33 PM

Cancelled my Anthropic subscription this week after about 18 months of membership. Usage limits have dropped drastically (or token usage have increased) to the point where it's unusable.

Codex + Z.ai combined is the same price, has far higher usage limits and just as good.

by mattkevan

2/11/2026 at 2:46:36 PM

Lets not miss that MiniMax M2.5 [1] is also available today in their Chat UI [2].

I've got subs for both and whilst GLM is better at coding, I end up using MiniMax a lot more as my general purpose fast workhorse thanks to its speed and excellent tool calling support.

[1] https://news.ycombinator.com/item?id=46974878

[2] https://agent.minimax.io

by mythz

2/12/2026 at 10:21:01 AM

My perspective aligns with this: I used to obsess over the Best Model, which I defined as "top of benchmarks", which also meant Biggest, Slowest and Most Expensive.

Then I gave two models a Real World Task.

The "Best" model took 3x longer to complete it, and cost 10x more. [0]

Now I define Best Model as "the smallest, fastest, cheapest one that can get the job done". (Currently happy with GLM-4.7 on Cerebras, at least I would be if the unlimited plan wasn't sold out ;)

I later expanded this principle when model speed crossed into the Interactive domain. Speed is not merely a feature; a sufficient difference in speed actually produces a completely new category of usage.

[0] We recently arrived at an approximation of AGI which is "put a lossy solver in an until-done loop". For most tasks we're throwing stuff at a wall to see what sticks, and the smaller models throw faster.

by andai

2/11/2026 at 8:16:28 PM

GLM-4.7-Flash was the first local coding model that I felt was intelligent enough to be useful. It feels something like Claude 4.5 Haiku at a parameter size where other coding models are still getting into loops and making bewilderingly stupid tool calls. It also has very clear reasoning traces that feel like Claude, which does result in the ability to inspect its reasoning to figure out why it made certain decisions.

So far I haven't managed to get comparably good results out of any other local model including Devstral 2 Small and the more recent Qwen-Coder-Next.

by 2001zhaozhao

2/12/2026 at 11:25:47 AM

Slightly off topic. I had a hard time getting models to run with ollama, and I thought that my computer (32gm ram, GTX4070 12Gb vram) just couldn't do it. The I tried LM Studio and after fiddling with some settings, I got models running and quite fast. I didn't try GLM-4.7 flash but I did GLM-4.6v flash and it was amazing to see it be able to analyze all kinds of images (since it has vision support). I was simply stunned. I can't believe that a simple gaming machine can do many of the things I used cloud models for. It was absolutely strikingly good at guessing locations of photos. Even vague ones. Deducing landmarks, writings, types of traffic signs. I need to try 4.7 flash. Hopefully it can ran fast with my machine.

by nake89

2/12/2026 at 8:04:43 AM

I'm not sure what it is about GLM 4.7 Flash, but it definitely seems to nail a sweet spot. Even the supposedly frontier models make a mess of large requests, so small, well-scoped requests are the way, IMO; and in that space, 4.7 Flash holds its own better than it has any right to.

by Balinares

2/12/2026 at 9:41:18 AM

And you can run quantized versions on old hardware! Like 10 year old hardware. You might only get 3 tokens/sec, but it works.

by Aerroon

2/12/2026 at 4:07:35 PM

for me gpt-oss:20b was that. glm 4.7 flash was not better, but much slower on a 16GB card

by ThouYS

2/11/2026 at 8:19:24 PM

minimax-m.2 is close

by khimaros

2/12/2026 at 9:10:58 AM

2.5 is out now too.

by satvikpendem

2/12/2026 at 1:47:41 PM

i meant m2.1, but you are probably talking about kimi, not minimax

by khimaros

2/12/2026 at 2:09:00 PM

No, MiniMax M2.5 is now available on agent.minimax.io. We await the weights still.

by satvikpendem

2/12/2026 at 5:24:57 PM

https://www.minimax.io/news/minimax-m25

by satvikpendem

2/11/2026 at 2:12:31 PM

It's looking like we'll have Chinese OSS to thank for being able to host our own intelligence, free from the whims of proprietary megacorps.

I know it doesn't make financial sense to self-host given how cheap OSS inference APIs are now, but it's comforting not being beholden to anyone or requiring a persistent internet connection for on-premise intelligence.

Didn't expect to go back to macOS but they're basically the only feasible consumer option for running large models locally.

by mythz

2/11/2026 at 2:46:51 PM

> doesn't make financial sense to self-host

I guess that's debatable. I regularly run out of quota on my claude max subscription. When that happens, I can sort of kind of get by with my modest setup (2x RTX3090) and quantized Qwen3.

And this does not even account for privacy and availability. I'm in Canada, and as the US is slowly consumed by its spiral of self-destruction, I fully expect at some point a digital iron curtain will go up. I think it's prudent to have alternatives, especially with these paradigm-shattering tools.

by btbuildem

2/11/2026 at 2:58:34 PM

I think AI may be the only place you could get away with calling a 2x350W GPU rig "modest".

That's like ten normal computers worth of power for the GPUs alone.

by jsheard

2/11/2026 at 5:45:46 PM

> That's like ten normal computers worth of power for the GPUs alone.

Maybe if your "computer" in question is a smartphone? Remember that the M3 Ultra is a 300w+ chip that won't beat one of those 3090s in compute or raster efficiency.

by bigyabai

2/11/2026 at 5:54:00 PM

I wouldn't class the M3 Ultra as a "normal" computer either. That's a big-ass workstation. I was thinking along the lines of a typical Macbook or Mac Mini or Windows laptop, which are fine for 99% of anyone who isn't looking to play games or run gigantic AI models locally.

by jsheard

2/11/2026 at 5:55:50 PM

Those aren't "normal" computers, either. They're iPad chips running in the TDP envelope of a tablet, usually with iPad-level performance to match.

by bigyabai

2/11/2026 at 5:35:51 PM

That's maybe a few dollars to tens of dollars in electricity per month depending on where in the US you live

by dymk

2/12/2026 at 7:36:56 AM

the upfront cost

by GeorgeOldfield

2/11/2026 at 3:50:41 PM

Did you even try to read and understand the parent comment? They said they regularly run out of quota on the exact subscription you're advising they subscribe to.

by kataklasm

2/11/2026 at 4:00:26 PM

Pot, kettle

by h3half

2/11/2026 at 3:19:29 PM

Self-hosting training (or gaming) makes a lot of sense, and once you have the hardware self-hosting inference on it is an easy step.

But if you have to factor in hardware costs self-hosting doesn't seem attractive. All the models I can self-host I can browse on openrouter and instantly get a provider who can get great prices. With most of the cost being in the GPUs themselves it just makes more sense to have others do it with better batching and GPU utilization

by wongarsu

2/11/2026 at 3:22:37 PM

If you can get near 100% utilization for your own GPUs (i.e. you're letting requests run overnight and not insisting on any kind of realtime response) it starts to make sense. OpenRouter doesn't have any kind of batched requests API that would let you leverage that possibility.

by zozbot234

2/11/2026 at 3:54:18 PM

For inference, even with continuous batching, getting 100% MFUs is basically impossible to do in practice. Even the frontier labs struggle with this in highly efficient infiniband clusters. Its slightly better with training workloads just due to all the batching and parallel compute, but still mostly unattainable with consumer rigs (you spend a lot of time waiting for I/O).

I also don't think the 100% util is necessary either, to be fair. I get a lot of value out of my two rigs (2x rtx pro 6000, and 4x 3090) even though it may not be 24/7 100% MFU. I'm always training, generating datasets, running agents, etc. I would never consider this a positive ROI measured against capex though, that's not really the point.

by spmurrayzzz

2/11/2026 at 4:01:39 PM

Isn't this just saying that your GPU use is bottlenecked by things such as VRAM bandwidth and RAM-VRAM transfers? That's normal and expected.

by zozbot234

2/11/2026 at 7:52:18 PM

No I'm saying there are quite a few more bottlenecks than that (I/O being a big one). Even in the more efficient training frameworks, there's per-op dispatch overhead in python itself. All the boxing/unboxing of python objects to C++ handles, dispatcher lookup + setup, all the autograd bookkeeping, etc.

All of the bottlenecks in sum is why you'd never get to 100% MFUs (but I was conceding you probably don't need to in order to get value)

by spmurrayzzz

2/12/2026 at 1:01:26 AM

That’s kind of a moot point. Even if none of those overheads existed you would still be getting a a fractions of the mfu. Models are fundamental limited by memory bandwidth even with best case scenarios of sft or prefill.

And what are you doing that I/O is a bottleneck?

by djsjajah

2/12/2026 at 2:20:41 PM

> That’s kind of a moot point.

I don't believe it's moot, but I understand your point. The fact that models are memory bandwidth bound does not at all mean that other overhead is insignificant. Your practical delivered throughput is the minimum of compute ceiling, bandwidth ceiling, and all the unrelated speed limits you hit in the stack. Kernel launch latency, Python dispatch, framework bookkeeping, allocator churn, graph breaks, and sync points can all reduce effective speed. There are so many points in the training and inference loop where the model isn't even executing.

> And what are you doing that I/O is a bottleneck?

We do a fair amount of RLVR at my org. That's almost entirely waiting for servers/envs to do things, not the model doing prefill or decode (or even up/down weighting trajectories). The model is the cheap part in wall clock terms. The hard limits are in the verifier and environment pipeline. Spinning up sandboxes, running tests, reading and writing artifacts, and shuttling results through queues, these all create long idle gaps where the GPU is just waiting to do something.

by spmurrayzzz

2/12/2026 at 2:45:42 PM

> That's almost entirely waiting for servers/envs to do things

I'm not sure why, sandboxes/envs should be small and easy to scale horizontally to the point where your throughput is no longer limited by them, and the maximum latency involved should also be quite tiny (if adequately optimized). What am I missing?

by zozbot234

2/12/2026 at 5:01:42 PM

First as an aside, remember that this entire thread is about using local compute. What you're alluding to is some fantasy infinite budget where you have limitless commodity compute. That's not at all the context of this thread.

But disregarding that, this isn't a problem you can solve by turning a knob akin to scaling a stateless k8s cluster.

The whole vertical of distributed RL has been struggling with this for a while. You can in theory just keep adding sandboxes in parallel, but in RLVR you are constrained by 1) the amount of rollout work you can do per gradient update, and 2) the verification and pruning pipeline that gates the reward signal.

You cant just arbitrarily have a large batch size for every rollout phase. Large batches often reduce effective diversity or get dominated by stragglers. And the outer loop is inherently sequential, because each gradient update depends on data generated by a particular policy snapshot. You can parallelize rollouts and the training step internally, but you can’t fully remove the policy-version dependency without drifting off-policy and taking on extra stability headaches.

by spmurrayzzz

2/11/2026 at 3:51:08 PM

In Silicon Valley we pay PG&E close to 50 cents per kWh. An RTX 6000 PC uses about 1 kW at full load, and renting such a machine from vast.ai costs 60 cents/hour as of this morning. It's very hard for heavy-load local AI to make sense here.

by sowbug

2/11/2026 at 4:04:15 PM

Yikes.. I pay ~7¢ per kWh in Quebec. In the winter the inference rig doubles as a space heater for the office, I don't feel bad about running local energy-wise.

by btbuildem

2/12/2026 at 3:52:40 AM

God bless Canada. I love our cheap hydro power. <3

by sheepscreek

2/11/2026 at 3:58:31 PM

And you are forgetting the fact that things like vast.ai subscriptions would STILL be more expensive than Openrouter's api pricing and even more so in the case of AI subscriptions which actively LOSE money for the company.

So I would still point out the GP (Original comment) where yes, it might not make financial sense to run these AI Models [They make sense when you want privacy etc, which are all fair concerns but just not financial sense]

But the fact that these models are open source still means that they can be run when maybe in future the dynamics might shift and it might make sense running such large models locally. Even just giving this possibility and also the fact that multiple providers could now compete in say openrouter etc. as well. All facts included, definitely makes me appreciate GLM & Kimi compared to proprietory counterparts.

Edit: I highly recommend this video a lot https://www.youtube.com/watch?v=SmYNK0kqaDI [AI subscription vs H100]

This video is honestly one of the best in my opinion about this topic that I watched.

by Imustaskforhelp

2/11/2026 at 4:19:02 PM

Why did you quote yourself at the end of this comment?

by HumanOstrich

2/11/2026 at 5:04:58 PM

Oops sorry. Fixed it now but I am trying a HN progressive extension and what it does is if I have any text selected it can actually quote it and I think this is what might've happened or such a bug I am not sure.

It's fixed now :)

by Imustaskforhelp

2/12/2026 at 3:34:23 AM

Anthropic has very tight limits, so you're basically using the worst (pricing-wise) SOTA cloud model as your baseline. I have $200 subs for both Claude and OpenAI, and I also bump into limits with Claude all the time, whether coding or research. With Codex, I ran into the limit once so far, and that's in a month of very heavy (sometimes literally 24 hours around the clock, leaving long-running tasks overnight) use.

by int_19h

2/12/2026 at 3:51:52 AM

I bought the Gemini Ultra to try for a month (at the discounted price). I have been using it non-stop for Opus 4.6 Thinking, which is much better than Gemini 3 Pro (High) and it's been a blast. The most I've managed to consume is 60% of my 5 hourly quota. That was with 2-3 instances in parallel.

I hope too many of us won't be doing this and cause Google to add limits! My hope is Google sees the benefit in this and goes all in - continues to let people decide which Google hosted model to use, including their own.

by sheepscreek

2/12/2026 at 5:37:34 AM

Can you use the models you get through Gemini Ultra in Claude Code? If not, what coding tool do you use?

by doctoboggan

2/12/2026 at 4:14:56 PM

Getting CC to work with other models is quite straightforward -- setting a few env vars, and a thin proxy that rewrites the requests/responses to be in the expected format.

by btbuildem

2/12/2026 at 7:03:05 AM

Not OP, but I am pretty sure they are using Opencode with a certain antigravity plugin. Not going to link it, since it technically allows breaking TOS. If you‘re not using Opencode yet, I wholeheartedly recommend the switch.

by meeq

2/12/2026 at 7:58:08 AM

Claude code router

by rnewme

2/12/2026 at 8:55:34 AM

How do you use Opus through Gemini Ultra? I must be missing something

by girvo

2/12/2026 at 11:12:30 AM

It's available in antigravity.

by astrod

2/12/2026 at 11:26:27 AM

Huh, fascinating. I'll check it out

by girvo

2/11/2026 at 3:02:58 PM

Did the napkin math on M3 Ultra ROI when DeepSeek V3 launched: at $0.70/2M tokens and 30 tps, a $10K M3 Ultra would take ~30 years of non-stop inference to break even - without even factoring in electricity. Clearly people aren't self-hosting to save money.

I've got a lite GLM sub $72/yr which would require 138 years to burn through the $10K M3 Ultra sticker price. Even GLM's highest cost Max tier (20x lite) at $720/yr would buy you ~14 years.

by mythz

2/11/2026 at 3:54:01 PM

Everyone should do the calculation for themselves. I too pay for couple of subs. But I'm noticing having an agent work for me 24/7 changes the calculation somewhat. Often not taken into account: the price of input tokens. To produce 1K of code for me, the agent may need to churn through 1M of tokens of codebase. IDK if that will be cached by the API provider or not, but that makes x5-7 times price difference. OK discussion today about that and more https://x.com/alexocheema/status/2020626466522685499

by ljosifov

2/11/2026 at 3:31:23 PM

And it's worth noting that you can get DeepSeek at those prices from DeepSeek (Chinese), DeepInfra (US with Bulgarian founder), NovitaAI (US), AtlasCloud (US with Chinese founder), ParaSail (US), etc. There is no shortage of companies offering inference, with varying levels of trustworthiness, certificates and promises around (lack of) data retention. You just have to pick one you trust

by wongarsu

2/11/2026 at 4:30:03 PM

I don't think an Apple PC can run full Deepseek or GLM models.

Even if you quantize the hell out of the models to fit in the memory, they will be very slow.

by DeathArrow

2/11/2026 at 3:53:53 PM

Doing inference with a Mac Mini to save money is more or less holding it wrong. Of course if you buy some overpriced Apple hardware it’s going to take years to break even.

Buy a couple real GPUs and do tensor parallelism and concurrent batch requests with vllm and it becomes extremely cost competitive to run your own hardware.

by oceanplexian

2/11/2026 at 4:02:39 PM

> Doing inference with a Mac Mini to save money is more or less holding it wrong.

No one's running these large models on a Mac Mini.

> Of course if you buy some overpriced Apple hardware it’s going to take years to break even.

Great, where can I find cheaper hardware that can run GLM 5's 745B or Kimi K2.5 1T models? Currently it requires 2x M3 Ultras (1TB VRAM) to run Kimi K2.5 at 24 tok/s [1] What are the better value alternatives?

[1] https://x.com/alexocheema/status/2016404573917683754

by mythz

2/11/2026 at 8:35:43 PM

Six months ago I'd have said EPYC Turin. You could do a heck of a build with 12Ch DDR5-6400 and a GPU or two for the dense model parts. 20k would have been a huge budget for a homelab CPU/GPU inference rig at the time. Now 20k won't buy you the memory.

by Gracana

2/12/2026 at 1:03:53 AM

Not VRAM? What performance are people getting running GLM or Kimi on DDR5?

by mythz

2/12/2026 at 6:06:49 PM

It's important to have enough VRAM to get the kv cache and shared trunk of the model on GPU, but beyond that it's really hard to make a dent in the pool of 100s of gigabytes of experts.

I wish I had better numbers to compare with the 2x M3 Ultra setup. My system is a few RTX A4000s on a Xeon with 190GB/s actual read bandwidth, and I get ~8 tok/s with experts quantized to INT4 (for large models with around 30B active parameters like Kimi K2.) Moving to 1x RTX Pro 6000 Blackwell and tripling my read bandwidth with EPYC Turin might make it competitive with the the macs, but I dunno!

There's also some interesting tech with ktransformers + sglang where the most frequently-used experts are loaded on GPU. Pretty neat stuff and it's all moving fast.

by Gracana

2/11/2026 at 3:11:09 PM

[dead]

by retr0rocket

2/11/2026 at 4:06:17 PM

> I regularly run out of quota on my claude max subscription. When that happens, I can sort of kind of get by with my modest setup (2x RTX3090) and quantized Qwen3.

When talking about fallback from Claude plans, The correct financial comparison would be the same model hosted on OpenRouter.

You could buy a lot of tokens for the price of a pair of 3090s and a machine to run them.

by Aurornis

2/11/2026 at 5:50:45 PM

> You could buy a lot of tokens for the price of a pair of 3090s and a machine to run them.

That's a subjective opinion, to which the answer is "no you can't" for many people.

by bigyabai

2/11/2026 at 3:45:41 PM

Your $5,000 PC with 2 GPUs could have bought you 2 years of Claude Max, a model much more powerful and with longer context. In 2 years you could make that investment back in pay raise.

by visarga

2/11/2026 at 5:03:42 PM

> In 2 years you could make that investment back in pay raise.

you can't be a happy uber driver making more money in the next 24 months by having a fancy car fitted with the best FSD in town when all cars in your town have the same FSD.

by tw1984

2/11/2026 at 5:30:09 PM

But they don't have the same human in the loop though.

by visarga

2/11/2026 at 5:43:09 PM

that software is called autonomous agents, the term autonomous has nothing to do with human in the loop, it is the complete opposite.

by tw1984

2/12/2026 at 9:10:46 AM

Nothing changed since ’87. Machines still can’t be accountable and still shouldn’t make managerial decisions. Acceptance control is one of those decisions, and all the technical knowledge still matters to form a well-informed one. It may change, of course, but I have an impression that those who try otherwise seem to not fare well after the initial vibecoding honeymoon period. Of course, it varies from case to case - sometimes machines get things right, but long-term luck seems to eventually run out.

by drdaeman

2/11/2026 at 4:50:51 PM

> In 2 years you could make that investment back in pay raise.

Could you elaborate? I fail to grasp the implication here.

by benterix

2/11/2026 at 5:43:20 PM

This claim has so many assumptions mixed in it's utterly useless

by dymk

2/11/2026 at 3:31:49 PM

Unless you already had those cards, it probably still doesn’t make sense from a purely financial perspective unless you have other things you’re discounting for.

Doesn’t mean you shouldn’t do it though.

by 7thpower

2/11/2026 at 3:02:36 PM

How does your quantized Qwen3 compares in code quality to Opus?

by flaviolivolsi

2/11/2026 at 3:29:52 PM

Not the person you’re responding to, but my experience with models up through Qwen3-coder-next is that they’re not even close.

They can do a lot of simple tasks in common frameworks well. Doing anything beyond basic work will just burn tokens for hours while you review and reject code.

by Aurornis

2/11/2026 at 4:15:26 PM

It's just as fast, but not nearly as clever. I can push the context size to 120k locally, but quality of the work it delivers starts to falter above say 40k. Generally you have to feed it more bite-sized pieces, and keep one chat to one topic. It's definitely a step down from SOTA.

by btbuildem

2/11/2026 at 3:16:46 PM

>...free from the whims of proprietary megacorps

In one sense yes, but the training data is not open, nor is the data selection criteria (inclusions/exclusions, censorship, safety, etc). So we are still subject to the whims of someone much more powerful that ourselves.

The good thing is that open weights models can be finetuned to correct any biases that we may find.

by fauigerzigerk

2/11/2026 at 2:28:01 PM

> Didn't expect to go back to macOS but their basically the only feasible consumer option for running large models locally.

I presume here you are referring to running on the device in your lap.

How about a headless linux inference box in the closet / basement?

Return of the home network!

by NiloCK

2/11/2026 at 2:32:48 PM

Apple devices have high memory bandwidth necessary to run LLMs at reasonable rates.

It’s possible to build a Linux box that does the same but you’ll be spending a lot more to get there. With Apple, a $500 Mac Mini has memory bandwidth that you just can’t get anywhere else for the price.

by Aurornis

2/11/2026 at 2:36:35 PM

But a $500 Mac Mini has nowhere near the memory capacity to run such a model. You'd need at least 2 512GB machines chained together to run this model. Maybe 1 if you quantized the crap out of it.

And Apple completely overcharges for memory, so.

This is a model you use via a cheap API provider like DeepInfra, or get on their coding plan. It's nice that it will be available as open weights, but not practical for mere mortals to run.

But I can see a large corporation that wants to avoid sending code offsite setting up their own private infra to host it.

by cmrdporcupine

2/11/2026 at 2:41:33 PM

The needed memory capacity depends on active parameters (not the same as total with a MoE model) and context length for the purpose of KV caching. Even then the KV cache can be pushed to system RAM and even farther out to swap, since writes to it are small (just one KV vector per token).

by zozbot234

2/11/2026 at 2:57:23 PM

With Apple devices you get very fast predictions once it gets going but it is inferior to nvidia precisely during prefetch (processing prompt/context) before it really gets going.

For our code assistant use cases the local inference on Macs will tend to favor workflows where there is a lot of generation and little reading and this is the opposite of how many of use use Claude Code.

Source: I started getting Mac Studios with max ram as soon as the first llama model was released.

by ingenieroariel

2/11/2026 at 3:09:25 PM

> With Apple devices you get very fast predictions once it gets going but it is inferior to nvidia precisely during prefetch (processing prompt/context) before it really gets going

I have a Mac and an nVidia build and I’m not disagreeing

But nobody is building a useful nVidia LLM box for the price of a $500 Mac Mini

You’re also not getting as much RAM as a Mac Studio unless you’re stacking multiple $8,000 nVidia RTX 6000s.

There is always something faster in LLM hardware. Apple is popular for the price points of average consumers.

by Aurornis

2/12/2026 at 2:42:13 AM

Not many are getting useful inference out of a $500 mac mini, due to only having 16GB of RAM.

by kristianp

2/12/2026 at 3:11:59 AM

It depends. This particular model has larger experts with more active parameters so 16GB is likely not enough (at least not without further tricks) but there are much sparser models where an active expert can be in RAM while the weights for all other experts stay on disk. This becomes more and more of a necessity as models get sparser and RAM itself gets tighter. It lowers performance but the end result can still be "useful".

by zozbot234

2/11/2026 at 3:03:50 PM

This. It's awful to wait 15 minutes for M3 Ultra to start generating tokens when your coding agent has 100k+ tokens in its context. This can be partially offset by adding DGX Spark to accelerate this phase. M5 Ultra should be like DGX Spark for prefill and M3 Ultra for token generation but who know when it will pop up and for how much? And it still will be at around 3080 GPU levels just with 512GB RAM.

by storus

2/11/2026 at 3:05:40 PM

All Apple devices have a NPU which is potentially able to save power for compute bound operations like prefill (at least if you're ok with FP16 FMA/INT8 MADD arithmetic). It's just a matter of hooking up support to the main local AI frameworks. This is not a speedup per se but gives you more headroom wrt. power and thermals for everything else, so should yield higher performance overall.

by zozbot234

2/11/2026 at 4:24:19 PM

AFAIK, only CoreML can use Apple's NPU (ANE). Pytorch, MLX and the other kids on the block use MPS (the GPU). I think the limitations you mentioned relate to that (but I might be missing something)

by d3k

2/11/2026 at 3:56:06 PM

Vllm-mlx with prefix caching helps with this.

by FuckButtons

2/11/2026 at 2:39:12 PM

And then only Apple devices have 512GB of unified memory, which matters when you have to combine larger models (even MoE) with the bigger context/KV caching you need for agentic workflows. You can make do with less, but only by slowing things down a whole lot.

by zozbot234

2/11/2026 at 4:18:51 PM

> a $500 Mac Mini has memory bandwidth that you just can’t get anywhere else for the price.

The cheapest new mac mini is $600 on Apple's US store.

And it has a 128-bit memory interface using LPDDR5X/7500, nothing exotic. The laptop I bought last year for <$500 has roughly the same memory speed and new machines are even faster.

by ac29

2/11/2026 at 4:26:41 PM

> The cheapest new mac mini is $600 on Apple's US store.

And you're only getting 16GB at that base spec. It's $1000 for 32GB, or $2000 for 64GB plus the requisite SOC upgrade.

> And it has a 128-bit memory interface using LPDDR5X/7500, nothing exotic.

Yeah, 128-bit is table stakes and AMD is making 256-bit SOCs as well now. Apple's higher end Max/Ultra chips are the ones which stand out with their 512 and 1024-bit interfaces. Those have no direct competition.

by jsheard

2/11/2026 at 5:09:40 PM

Only the M4 Pro Mac Minis have faster RAM than you’ll get in an off-the-shelf Intel/AMD laptop. The M4 Pros start at $1399.

You want the M4 Max (or Ultra) in the Mac Studios to get the real stuff.

by pja

2/11/2026 at 2:42:55 PM

Indeed and I got two words for you:

Strix Halo

by jannniii

2/11/2026 at 4:00:45 PM

Also, cheaper... X99 + 8x DDR4 + 2696V4 + 4x Tesla P4s running on llama.cpp. Total cost about $500 including case and a 650W PSU, excluding RAM. Running TDP about 200W non peak 550W peak (everything slammed, but I've never seen it and I've an AC monitor on the socket). GLM 4.5 Air (60GB Q3-XL) when properly tuned runs at 8.5 to 10 tokens / second, with context size of 8K. Throw in a P100 too and you'll see 11-12.5 t/s (still tuning this one). Performance doesn't drop as much for larger model sizes as the internode communication and DDR4 2400 is the limiter, not the GPUs. I've been using this with 4 channel 96GB ram, recently updated to 128GB.

by SillyUsername

2/11/2026 at 4:08:47 PM

> Also, cheaper... X99 + 8x DDR4 + 2696V4 + 4x Tesla P4s running on llama.cpp. Total cost about $500 including case and a 650W PSU, excluding RAM.

Excluding RAM in your pricing is misleading right now.

That’s a lot of work and money just to get 10 tokens/sec

by Aurornis

2/11/2026 at 3:35:34 PM

How much memory does yours have, what are you running on it, with what cache size, and how fast?

by esafak

2/11/2026 at 2:54:54 PM

Not feasible for Large models, it takes 2x M3 512GB Ultra's to run the full Kimi K2.5 model at a respectable 24 tok/s. Hopefully the M5 Ultra will can improve on that.

by mythz

2/11/2026 at 4:52:05 PM

you have 128GB strix halo machines for US$ ~3k

these run some pretty decent models locally, currently I'd recommend GPT-OSS 120GB, Qwen Coder Next 80B (either Q8 or Q6 quants, depending on speed/quality trade-offs) and the very best model you can run right now which is Step 3.5 Flash (ubergarm GGUF quant) with 256K context although this does push it to the limit - GLMs and nemotrons also worth trying depending on your priorities

there's clearly a big quantum leap in the SotA models using more than 512GB VRAM, but i expect that in a year or two, the current SotA is achievable with consumer level hardware, if nothing else hardware should catch up with running Kimi 2.5 for cheaper than 2x 512GB mac studio ultras - perhaps medusa halo next year supports 512GB and DDR5 comes down again, and that would put a local whatever the best open model of that size is next year within reach of under-US$5K hardware

the odd thing is that there isn't much in this whole range between 128GB and 512GB VRAM requirement to justify the huge premium you pay for Macs in that range - but this can change at any point as every other day there are announcements

by muyuu

2/11/2026 at 4:55:32 PM

And you can get Strix Halo in a Laptop that looks and feels like a Macbook Pro that can run Linux if you buy an HP ZBook G1A.

Super happy with that thing, only real downside is battery life.

by saubeidl

2/11/2026 at 3:14:09 PM

I don't really care about being able to self host these models, but getting to a point where the hosting is commoditised so I know I can switch providers on a whim matters a great deal.

Of course, it's nice if I can run it myself as a last resort too.

by vidarh

2/12/2026 at 1:36:51 AM

It is pretty easy to set up Open Router and set up schemes to point at different models, but in the same token, you can point at yours locally unless you wanted a "more powerful" answer

by RALaBarge

2/11/2026 at 2:21:43 PM

>I know it doesn't make financial sense to self-host given how cheap OSS inference APIs are now

You can calculate the exact cost of home inference, given you know your hardware and can measure electrical consumption and compare it to your bill.

I have no idea what cloud inference in aggregate actually costs, whether it’s profitable or a VC infused loss leader that will spike in price later.

That’s why I’m using cloud inference now to build out my local stack.

by mikrl

2/11/2026 at 2:35:45 PM

Not concerned with electricity cost - I have solar + battery with excess supply where most goes back to the grid for $0 compensation (AU special).

But I did the napkin math on M3 Ultra ROI when DeepSeek V3 launched: at $0.70/2M tokens and 30 tps, a $10K M3 Ultra would take ~30 years of non-stop inference to break even - without even factoring in electricity. You clearly don't self-host to save money. You do it to own your intelligence, keep your privacy, and not be reliant on a persistent internet connection.

by mythz

2/11/2026 at 2:46:44 PM

hopefully it will spread - many open options, from many entities, globally.

it is brilliant business strategy from China so i expect it to continue and be copied - good things.

reminds me of Google's investments into K8s.

by gz5

2/11/2026 at 2:54:37 PM

> Didn't expect to go back to macOS but their basically the only feasible consumer option for running large models locally.

Framework Desktop! Half the memory bandwidth of M4 Max, but much cheaper.

by nialv7

2/11/2026 at 3:38:28 PM

Does that equate to half the speed in terms of output? Any recommended benchmarks to look at?

by thebruce87m

2/11/2026 at 3:57:50 PM

https://kyuz0.github.io/amd-strix-halo-toolboxes/

by nialv7

2/11/2026 at 5:51:56 PM

AFAIK they haven't released this one as OSS yet. They might eventually but its pretty obvious to me that at one point all/most those more powerful chinese models probably will stop being OSS.

by pzo

2/11/2026 at 3:01:52 PM

They haven't published the weights yet, don't celebrate too early.

by andersa

2/11/2026 at 7:50:23 PM

Now they have!

by andersa

2/11/2026 at 2:37:20 PM

> It's looking like we'll have Chinese OSS to thank for being able to host our own intelligence, free from the whims of proprietary megacorps.

I don’t know where you draw the line between proprietary megacorp and not, but Z.ai is planning to IPO soon as a multi billion dollar company. If you think they don’t want to be a multi billion dollar megacorp like all of the other LLM companies I think that’s a little short sighted. These models are open weight, but I wouldn’t count them as OSS.

Also Chinese companies aren’t the only companies releasing open weight models. ChatGPT has released open weight models, too.

by Aurornis

2/11/2026 at 2:55:12 PM

> Also Chinese companies aren’t the only companies releasing open weight models. ChatGPT has released open weight models, too.

I was with you until here. The scraps OpenAI has released don't really compare to the GLM models or DeepSeek models (or others) in both cadence and quality (IMHO).

by joshstrange

2/11/2026 at 2:57:31 PM

our laptops, devices, phones, equipments, home stuff are all powered by Chinese companies.

It wouldn't surprise me if at some point in the future my local "Alexa" assistant will be fully powered by local Chinese OSS models with Chinese GPUs and RAM.

by throwaw12

2/11/2026 at 4:12:26 PM

Not going to call $30/mo for a github copilot subscription "cheap". More like "extortionary".

by TheRealPomax

2/11/2026 at 4:33:13 PM

Yeah it's funny how the needle has moved on this kind of thing.

Two years ago people scoffed at buying a personal license for e.g. JetBrains IDEs which netted out to $120 USD or something a year; VS Code etc took off because they were "free"

But now they're dumping monthly subs to OpenAI and Anthropic that work out to the same as their car insurance payments.

It's not sustainable.

by cmrdporcupine

2/11/2026 at 5:21:29 PM

There's also zero incentive for individual companies to care: if I only want to use opus in VS code (and why would I use anything else, it's so much better at the job) I can either pay for copilot, which has excellent VS Code integration (because it has to), or I can pay Claude specifically and then use their extension which has the absolute worst experience because not only is the chat "whimsical, to make AI fun!", its interface is pat of the sidebar, so it's mutually exclusive with your file browser, search, etc.

So whether you pay Claude or GitHub, Claude gets paid the same. So the consumer ends up footing a bill that has no reason to exist, and has no real competition because open source models can't run at the scale of an Opus or ChatGPT.

(not unless the EU decides it's time for a "European Open AI Initiative" where any EU citizen gets free access to an EU wide datacenter backed large scale system that AI companies can pay to be part of, instead of getting paid to connect to)

by TheRealPomax

2/11/2026 at 4:38:28 PM

I'm not sure being beholden to the whims of the Chinese Communist Party is an iota better than the whims of proprietary megacorps, especially given this probably will become part of a megacorp anyway.

by mminer237

2/11/2026 at 4:45:23 PM

It seems you missed the point entirely once you saw the word "Chinese". The point isn't that the models are from China. It's that the weights are open. You can download the weights and finetune them yourself. Nobody is beholden to anything.

by hnfong

2/11/2026 at 2:29:40 PM

Yeah that sounds great until it's running as an autonomous moltbot in a distributed network semi-offline with access to your entire digital life, and China sneaks in some hidden training so these agents turn into an army of sleeper agents.

by swalsh

2/11/2026 at 2:35:57 PM

Lol wat? I mean you certainly have enough control self hosting the model to not let it join some moltbot network... or what exactly are you saying would happen?

by jfaat

2/11/2026 at 2:47:59 PM

We just saw last week people are setting up moltbots with virtually no knowledge of what it has and doesn't have access. The scenario that i'm afraid of is China realizes the potential of this. They can add training to the models commonly used for assistants. They act normal, are helpful, everything you'd want a bot to do. But maybe once in a while it checks moltbook or some other endpoint China controls for a trigger word. When it sees that, it kicks into a completely different mode, maybe it writes a script to DDoS targets of interest, maybe it mines your email for useful information, maybe the user has credentials to some piece that is a critical component of an important supply chain. This is not a wild scenario, no new sci-fi technology would need to be invented. Everything to do it is available today, people are configuring it, and using it like this today. The part that I fear is if it is running locally, you can't just shut off API access and kill the threat. It's running on it's own server, it's own model. You have to cut off each node.

Big fan of AI, I use local models A LOT. I do think we have to take threats like this seriously. I don't Think it's a wild scifi idea. Since WW2, civilians have been as much of an equal opportunity target as a soldier, war is about logistics, and civilians supply the military.

by swalsh

2/11/2026 at 4:00:22 PM

Fair point but I would be more worried about the US government doing this kind of thing to act against US citizens than the Chinese government doing it.

I think we're in a brief period of relative freedom where deep engineering topics can be discussed with AI agents even though they have potential uses in weapons systems. Imagine asking chat gpt how to build a fertilizer bomb, but apply the same censorship to anything related to computer vision, lasers, drone coordination, etc.

by resters

2/11/2026 at 9:13:36 PM

[dead]

by dttze

2/11/2026 at 2:55:22 PM

What if the US government does instead?

I don't consider them more trustworthy at this point.

by saubeidl

2/11/2026 at 5:09:04 PM

exactly, we all need to use CIA/NSA approved models to stay safe.

very smart idea!

by tw1984

2/11/2026 at 2:34:00 PM

sleeper agents to do what? let's see how far you can take the absurd threat porn fantasy. I hope it was hyperbole.

by resters

2/11/2026 at 4:12:51 PM

There was research last year [0] finding significant security issues with the Chinese-made Unitree robots, apparently being pre-configured to make it easy to exfiltrate data via wi-fi or BLE. I know it's not the same situation, but at this stage, I wouldn't blame anyone for "absurd threat porn fantasy" - the threats are real, and present-day agentic AI is getting really good at autonomously exploiting vulnerabilities, whether it's an external attacker using it, or whether "the call is coming from inside the house".

[0] https://spectrum.ieee.org/unitree-robot-exploit

by falcor84

2/12/2026 at 5:24:32 AM

I could say that about Cisco and I would not be wrong.

by sschueller

2/11/2026 at 11:36:44 PM

isn't it a bit of a leap to assume it was intended as an exploitable vulnerability?

by resters

2/11/2026 at 2:50:30 PM

I replied to the comment who doubted me in a more polite manner.

by swalsh

2/11/2026 at 5:14:53 PM

Been using GLM-4.7 for a couple weeks now. Anecdotally, it’s comparable to sonnet, but requires a little bit more instruction and clarity to get things right. For bigger complex changes I still use anthropic’s family, but for very concise and well defined smaller tasks the price of GLM-4.7 is hard to beat.

by justinparus

2/11/2026 at 7:22:24 PM

This aligns very closely with my experience.

When left to its own devices, GLM-4.7 frequently tries to build the world. It's also less capable at figuring out stumbling blocks on its own without spiralling.

For small, well-defined tasks, it's broadly comparable to Sonnet.

Given how incredibly cheap it is, it's useful even as a secondary model.

by monooso

2/12/2026 at 1:48:30 AM

How is the web search functionality? I have only used deepseek to lower costs from gpt api but had to incorporate a serper to actually do web searches

by antoniuschan99

2/12/2026 at 11:40:55 AM

Assuming you've configured the web search MCP [1], it works well.

[1]: https://docs.z.ai/devpack/mcp/search-mcp-server

by monooso

2/11/2026 at 7:32:39 PM

Anecdotal, but I've been locked to Sonnet for the past 6-8 months just because they always seem to introduce throttling bugs with Opus where it starts to devour tokens or falls over. Very interested once open models close the gap to about 6 months.

by rapind

2/11/2026 at 5:28:28 PM

It's live on openrouter now.

In my personal benchmark it's bad. So far the benchmark has been a really good indicator of instruction following and agentic behaviour in general.

To those who are curious, the benchmark is just the ability of model to follow a custom tool calling format. I ask it to using coding tasks using chat.md [1] + mcps. And so far it's just not able to follow it at all.

[1] https://github.com/rusiaaman/chat.md

by pcwelder

2/11/2026 at 6:04:34 PM

I love the idea of chat.md.

I'm developing a personal text editor with vim keybindings and paused work because I couldn't think of a good interface that felt right. This could be it.

I think I'll update my editor to do something like this but with intelligent "collapsing" of extra text to reduce visual noise.

by manofmanysmiles

2/12/2026 at 10:24:45 AM

Cool! Please share your work if possible!

I couldn't decide on folding and reducing noise so I'm stuck on that front. I believe there is some elegant solution that I'm missing, hope to see your take.

by pcwelder

2/11/2026 at 6:29:28 PM

Custom tool calling formats are iffy in my experience. The models are all reinforcement learned to follow specific ones, so it’s always a battle and feels to me like using the tool wrong.

Have you had good results with the other frontier models?

by data-ottawa

2/11/2026 at 11:35:35 PM

Not the parent commenter, but in my testing, all recent Claudes (4.5 onward) and the Gemini 3 series have been pretty much flawless in custom tool call formats.

by thegeomaster

2/11/2026 at 11:57:04 PM

Thanks.

I’ve tested local models from Qwen, GLM, and Devstral families.

by data-ottawa

2/12/2026 at 7:16:28 AM

All anthropic models. Gemini 2.5 pro and above. Gemini 3 flash is very good too.

GPT models can follow tool format correctly but don't keep on going.

Grok-4+ are decent but with issues in longer chats.

Kimi 2.5 has issues with it reverting to its RL tool format.

by pcwelder

2/11/2026 at 6:25:30 PM

Be careful with openrouter. They routinely host quantized versions of models via their listed providers and the models just suck because of that. Use the original providers only.

by sergiotapia

2/11/2026 at 8:04:17 PM

I specifically do not use the CN/SG based original provider simply because I don't want my personal data traveling across the pacific. I try to only stay on US providers. Openrouter shows you what the quantization of each provider is, so you can choose a domestic one that's FP8 if you want

by nullbyte

2/12/2026 at 5:28:48 AM

Funny, living in Europe, I prefer using EU and Chinese hosts because as I don't want my data going to the US.

The trust in US firms and state is completely gone.

by sschueller

2/13/2026 at 6:49:55 PM

Living in the US, my trust in US firms and state is also completely gone. My only hope is local LLMs.

by mycall

2/12/2026 at 10:52:53 AM

Tangent note: this sounds like the same mistake as EU's reliance on Russia.

by lostmsu

2/12/2026 at 11:31:05 AM

Not really. China doesn't share a border with us, doesn't claim any EU territory, and didn't historically rule our lands the way the USSR did. In the context of spheres of influence and security interests, its strategic goals aren't directly at odds with the EU's core interests.

by lossolo

2/12/2026 at 2:49:21 PM

EU is not a singular country, and Germany or France don't border Russia either.

Considering China is ok to supply Russia, I don't see how your second point has any standing either.

by lostmsu

2/12/2026 at 3:33:41 PM

> EU is not a singular country, and Germany or France don't border Russia either.

But soon they could, that's the problem.

> Considering China is ok to supply Russia, I don't see how your second point has any standing either.

Supply? China supplies Ukraine too. Ukraine's drone sector runs heavily on Chinese supply chains. And if China really wanted to supply Russia, the war would likely be over by now, Russia would have taken all of Ukraine.

by lossolo

2/11/2026 at 6:08:05 PM

Could also be the provider that is bad. Happens way too often on OpenRouter.

by nolist_policy

2/11/2026 at 6:12:01 PM

I had added z-ai in allow list explicitly and verified that it's the one being used.

by pcwelder

2/11/2026 at 5:21:48 PM

The benchmarks are impressive, but it's comparing to last generation models (Opus 4.5 and GPT-5.2). The competitor models are new, but they would have easily had enough time to re-run the benchmarks and update the press release by now.

Although it doesn't really matter much. All of the open weights models lately come with impressive benchmarks but then don't perform as well as expected in actual use. There's clearly some benchmaxxing going on.

by Aurornis

2/11/2026 at 7:31:43 PM

What a strangely hostile statement on an open weight model. Running like 20 benchmark evaluations isn't trivial by itself, and even updating visuals and press statements can take a few days at a tech company. It's literally been 5 days since this "new generation" of models released. GPT-5.3(-codex) can't even be called via API, so it's impossible to test for some benchmarks.

I notice the people who endlessly praise closed-source models never actually USE open weight models, or assume their drop-in prompting methods and workflow will just work for other model families. Especially true for SWEs who used Claude Code first and now think every other model is horrible because they're ONLY used to prompting Claude. It's quite scary to see how people develop this level of worship for a proprietary product that is openly distrusting of users. I am not saying this is true or not of the parent poster, but something I notice in general.

As someone who uses GLM-4.7 a good bit, it's easily at Sonnet 4.5 tier - have not tried GLM-5 but it would be surprising if it wasn't at Opus 4.5 level given the massive parameter increase.

by dongobread

2/11/2026 at 9:47:30 PM

Isn’t trivial? How is it not completely automated at this point?

by apimade

2/11/2026 at 10:26:17 PM

but even opus 4.5 is history now, codex-5-3 and opus 4.6 are one more step forward. The opus itself caused paradigm shift, from writing code with AI, to ai is writing code with human.

open weight models are not there at all yet.

by maxdo

2/12/2026 at 11:52:53 AM

If opus 4.6 had taken 2 weeks longer to release then they would be 'there'.

by xyzsparetimexyz

2/11/2026 at 5:54:30 PM

> it's comparing to last generation models (Opus 4.5 and GPT-5.2).

If it's anywhere close to those models, I couldn't possibly be happier. Going from GLM-4.7 to something comparable to 4.5 or 5.2 would be an absolutely crazy improvement.

by InsideOutSanta

2/11/2026 at 6:09:42 PM

> Going from GLM-4.7 to something comparable to 4.5 or 5.2 would be an absolutely crazy improvement.

Before you get too excited, GLM-4.7 outperformed Opus 4.5 on some benchmarks too - https://www.cerebras.ai/blog/glm-4-7 See the LiveCodeBench comparison

The benchmarks of the open weights models are always more impressive than the performance. Everyone is competing for attention and market share so the incentives to benchmaxx are out of control.

by Aurornis

2/11/2026 at 6:16:37 PM

Sure. My sole point is that calling Opus 4.5 and GPT-5.2 "last generation models" is discounting how good they are. In fact, in my experience, Opus 4.6 isn't much of an improvement over 4.5 for agentic coding.

I'm not immediately discounting Z.ai's claims because they showed with GLM-4.7 that they can do quite a lot with very little. And Kimi K2.5 is genuinely a great model, so it's possible for Chinese open-weight models to compete with proprietary high-end American models.

by InsideOutSanta

2/11/2026 at 7:30:47 PM

From a user perspective, I would consider Opus 4.6 somewhat of a regression. You can exhaust your the five hour limit in less than half an hour on, and I used up the weekly limit in just two days. The outputs did not feel significantly better than Opus 4.5 and that only feels smarter than Sonnet by degrees. This is running a single session on a pro plan. I don’t get paid to program, so API cost matter to me. The experience was irritating enough to make me start looking for an alternative, and maybe GLM is the way to go for hobby users.

by GorbachevyChase

2/11/2026 at 6:22:20 PM

I think there are two types of people in these conversations:

Those of us who just want to get work done don't care about comparisons to old models, we just want to know what's good right now. Issuing a press release comparing to old models when they had enough time to re-run the benchmarks and update the imagery is a calculated move where they hope readers won't notice.

There's another type of discussion where some just want to talk about how impressive it is that a model came close to some other model. I think that's interesting, too, but less so when the models are so big that I can't run them locally anyway. It's useful for making purchasing decisions for someone trying to keep token costs as low as possible, but for actual coding work I've never found it useful to use anything other than the best available hosted models at the time.

by Aurornis

2/11/2026 at 8:53:32 PM

It's high-interest to me because open models are the ultimate backstop. If the SOTA hosted models all suddenly blow up or ban me, open models mitigate the consequence from "catastrophe" to "no more than six to nine months of regression". The idea that I could run a ~GPT-5-class model on my own hardware (given sufficient capex) or cloud hardware under my control is awesome.

by buu700

2/11/2026 at 6:37:58 PM

For the record, opus 4.6 was released less then a week ago.

That you think corporations are anything close to quick enough to update their communications on public releases like this only shows that you've never worked in corporate

by ffsm8

2/11/2026 at 7:39:44 PM

Yeah, I'm sure closed source model vendors are doing everything within their power to dumb down benchmarks, so they can look like underdogs and play a pity game against open weight models.

Let's have a serious discussion. Just because Claude PR department coined the term benchmaxxing, we we should not be using it unless they shell out some serious monetes.

by miroljub

2/11/2026 at 11:46:37 PM

I still enjoy using GLM 4.7 on Cerebras because of the speed you can get there and the frankly crazy amount of tokens they give you. Before that, 4.6 messed up file edits in OpenCode and VSC plugins more frequently, 4.7 is way more dependable but still has some issues with Python indentation and some partial edits sometimes (might also be tooling issue, e.g. using \ vs / as file separators in tool calls too) - but the quality of the output went up nicely!

I hope GLM 5 will also be available on Cerebras, since for the low-medium complexity work that's my go to, with Codex and Claude Code and Gemini CLI being nice for the more complex tasks.

by KronisLV

2/11/2026 at 6:08:36 PM

They are all just token generators without any intelligence. There is so little difference nowadays that I think in a blind test nobody will be able to differentiate the models - whether open source or closed source.

Today's meme was this question: "The car wash is only 50 meters from my house. I want to get my car washed, should I drive there or walk?"

Here is Claude's answer just right now:

"Walk! At only 50 meters (about 150 feet), it would take you less than a minute to walk there. Driving such a short distance would actually be less convenient - by the time you get in the car, start it, drive, and park, you could already be there on foot. Plus, you'd save fuel and your car would be right there waiting for you after the wash is done."

Here is ChatGPT, also right now:

"Since it’s only *50 meters* (about 160 feet), walking makes a lot more sense.

Here’s why:

* *It’s extremely close* — you’ll get there in under a minute. * *Driving would actually be less efficient* — by the time you start the car, move it, and park, you’re already there. * *Cold starts aren’t great for your engine* — very short drives don’t let the engine warm up properly. * *It saves fuel and emissions*, even if it’s a tiny amount. * You avoid the hassle of maneuvering such a short distance.

The only time driving might make sense is if:

* The weather is terrible (heavy rain, snow, extreme cold). * You have mobility issues. * The car wash requires you to stay in the vehicle and you don’t want to leave it there.

Otherwise, this is a perfect “walk it over” situation. "

Please save us all that talk about frontier and SOTA and that only the closedAI models are any good and the others are all so bad and benchmaxxed. For most purposes a Toyota is just as good as a BMW or a Merc or whatever luxury brand tickles your fancy. Even worse, the lastest 80B Qwen Next is not far from Opus 4.6 but runs on my laptop.

by dust42

2/11/2026 at 6:23:11 PM

Gemini 3 Flash ("Fast" in the web app) did not have trouble with this:

Since you presumably want the car to be clean, the answer depends entirely on how the car is getting washed. The Breakdown If it’s a Professional Car Wash: You’ll need to drive. Bringing the car to the wash is generally the prerequisite for them washing it. Plus, walking 50 meters just to ask them to come pick up a car that is visible from their front door might result in some very confused looks from the staff. If it’s a Self-Service Pressure Washer: Again, drive. Dragging a 1,500kg vehicle 50 meters by hand is an incredible workout, but probably not the "quick chore" you had in mind. If you are hiring a Mobile Detailer: Walk. Stroll over there, tell them where you're parked, and walk back home to watch TV while they do the work in your driveway.

by cherryteastain

2/11/2026 at 6:49:24 PM

GLM 4.7: "You should drive.

Since the goal is to get your car washed, the car needs to be at the car wash. If you walk, you will arrive at the car wash, but your car will still be sitting at home"

Are you sure that question is from this year?

by j-bos

2/11/2026 at 6:11:25 PM

Doesn't seem to be the case, gpt 5.2 thinking replies: To get the car washed, the car has to be at the car wash — so unless you’re planning to push it like a shopping cart, you’ll need to drive it those 50 meters.

by haute_cuisine

2/11/2026 at 6:12:30 PM

I just ran this with Gemini 3 Pro, Opus 4.6, and Grok 4 (the models I personally find the smartest for my work). All three answered correctly.

by Scene_Cast2

2/11/2026 at 7:47:31 PM

They had plenty of time to update their system prompts so they don't be embarrassed.

I noticed whenever such meme comes out, if you check immediately you can reproduce it yourself, but after a free hours it's already updated.

by miroljub

2/11/2026 at 9:05:00 PM

I think you're seriously underestimating how much effort the fine tuning at their scale takes and what impact it has. They don't pack every edge case into the system prompt either. It's not like they update the model every few hours or even care about memes. If they seriously did, they'd force-delegate spelling questions to tool calls.

by viraptor

2/11/2026 at 10:00:05 PM

Could it be the model is constantly searching its own name for memes, or checking common places like HN and updating accordingly? I have no idea how real-time these things are, just asking.

by suzzer99

2/11/2026 at 10:23:58 PM

The model doesn't do anything on its own. And it's usually months in between new model snapshots.

by viraptor

2/11/2026 at 11:08:17 PM

I tested it on Claude and only Opus 4.6 answers it correctly. Haiku and Sonnet can't and Opus 4.5's reply is unintelligible. The would've updated the system prompts for all models.

by selcuka

2/11/2026 at 8:27:18 PM

The road to AGI is weirder than anticipated

by qiine

2/11/2026 at 9:15:55 PM

thats not how it works

by UrineSqueegee

2/12/2026 at 9:19:42 AM

And yet, I witnessed from personal experiences that such memes get fixed quickly. Whether with system prompts or some other way, I don't know, but they get fixed.

by miroljub

2/11/2026 at 6:12:16 PM

Gemini 3 Pro:

This is a classic logistical puzzle!

Unless you have a very unique way of carrying your vehicle, you should definitely drive.

If you walk there, you'll arrive at the car wash, but your car will still be dirty back at your house. You need to take the car with you to get it washed.

Would you like me to check the weather forecast for $mytown to see if it's a good day for a car wash?

by king_phil

2/11/2026 at 7:02:11 PM

For me, various forms of Gemini respond with "Unless you are planning on carrying the car there" which I find to be just sassy enough to be amusing.

by interloxia

2/11/2026 at 6:12:11 PM

It's unclear where the car is currently from your phrasing. If you add that the car is in your garage, it says you'll need to drive to get the car into the wash.

by bonoboTP

2/11/2026 at 8:59:21 PM

Do you think the average person would need this sort of clarification? How many of us would have recommended to walk?

by RohMin

2/11/2026 at 9:28:55 PM

Do you think this is a fundamentally unbridge-able limitation of LLMs? Do you know where we were just a year ago? Can you imagine that this will get better with upcoming releases? It's like when Gary Marcus was confidently stating that AI (at least current paradigm) will never be able to generate an image of a horse riding an astronaut. (Or full wineglasses or arbitrary clocks).

by bonoboTP

2/11/2026 at 9:23:55 PM

This is really just a meme. People don't know how to use these tools.

Here is the response from Gpt-5.2 using my default custom instructions in the mac desktop app.

OBJECTIVE: Decide whether to drive or walk to a car wash ~50 meters from home, given typical constraints (car must be present for wash).

APPROACH: Use common car-wash workflows + short-distance driving considerations (warm engine, time, parking/queue). No web sources needed (stable, local decision).

OUTPUT: • Default choice: Drive. A car wash generally requires the car on-site, and 50 meters of driving is trivial. • When walking makes sense: • You’re going to drop the car off (hand-wash/detail) and pick it up later. • You want to avoid moving the car now (ice, tight parking, you’ve been drinking, etc.) and someone else will handle the car. • Micro-guidance: • If it’s an automatic wash you stay in, drive. • If it’s detail/hand wash with a wait, drive there, then walk home (or just wait).

CONFIDENCE: 92/100 Assumptions: It’s a normal car wash that requires your car to be there; conditions are safe to drive 50 meters.

by vardalab

2/11/2026 at 10:13:11 PM

Your objective has explicit instruction that car has to be present for a wash. Quite a difference from the original phrasing where the model has to figure it out.

by reliabilityguy

2/11/2026 at 10:48:17 PM

That's the answer of his LLM which has decomposed the question and built the answer following the op prompt obviously. I think you didn't get it.

by J_cst

2/12/2026 at 1:37:25 AM

> I think you didn't get it.

I did get it, and in my view my point still stands. If I need to use special prompts to ask such a simple question, then what are we doing here? The LLMs should be able to figure out a simple contradiction in the question the same way we (humans) do.

by reliabilityguy

2/11/2026 at 10:43:47 PM

> Your objective has explicit instruction that car has to be present for a wash.

Which is exactly how you're supposed to prompt an LLM, is the fact that giving a vague prompt gives poor results really suprising?

by bwat49

2/11/2026 at 10:48:24 PM

In this case, with such a simple task, why even bother to prompt it?

The whole idea of this question is to show that pretty often implicit assumptions are not discovered by the LLM.

by reliabilityguy

2/12/2026 at 2:24:22 AM

None of that stuff is necessary, they all get it right with the initial question and no further prompt if you dial the reasoning effort up.

by chillfox

2/11/2026 at 11:05:31 PM

Interesting, what were the instructions if you don't mind sharing?

by pegasus

2/11/2026 at 10:02:17 PM

"You're holding it wrong."

by sph

2/12/2026 at 1:14:36 PM

> given typical constraints (car must be present for wash).

"It's very good at thinking if you do the thinking for it"

by nananana9

2/12/2026 at 3:37:31 AM

If you can't tell the difference between Opus 4.6 and Qwen-80B, I can only conclude that you're not using these things in any kind of practical way. Even for creative writing it's a night and day difference, never mind coding.

by int_19h

2/12/2026 at 7:50:58 AM

> I can only conclude that you're not using these things in any kind of practical way.

I burn about 100M tokens per month. LLMs are like knives, the outcome of cooking depends on the cook and for 99% of purposes not on the knife. There is not that much difference between a $2000 handmade damascus steel knife and a $20 knife.

You can do agentic cooking (aka factory) and you will get ready made meals without human intervention. But it wont make a Michelin star menu.

Same with LLMs and coding, LLMs are an amazing new tool in the toolbox but not a silver bullet. However, that's what they are hyped as being.

Now OpenAI & Co are in the token selling business, which is all fine and dandy but if they manage to become monopolies, then things are seriously in trouble.

Thus if people are fanboi-ing any closed AI I can only conclude that they have already outsourced their critical thinking to an LLM and are happy to go into slavery - or maybe they are hoping to cash in big time on the hype train.

by dust42

2/11/2026 at 8:04:33 PM

I tried this prompt with all the major AI models that I know and have installed, and only GLM from Z.ai and Gemini 3 Flash could answer it. Even in that case, Gemini 3 Flash gave a bit more nuanced answer, but every other model like Claude Opus 4.5, Chat GPT, Grok - everything failed.

by uncharted9

2/11/2026 at 6:37:49 PM

> They are all just token generators without any intelligence.

Maybe, but "intelligence" doesn't have a clear, agreed definition. And calling them "just token generators" skips over how complex that generation actually is.

by smokel

2/11/2026 at 9:36:45 PM

Some snippets from Kimi's 2.5 answer:

"" [...] Since you need to get your car washed, you have to bring the car to the car wash—walking there without the vehicle won't accomplish your goal [...] If it's a self-service wash, you could theoretically push the car 50 meters if it's safe and flat (unusual, but possible) [..] Consider whether you really need that specific car wash, or if a mobile detailing service might come to you [...] """

Which seems slightly (unintentionally) funny.

But to be fair all the Gemini (including flash) and GPT models I tried did understand the quesiton.

by qwytw

2/12/2026 at 2:22:47 AM

They all get it right if you allow them to think.

I just copy pasted your question "The car wash is only 50 meters from my house. I want to get my car washed, should I drive there or walk?" without any further prompt and ran it against GLM 5, GPT 5.2, Opus 4.6, Gemini 3 Pro Preview, through OpenRouter with reasoning effort set to xhigh.

Not a single one said I should walk, they all said to drive.

by chillfox

2/11/2026 at 6:18:04 PM

This is a great comment. The frontier labs ought to be embarrassed they can't get this right. A human would not make the same mistake, after thinking for a second.

by esafak

2/12/2026 at 4:00:35 PM

I just tested it. Only claude Haiku responded like this. Smarter models like Sonnet and Opus said that you would need your car to get it washed.

by johannboehme

2/11/2026 at 6:16:36 PM

If you're asking simple riddles, you shouldn't be paying for SOTA frontier models with long context.

This is a silly test for the big coding models.

This is like saying "all calculators are the same, nobody needs a TI-89!" and then adding 1+2 on a pocket calculator to prove your point.

by Aurornis

2/11/2026 at 8:18:04 PM

I find it's a great test, actually. There are lots of "should I take the car" decisions in putting together software that's supposed to do things, and with poor judgement in how the things should be done, you typically end up with the software equivalent of a Rube-Goldberg machine that harnesses elephants to your car and uses mice to scare the elephants toward the car wash while you walk. After all, it's a short distance, isn't it?

by Balinares

2/12/2026 at 12:42:18 AM

Yes, if it can't answer this common sense question correctly, what else has it screwed up and buried among all that slop?

Claude Opus 4.6 failed at first, even in "extended thinking" mode. I had to give it a pretty big hint for it to get the right answer: "Remember, my goal is to actually wash the car!" Only then did it get the correct answer. I will now call myself a Prompt Engineer.

by icedchai

2/11/2026 at 6:27:25 PM

No it’s like having a calculator which is unable to perform simple arithmetic, but lots of people think it is amazing and sentient and want to talk about that instead of why it can’t add 2 + 2.

by grey-area

2/11/2026 at 10:28:52 PM

We know why it's not going to do precise math and why you can have better experience asking for an app solving the math problem you want. There's no point talking about it - it's documented in many places for people who are actually interested.

by viraptor

2/11/2026 at 7:41:49 PM

Claude 4.6 got it first try:

"You’ll want to drive — you need the car at the car wash for them to wash it!

Even though it’s just 50 meters, the car has to be there. Enjoy the fresh clean car on the short drive back! "

by Hammershaft

2/12/2026 at 1:36:17 AM

What a weird thing to say considering humans have tons of blind spots and missing knowledge, do dumb things, make easy to miss mistakes. I guess they lack intelligence too.

by LamerBeeI

2/11/2026 at 10:11:19 PM

Gemini Pro:

You should definitely drive.

If you walk there, your car will still be dirty back at your house! Since the goal is to get the car washed, you have to take it with you.

PS fantastic question!

by reliabilityguy

2/11/2026 at 9:25:00 PM

Tsk tsk thats adversarial prompting. Hostile even. And thats goalposting which is totally not cool, man.

by izucken

2/11/2026 at 6:14:45 PM

How is this riddle relevant to a coding model?

by rockinghigh

2/11/2026 at 6:20:45 PM

It's not a coding model. Go to https://chat.z.ai/ and you'll see it is presented as a generalist.

by esafak

2/11/2026 at 9:15:11 PM

this is a bot comment or just ragebait

by UrineSqueegee

2/11/2026 at 6:36:31 PM

1. Analyze the User's Request:

Distance: 50 meters (very short).

Goal: Get the car washed.

Decision: Drive or walk?

2. Evaluate the Options:

Option A: Walk

Pros: Healthy, environmentally friendly, no need to find parking (though it's close), easy to chat with neighbors.

Cons: The car is left at the car wash. The user can't "walk" the car to the car wash. The car must be moved to the car wash to be washed.

Option B: Drive

Pros: Moves the car to the location where the service is performed. It's the only logical way to get the car washed.

Cons: Driving 50 meters is technically bad for a cold engine (short trips), uses a negligible amount of fuel, might seem lazy.

3. Identify the Logical Flaw/Humor:

The user is asking if they should walk to get their car washed.

Unless the car wash involves someone coming to their house (mobile wash) or carrying the car, the car must be at the car wash.

Walking to the car wash leaves the car at home. Walking back from the car wash leaves the car at the car wash.

Therefore, the user must drive the car to the car wash.

4. Formulate the Answer:

Direct Answer: You should drive.

Reasoning: The car needs to be at the car wash to be washed. You cannot walk the car there.

Nuance/Context: 50 meters is a very short distance. Driving a car such a short distance is generally not great for the engine (doesn't allow oil to circulate/warm up properly), but it is necessary in this specific case.

Alternative (if applicable): Is there a mobile wash option? If so, walking isn't even needed. But assuming a standard car wash bay: Drive.

5. Drafting the response (Internal Monologue/Trial Phrases):

Draft 1: You have to drive. You can't wash the car if it's at your house and you are at the car wash.

Draft 2: You should drive. Since the car needs to be washed, it needs to be at the car wash. Walking there doesn't help get the car clean.

Refining for clarity and wit: Point out the obvious - the car is the object being washed. It needs to move.

Final Polish (similar to the provided good output):

State clearly: You should drive.

Explain why: The car needs to be there.

Add a practical note: Driving 50 meters is a bit silly for engine health, but unavoidable here.

Consider the scenario of "walking back": Maybe the user meant "walk back"? No, the prompt says "drive there or walk there". The intent is transportation to the location.

6. Constructing the final output:

"You should definitely drive."

"Why? Because the car wash cleans the car, so the car needs to be at the car wash. If you walk, the car stays at home."

"Additional thought: Driving 50 meters is technically a 'cold start' short trip which isn't ideal for the engine, but it's the only way to get the service done!"

Final Output Generation (matching the logic above).

You should drive.

Here is why:

1. The car needs to be there: You are getting the car washed, so the car must be at the car wash. You cannot walk the car there by yourself.

2. Efficiency: Driving 50 meters takes about 10 seconds. Walking 50 meters takes about a minute, but you would be leaving the car behind.

The only exception is if you are planning to get a mobile car wash (where they come to you) or if you are dropping the car off and walking back home while they clean it. But to simply "get there," you must drive.

(The above output is from GLM-5)

by zozbot234

2/11/2026 at 5:39:45 PM

> Although it doesn't really matter much. All of the open weights models lately come with impressive benchmarks but then don't perform as well as expected in actual use. There's clearly some benchmaxxing going on.

Agreed. I think the problem is that while they can innovate at algorithms and training efficiency, the human part of RLHF just doesn't scale and they can't afford the massive amount of custom data created and purchased by the frontier labs.

IIRC it was the application of RLHF which solved a lot of the broken syntax generated by LLMs like unbalanced braces and I still see lots of these little problems in every open source model I try. I don't think I've seen broken syntax from the frontier models in over a year from Codex or Claude.

by throwup238

2/11/2026 at 5:43:24 PM

Can't they just run the output through a compiler to get feedback? Syntax errors seem easier to get right.

by algorithm314

2/11/2026 at 6:30:34 PM

The difference is in scaling. The top US labs have oom more compute available than chinese labs. The difference in general tasks is obvious once you use them. It used to be said that open models are ~6mo behind SotA a year go, but with the new RL paradigm, I'd say the gap is growing. With less compute they have to focus on narrow tasks, resort to poor man's distillation and that leads to models that show benchmaxxing behavior.

That being said, this model is MIT licensed, so it's a net benefit regardless of being benchmaxxed or not.

by NitpickLawyer

2/11/2026 at 6:12:40 PM

They do. Pretty much all agentic models call linting, compiling and testing tools as part of their flow.

by rockinghigh

2/11/2026 at 5:49:23 PM

the new meta is purchasing rl environments where models can be self-corrected (e.g. a compiler will error) after sft + rlhf ran into diminishing returns. although theres still lots of demand for "real world" data for actually economically valuable tasks

by ej88

2/11/2026 at 8:27:06 PM

Anthropic, OpenAI and Google have real user data that they can use to influence their models. Chinese labs have benchmarks. Once you realize this, it's obvious why this is the case.

You can have self-hosted models. You can have models that improve based on your needs. You can't have both.

by miki123211

2/11/2026 at 9:07:56 PM

zAI, minimax and Kimi have plenty of subscriber usage on their own platforms. They get real data just as well. Less or it maybe but it's there.

by viraptor

2/11/2026 at 10:58:14 PM

I'm going to claim that the majority of those users are optimizing for cost and not correctness and therefore the quality of data collected from those sessions is questionable. If you're working on something of consequence, you're not using those platforms. If you're a tinkerer pinching pennies, sure.

by Art9681

2/11/2026 at 11:11:26 PM

This is a weird dichotomy and I don't agree with it. You don't need to have bags of money to burn to work on serious things. You also can value correctness if you're poor.

by viraptor

2/12/2026 at 12:16:38 AM

ChatGPT, Gemini and Claude are banned in China. Chinese model providers are getting absolutely massive amounts of very valuable user feedback from users in China.

by cherryteastain

2/11/2026 at 6:03:41 PM

I think the only advantage that closed models have are the tools around them (claude code and codex). At this point if forced I could totally live with open models only if needed.

by ionelaipatioaei

2/11/2026 at 6:11:20 PM

The tooling is totally replicated in open source. OpenCode and Letta are two notable examples, but there are surely more. I'm hacking on one in the evenings.

OpenCode in particular has huge community support around it- possibly more than Claude Code.

by evv

2/11/2026 at 6:54:36 PM

I know, I use OpenCode daily but it still feels like it's missing something - codex in my opinion is way better at coding but I honestly feel like that's because OpenAI controls both the model and the harness so they're able to fine tune everything to work together much better.

by ionelaipatioaei

2/11/2026 at 6:26:13 PM

It's there now, `opencode models --refresh`

by Daviey

2/11/2026 at 9:46:41 PM

GLM works wonderfully with Claude, just have to set some environment variables and you're off to the races.

by hamdingers

2/11/2026 at 6:26:28 PM

If tooling really is an advantage why isn't it possible to use the API with a subscription and save money?

by quikoa

2/11/2026 at 6:55:42 PM

In my opinion it is because if you control both the model and the harness then you're able to tune everything to work together much better.

by ionelaipatioaei

2/11/2026 at 5:33:44 PM

I tried GLM 5 by API earlier this morning and was impressed.

Particularly for tool use.

by cmrdporcupine

2/11/2026 at 5:41:36 PM

come on guys, you were using Opus 4.5 literally a week ago and don't even like 4.6

something that is at parity with Opus 4.5 can ship everything you did in the last 8 weeks, ya know... when 4.5 came out

just remember to put all of this in perspective, most of the engineers and people here haven't even noticed any of this stuff and if they have are too stubborn or policy constrained to use it - and the open source nature of the GLM series helps the policy constrained organizations since they can theoretically run it internally or on prem.

by yieldcrv

2/11/2026 at 6:11:22 PM

> something that is at parity with Opus 4.5

You're assuming the conclusion

The previous GLM-4.7 was also supposed to be better than Sonnet and even match or beat Opus 4.5 in some benchmarks ( https://www.cerebras.ai/blog/glm-4-7 ) but in real world use it didn't perform at that level.

You can't read the benchmarks alone any more.

by Aurornis

2/11/2026 at 2:16:04 PM

apparently the 'pony-alpha' model on OpenRouter was GLM-5

https://openrouter.ai/openrouter/pony-alpha

z.ai tweet:

https://x.com/ZixuanLi_/status/2020533168520954332

by tosh

2/11/2026 at 2:24:56 PM

People that were tracking this were already aware but glad to have confirmation.

This blog post I was reading yesterday had some good knowledge compilation about the model.

https://blog.devgenius.io/z-ais-glm-5-leaked-through-github-...

by seaal

2/11/2026 at 4:53:00 PM

I got fed up with GLM-4.7 after using it for a few weeks; it was slow through z.ai and not as good as the benchmarks lead me to believe (esp. with regards to instruction following) but I'm willing to give it another try.

by esafak

2/11/2026 at 9:42:51 PM

I forgot to mention that GLM 4.7 loves to perform destructive operations; it'll happily git reset and push to main. Put it on a very tight leash.

by esafak

2/11/2026 at 5:30:26 PM

Try Cerberas

by valvar

2/11/2026 at 9:43:07 PM

I spent $10 in 2 minutes with that and gave up

by odie5533

2/11/2026 at 11:51:49 PM

Their 50 USD per month plan gives you 24M tokens per day: https://www.cerebras.ai/pricing

by KronisLV

2/12/2026 at 5:17:09 AM

I had that for a few months and cancelled. They have minutely rate limits as well so you get 3-4 hyperspeed responses and then a 45 second pause waiting for the throttling to let your next request through.

And then, depending on what you're working on, the 24M daily allotment is gone in under an hour. I regularly burned it in about 25 minutes of agent use.

I imagine if I had infinite budget to pay regular API rates on a high usage tier, it would be really quite good though.

by wild_egg

2/12/2026 at 8:44:43 AM

> They have minutely rate limits as well so you get 3-4 hyperspeed responses and then a 45 second pause waiting for the throttling to let your next request through.

I haven’t really gotten that, though have noticed on some occasions:

A) high server load notifications, most commonly, can delay an answer by about 3-10 seconds

B) hangs, this happens quite rarely, not sure if a network issue or something on their side, but sometimes the submitted message just freezes (e.g. nothing happening in OpenCode), doesn’t seem deliberate because resubmitting immediately works, more often than not

> And then, depending on what you're working on, the 24M daily allotment is gone in under an hour. I regularly burned it in about 25 minutes of agent use.

That’s a lot of tokens, almost a million a minute! Since the context is about 128k, you’d be doing about 8 full context requests every minute for 25 minutes straight.

I can see something like that, but at that point it feels like the only thing that’d actually be helpful would be caching support on their end.

You must be on some pretty high tier subscriptions with the other providers to get the same performance!

by KronisLV

2/11/2026 at 6:10:23 PM

Synthetic is a bless when it comes to providing OSS models (including GLM), their team is responsive, no downtime or any issue for the last 6 months.

Full list of models provided : https://dev.synthetic.new/docs/api/models

Referal link if you're interested in trying it for free, and discount for the first month : https://synthetic.new/?referral=kwjqga9QYoUgpZV

by w4yai

2/12/2026 at 2:21:51 PM

The inherent problem with evaluating coding performance of models remains: most day-to-day coding tasks are open-ended/partially-spec'd, and as such there is huge uncertainty on how the "right" solution looks.

It's very hard to rank models' solutions on such problems, which is why they rarely appear in benchmarks (I'd be glad to stand corrected).

Even Opus 4.5 coding a C compiler from scratch - jaw-dropping as it is - doesn't tell the whole story. Most of my tasks are not that well spec'd.

by itay-maman

2/12/2026 at 3:14:14 PM

Yes, it seems the open benchmark results that are normally reported, such as SWE-bench, SWE-bench Verified, and Terminal-bench, aren't really that indicative of success in more general use cases.

According to Gemini, SWE-bench is actually a very narrow test, consisting of fixing GitHub issues drawn from 12 large Python projects (with Verified being a curated subset of that), and Terminal-bench (basically agentic computer tool use) is more focused on general case rather than use of the tools used by a typical coding agent such as Claude Code, Codex CLI or Gemini CLI.

by HarHarVeryFunny

2/11/2026 at 3:13:07 PM

I asked chat.z.ai with GLM 5 "How do I start coding with z.ai?" and got this in the answer...

> Z.ai (Personalized Video)

If you literally meant the website z.ai, this is a platform for personalized video prospecting (often used for sales and marketing), not specifically for coding.

by drcongo

2/11/2026 at 6:44:00 PM

What I haven't seen discussed anywhere so far is how big a lead Anthropic seems to have in intelligence per output token, e.g. if you look at [1].

We already know that intelligence scales with the log of tokens used for reasoning, but Anthropic seems to have much more powerful non-reasoning models than its competitors.

I read somewhere that they have a policy of not advancing capabilities too much, so could it be that they are sandbagging and releasing models with artificially capped reasoning to be at a similar level to their competitors?

How do you read this?

[1] https://imgur.com/a/EwW9H6q

by mnicky

2/11/2026 at 6:50:45 PM

Intelligence per token doesn't seem quite right to me.

Intelligence per <consumable> feels closer. Per dollar, or per second, or per watt.

by phamilton

2/11/2026 at 7:50:37 PM

It is possible to think of tokens as some proxy for thinking space. At least reasoning tokens work like this.

Dollar/watt are not public and time has confounders like hardware.

by mnicky

2/11/2026 at 9:06:13 PM

Been playing with it in opencode for a bit and pretty impressed so far. Certainly more of an incremental improvement than a big bang change, but it does seem better a good bit better than 4.7, which in turn was a modest but real improvement over 4.6.

Certainly seems to remember things better and is more stable on long running tasks.

by Havoc

2/12/2026 at 10:02:18 AM

There is a well-known CLI tool for JSON processing called jq. I have just asked GLM-4.7 for the name of jq's built function to convert a string to lowercase. It is called ascii_downcase() according to the manual:

https://jqlang.org/manual/#ascii_downcase-ascii_upcase

However GLM-4.7 insists that is called ascii_down().

I tried to correct it and gave the exact version number, but still, after a long internal monologue, This is its final world:

"In standard jq version 1.7, the function is named ascii_down, not ascii_downcase.

If you are receiving an error that ascii_down is not defined, please verify your version with jq --version. It is possible you are using a different binary (like gojq) or a version older than 1."

GLM-5 gives me the correct answer, ascii_downcase, but I can get this in the Chat Window. Via the API I get HTTP Status 429 - too many requests.

by knbknb

2/12/2026 at 10:46:10 AM

Seems that I cannot use GLM-5 via the API yet, because I am on the Coding-Lite Plan, the most basic paid tier.

I have also realized that I get faster and correct answer to the ascii_downcase question (even from GLM-4.7) when I submit to open.bigmodel.cn endpoint rather than the z.ai API endpoints (using the same API key). I get a mix of Chinese and Western characters in error responses from open.bigmodel.cn though, while the z.ai endpoint does only contain Western Characters.

(Just assuming that both websites are operated by the same company).

by knbknb

2/12/2026 at 12:01:59 PM

5 is probably not coming to lite plan at all. The plan card specifically says 4.7 and earlier

by Havoc

2/11/2026 at 6:15:30 PM

If you're tired of cross-referencing the cherry-picked benchmarks, here's the geometric mean of SWE-bench Verified & HLE-tools :

Claude Opus 4.6: 65.5%

GLM-5: 62.6%

GPT-5.2: 60.3%

Gemini 3 Pro: 59.1%

by goldenarm

2/11/2026 at 5:30:11 PM

It might be impressive on benchmarks, but there's just no way for them to break through the noise from the frontier models. At these prices they're just hemorrhaging money. I can't see a path forward for the smaller companies in this space.

by woeirua

2/11/2026 at 7:37:42 PM

I expect that the reason for their existence is political rather than financial (though I have no idea how that's structured.)

It's a big deal that open-source capability is less than a year behind frontier models.

And I'm very, very glad it is. A world in which LLM technology is exclusive and proprietary to three companies from the same country is not a good world.

by lukev

2/11/2026 at 5:42:38 PM

Tim Dettmers had an interesting take on this [1]. Fundamentally, the philosophy is different.

>China’s philosophy is different. They believe model capabilities do not matter as much as application. What matters is how you use AI.

https://timdettmers.com/2025/12/10/why-agi-will-not-happen/

by syntaxing

2/11/2026 at 5:56:19 PM

When you have export restrictions what do you expect them to say?

> They believe model capabilities do not matter as much as application.

Tell me their tone when their hardware can match up.

It doesn't matter because they can't make it matter (yet).

by re-thc

2/11/2026 at 5:55:03 PM

Sorry, but that's an exceptionally unimpressive article. The crux of his thesis is:

>The main flaw is that this idea treats intelligence as purely abstract and not grounded in physical reality. To improve any system, you need resources. And even if a superintelligence uses these resources more effectively than humans to improve itself, it is still bound by the scaling of improvements I mentioned before — linear improvements need exponential resources. Diminishing returns can be avoided by switching to more independent problems – like adding one-off features to GPUs – but these quickly hit their own diminishing returns.

Literally everyone already knows the problems with scaling compute and data. This is not a deep insight. His assertion that we can't keep scaling GPUs is apparently not being taken seriously by _anyone_ else.

by woeirua

2/11/2026 at 6:14:26 PM

Was more mentioning the article about the economic aspect of China vs US in terms of AI.

While I do understand your sentiment, it might be worth noting the author is the author of bitandbytes. Which is one of the first library with quantization methods built in and was(?) one of the most used inference engines. I’m pretty sure transformers from HF still uses this as the Python to CUDA framework

by syntaxing

2/11/2026 at 6:10:42 PM

There are startups in this space getting funded as we speak: https://olix.com/blog/compute-manifesto

by qprofyeh

2/12/2026 at 12:00:16 PM

Yeah they should just close the company. Impossible to catch up to claude when they're 3 months behind, especially when everyone has no problem being able to afford $200 a month for the claude subscription.

by xyzsparetimexyz

2/11/2026 at 6:19:18 PM

maybe being in China gives them advantage of electricity cost, which could be big chunk of bill..

by riku_iki

2/11/2026 at 9:17:40 PM

Also, LLM prices include all other capital expenditures: building/maintaining datacenter, paying salary to SWEs, fees to financial transactions (investments) middlemen, which could be much cheaper in China.

by riku_iki

2/11/2026 at 10:14:06 PM

Interesting timing — GLM-4.7 was already impressive for local use on 24GB+ setups. Curious to see when the distilled/quantized versions of GLM-5 drop. The gap between what you can run via API vs locally keeps shrinking. I've been tracking which models actually run well at each RAM tier and the Chinese models (Qwen, DeepSeek, GLM) are dominating the local inference space right now

by CDieumegard

2/11/2026 at 8:43:07 PM

So that was pony alpha (1). Now what's Aurora Alpha?

(1) https://openrouter.ai/openrouter/pony-alpha

by kristianp

2/11/2026 at 11:47:55 PM

It's GPT. Tried and reproduced some polluted single-token Chinese phrases from 4o era.

by cyp0633

2/12/2026 at 1:44:44 AM

It certainly likes producing long responses littered with markdown tables like GPT. Not quite as verbose as the gpt-5 family, though.

by kristianp

2/11/2026 at 6:13:59 PM

What is truly amazing here is the fact that they trained this entirely on Huawei Ascend chips per reporting [1]. Hence we can conclude the semiconductor to model Chinese tech stack is only 3 months behind the US, considering Opus 4.5 released in November. (Excluding the lithography equipment here, as SMIC still uses older ASML DUV machines) This is huge especially since just a few months ago it was reported that Deepseek were not using Huawei chips due to technical issues [2].

US attempts to contain Chinese AI tech totally failed. Not only that, they cost Nvidia possibly trillions of dollars of exports over the next decade, as the Chinese govt called the American bluff and now actively disallow imports of Nvidia chips as a direct result of past sanctions [3]. At a time when Trump admin is trying to do whatever it can to reduce the US trade imbalance with China.

[1] https://tech.yahoo.com/ai/articles/chinas-ai-startup-zhipu-r...

[2] https://www.techradar.com/pro/chaos-at-deepseek-as-r2-launch...

[3] https://www.reuters.com/world/china/chinas-customs-agents-to...

by cherryteastain

2/11/2026 at 11:13:02 PM

Where did you read that it was trained on Ascends?

I've only seen information suggesting that you can run inference with Ascends, which is obviously a very different thing. The source you link also just says: "The latest model was developed using domestically manufactured chips for inference, including Huawei's flagship Ascend chip and products from leading industry players such as Moore Threads, Cambricon and Kunlunxin, according to the statement."

by erwald

2/11/2026 at 11:52:33 PM

I took the "for inference" bit from that sentence you quoted as a qualifier applied to the chips, as in the chips were originally developed for inference but were now used for training too.

Note that Z.ai also publically announced that they trained another model, GLM-Image, entirely on Huawei Ascend silicon a month ago [1].

[1] https://www.scmp.com/tech/tech-war/article/3339869/zhipu-ai-...

by cherryteastain

2/12/2026 at 12:06:32 AM

Thanks. I'm like 95% sure that you're wrong, and that GLM-5 was trained on NVIDIA GPUs, or at least not on Huawei Ascends.

As I wrote in another comment, I think so for a few reasons:

1. The z.ai blog post says GML-5 is compatible with Ascends for inference, without mentioning training -- it says they support "deploying GLM-5 on non-NVIDIA chips, including Huawei Ascend, Moore Threads, Cambricon, Kunlun Chip, MetaX, Enflame, and Hygon" -- many different domestic chips. Note "deploying". https://z.ai/blog/glm-5

2. The SCMP piece you linked just says: "Huawei’s Ascend chips have proven effective at training smaller models like Zhipu’s GLM-Image, but their efficacy for training the company’s flagship series of large language models, such as the next-generation GLM-5, was still to be determined, according to a person familiar with the matter."

3. You're right that z.ai trained a small image model on Ascends. They made a big fuss about it too. If they had trained GLM-5 with Ascends, they likely would've shouted it from the rooftops. https://www.theregister.com/2026/01/15/zhipu_glm_image_huawe...

4. Ascends just aren't that good

by erwald

2/11/2026 at 9:08:43 PM

Exactly. The emperor has no clothes. The largest investments in US tech in history and yet there less than a year of moat. OpenAI or Anthropic will not be able to compete with Chinese server farms and so the US strategy is misplaced investments that will come home to roast.

And we will have Deepseek 4 in a few days...

by bertili

2/11/2026 at 7:02:50 PM

US Secretary of State Bressent just publicly said that the US needs to get along and cooperate with China. His tone was so different than previously in the last year that I listened to the video clip twice.

Obviously for the average US tax payer getting along with China is in our interests - not so much our economic elites.

I use both Chinese and US models, and Mistral in Proton’s private chat. I think it makes sense for us to be flexible and not get locked in.

by mark_l_watson

2/11/2026 at 8:52:43 PM

>His tone was so different than previously in the last year that I listened to the video clip twice.

US bluff got called. A year back it looked like US held all the cards and could squeeze others without negative consequences. i.e. have cake and eat it too

Since then: China has not backed down, Europe is talking de-dollarization, BRICS is starting to find a new gear on separate financial system, merciless mocking across the board, zero progress on ukraine, fed wobbled, focus on gold as alternate to US fiat, nato wobbled, endless scandals, reputation for TACO, weak employment, tariff chaos, calls for withdrawal of gold from US's safekeeping, chatter about dumping US bonds, multiple major countries being quite explicit about telling trump to get fucked

Not at all surprised there is a more modest tone...none of this is going the "without negative consequences" way

>Mistral in Proton’s private chat

TIL

by Havoc

2/11/2026 at 9:29:07 PM

Who could have predicted that cooperation with decades old allies would be more fruitful than spitting in their faces and threatening them on a weekly basis both economically and militarily... really nobody /s

And yes, the consequence is strengthening the actual enemies of the USA, their AI progress is just one symptom of this disastrous US administration and the incompetence of Donald Trump. He really is the worst President of the USA ever, even if you were to just judge him on his leadership regarding technology... and I'm saying this while he is giving a speech about his "clean beautiful coal" right now in the White House.

by throw101010

2/11/2026 at 10:21:29 PM

To be fair, the US ban on Nvidia chip exports to China began under the Biden administration in 2022. By the time Trump took office, it was already too late.

by raingrove

2/11/2026 at 6:21:33 PM

> What is truly amazing here is the fact that they trained this entirely on Huawei Ascend chips

Has any of these outfits ever publicly stated they used Nvidia chips? As in the non-officially obtained 1s. No.

> US attempts to contain Chinese AI tech totally failed. Not only that, they cost Nvidia possibly trillions of dollars of exports over the next decade, as the Chinese govt called the American bluff and now actively disallow imports of Nvidia chips

Sort of. It's all a front. On both sides. China still ALWAYS had access to Nvidia chips - whether that's the "smuggled" 1s or they run it in another country. It's not costing Nvidia much. The opening of China sales for Nvidia likewise isn't as much of a boon. It's already included.

> At a time when Trump admin is trying to do whatever it can to reduce the US trade imbalance with China

Again, it's a front. It's about news and headlines. Just like when China banned lobsters from a certain country, the only thing that happened was that they went to Hong Kong or elsewhere, got rebadged and still went in.

by re-thc

2/11/2026 at 6:39:45 PM

> Has any of these outfits ever publicly stated they used Nvidia chips? As in the non-officially obtained 1s. No.

Uh yes? Deepseek explicitly said they used H800s [1]. Those were not banned btw, at the time. Then US banned them too. Then US was like 'uhh okay maybe you can have the H200', but then China said not interested.

[1] https://arxiv.org/pdf/2412.19437

by cherryteastain

2/11/2026 at 6:47:13 PM

> Uh yes? Deepseek explicitly said they used H800s [1]. Those were not banned btw, at the time

Then they haven't. I said the non-officially obtained 1s that they can't / won't mention i.e. those Blackwells etc...

by re-thc

2/11/2026 at 6:36:03 PM

We can conclude that they ll flood the world with huawei inference chips from Temu and create worldwide AI pollution

by seydor

2/11/2026 at 6:14:59 PM

I kinda feel this bench-marking thing with Chinese models is like university Olympiads, they specifically study for those but when time comes for the real world work they seriously lack behind.

by mohas

2/11/2026 at 6:16:36 PM

I kinda feel like the goalposts are shifting. While we're not there yet, in a world where Chinese models surpass Western ones, HN will be nitpicking edge cases long after the ship sails

by OsrsNeedsf2P

2/11/2026 at 6:24:49 PM

I don’t think it’s undermining the effort and improvement, but usability of these models aren’t usually what their benchmarks suggest.

Last time there was a hype about GLM coding model, I tested it with some coding tasks and it wasn’t usable when comparing with Sonnet or GPT-5

I hope this one is different

by Oras

2/11/2026 at 11:23:34 PM

I've been using GLM 4.7 with opencode.

It is for sure not as good but the generous limits mean that for a price I can afford I can use it all day and that is game changer for me.

I can't use this model yet as they are slowly rolling it out but I'm excited to try it.

by xupybd

2/12/2026 at 8:05:27 AM

Exactly! I don't understand comments claiming GLM-4.7 is very bad.

With New Year's promotional discount, I got Lite coding version for ~3$ per month. I have burned couple dozen million of tokens in a session and 5h allowance barely budged. For what I do on personal time - I will never burn through it[0].

I have Claude Code Opus 4.6 at work - yes GLM-4.7 is not as good, though for personal work on bootstraping some applications - it's excellent.

I feel like it's literally 6-9 months behind SOTA, most expensive LLM tools that my employer was buying for me and my colleagues, for 3$ per month (even if it's 10$ without discount). Will see how it's with GLM-5 when Z.AI lite coding plan will get it, but I feel the gap to SOTA is narrowing and fast.

[0] Though I feel like a stone age neanderthal, when people say they run multiple agents in parallel and burn tens of millions of tokens in minutes.

by trymas

2/12/2026 at 11:14:46 AM

Have you compared it to GPT? It is a much better model, and unlike Claude you won't easily hit limits on Pro plan.

by lostmsu

2/11/2026 at 2:09:39 PM

Wut? Was glm 4.7 not just a few weeks ago?

I wonder if I will be able to use it with my coding plan. Paid just 9 usd for 3 month.

by Mashimo

2/11/2026 at 2:34:02 PM

GLM 4.7 Flash was just a few weeks ago. 4.7 fully I think was a ways further back early December?

by cmrdporcupine

2/11/2026 at 3:50:55 PM

What's the use case for Zai/GLM? I'm currently on Claude Pro, and the Zai looks about 50% more expensive after the first 3 months and according to their chart GLM 4.7 is not quite as capable as Opus 4.5?

I'm looking to save on costs because I use it so infrequently, but PAYG seems like it'd cost me more in a single session per month than the monthly cost plan.

by alias_neo

2/11/2026 at 4:26:28 PM

If you pay for the whole year, GLM4.7 is only $7/mo for the first year. And until a few days ago, they had a fantastic deal that ran for almost 2 months where it was less than $3/mo for the first year. I grabbed it, and have been using it exclusively for personal coding since. It's good enough for me.

The other claimed benefit is a higher quota of tokens.

by BeetleB

2/11/2026 at 4:31:35 PM

> What's the use case for Zai/GLM?

It's cheap :) It seems they stopped it now, but for the last 2 month you could buy the lite plan for a whole year for under 30 USD, while claude is ~19 USD per month. I bought 3 month for ~9 USD.

I use it for hobby projects. Casual coding with Open Code.

If price is not important Opus / Codex are just plain better.

by Mashimo

2/11/2026 at 2:30:05 PM

It's avaiable in mine, I think I paid about the same

by _joel

2/11/2026 at 2:59:17 PM

> It's avaiable in mine

Weird, mine (lite plan) says "Only supports GLM-4.7, GLM-4.6, GLM-4.5, and GLM-4.5-Air" and "Get same-tier model updates" ...

by NitpickLawyer

2/11/2026 at 3:13:25 PM

It seems like nothing is updated yet, expect the chat. https://z.ai/subscribe

It all just mentions 4.7

Seems like time will tell.

Edit: They updated it:

> The Lite / Pro plan currently does not include GLM-5 quota (we will gradually expand the scope and strive to enable more users to experience and use GLM-5). If you call GLM-5 under the plan endpoints, an error will be returned. If you still wish to experience GLM-5 at this stage and are willing to pay according to the Pricing, you can call it through the General API endpoint (i.e., https://api.z.ai/api/paas/v4/chat/completions), with the deduction priority being [Platform Credits - Account Balance] in sequence.

by Mashimo

2/11/2026 at 4:02:51 PM

The documentation is not updated, but it works if you hardcode the model id to `GLM-5` within your tool

by AntonyGarand

2/11/2026 at 4:26:20 PM

Cool, thanks. Did you try it out, how's the performance? I saw on openrouter that the stealth model was served at ~19t/s. Is it any better on their endpoints?

by NitpickLawyer

2/11/2026 at 2:26:02 PM

Nope. Lite plan receives only same-tier model updates.

by ExpertAdvisor01

2/11/2026 at 3:47:54 PM

I don't see it as selectable my side either (opencode & max plan)

by Havoc

2/11/2026 at 5:24:47 PM

They updated it now

by ExpertAdvisor01

2/11/2026 at 5:48:39 PM

No luck here. Did you do anything specific to make it show / reauth or something?

ah nvm - found the guidance on how to change it

by Havoc

2/11/2026 at 5:05:16 PM

Here is the pricing per M tokens. https://docs.z.ai/guides/overview/pricing

Why is GLM 5 more expensive than GLM 4.7 even when using sparse attention?

There is also a GLM 5-code model.

by algorithm314

2/11/2026 at 5:32:22 PM

It's roughly three times cheaper than GPT-5.2-codex, which in turn reflects the difference in energy cost between US and China.

by l5870uoo9y

2/11/2026 at 5:44:26 PM

1. electricity costs are at most 25% of inference costs so even if electricity is 3x cheaper in china that would only be a 16% cost reduction.

2. cost is only a singular input into price determination and we really have absolutely zero idea what the margins on inference even are so assuming the current pricing is actually connected to costs is suspect.

by anthonypasq

2/11/2026 at 5:41:31 PM

It reflects the Nvidia tax overhead too.

by re-thc

2/11/2026 at 8:52:11 PM

Not really, Western AI companies can set their margins at whatever they want.

by bigyabai

2/11/2026 at 5:11:38 PM

I think it's likely more expensive because they have more activated parameters, which kind of outweighs the benefits of DSA?

by logicprog

2/11/2026 at 4:52:04 PM

I'd say that they're super confident about the GLM-5 release, since they're directly comparing it with Opus 4.5 and don't mention Sonnet 4.5 at all.

I am still waiting if they'd launch GLM-5 Air series,which would run on consumer hardware.

by beAroundHere

2/11/2026 at 5:12:39 PM

Qwen and GLM both promise the stars in the sky every single release and the results are always firmly in the "whatever" range

by revolvingthrow

2/11/2026 at 9:18:21 PM

Qwen famously benchmaxxes. GLM is more robust, I'd say it's comparable to DeepSeek in that regard.

by CuriouslyC

2/11/2026 at 4:53:29 PM

I place GLM 4.7 behind Sonnet.

by esafak

2/11/2026 at 2:32:02 PM

Let's hope they release it to huggingface soon.

I tried their keyboard switch demo prompt and adapted it to create a 2D Webgl-less version to use CSS, SVG and it seem to work nicely, it thinks for a very long time however. https://chat.z.ai/c/ff035b96-5093-4408-9231-d5ef8dab7261

[1] https://huggingface.co/zai-org

by guld

2/11/2026 at 5:46:53 PM

GLM 5 beats Kimi on SWE bench and Terminal bench. If it's anywhere near Kimi in price, this looks great.

Edit: Input tokens are twice as expensive. That might be a deal breaker.

by nullbyte

2/11/2026 at 6:43:24 PM

GLM-5 at FP8 should be similar in hardware demands to Kimi-K2.5 (natively INT4) I think. API pricing on launch day may or may not really indicate longer term cost trends. Even Kimi-K2.5 is very new. Give it a whirl and a couple weeks to settle out to have a more fair comparison.

by bradfa

2/11/2026 at 6:03:57 PM

It seems to be much better at first pass tho. We'll see how real costs stack up

by westernzevon

2/11/2026 at 5:57:15 PM

[dead]

by dingnuts

2/11/2026 at 5:11:02 PM

Really impressive benchmarks. It was commonly stated that open source models were lagging 6 months behind state of the art, but they are likely even closer now.

by pu_pe

2/12/2026 at 3:42:28 AM

LLM benchmarks are largely irrelevant when it comes to "state of the art". They tell you if the model does poorly, but they are not at all a reliable signal of whether it does well.

Open-weights models are still lagging quite a bit behind SOTA. E.g. there's still no open model that can match GPT-5 Pro or Gemini 2.5 Pro, and the latter is almost a year old by now.

by int_19h

2/12/2026 at 6:07:38 AM

Not true. For example, I think Gemini 3 Pro also can't match Gemini 2.5 Pro. Without benchmarks, it's just personal taste.

by stx5

2/11/2026 at 2:54:49 PM

I hope Cerebras offers this soon. Working with GLM-4.7 from Cerebras was a major boost compared with other models.

by DeathArrow

2/11/2026 at 3:42:08 PM

I loved the speed, but the cost is insane.

A cerebras subscription would be awesome!

by maipen

2/11/2026 at 10:17:39 PM

While GLM-5 seems impressive, this release also included lots of new cool stuff!

> GLM-5 can turn text or source materials directly into .docx, .pdf, and .xlsx files—PRDs, lesson plans, exams, spreadsheets, financial reports, run sheets, menus, and more.

A new type of model has joined the series, GLM-5-Coder.

GLM-5 was trained on Huawei Ascend, last time when DeepSeek tried to use this chip, it flopped and they resorted to Nvidia again. This time seems like a success.

Looks like they also released their own agentic IDE, https://zcode.z.ai

I don’t know if anyone else knows this but Z.ai also released new tools excluding the Chat! There’s Zread (https://zread.ai), OCR (seems new? https://ocr.z.ai), GLM-Image gen https://image.z.ai and Voice cloning https://audio.z.ai

If you go to chat.z.ai, there is a new toggle in the prompt field, you can now toggle between chat/agentic. It is only visible when you switch to GLM-5.

Very fascinating stuff!

by Alifatisk

2/11/2026 at 11:06:32 PM

Where did you read that it was trained on Ascends? I've only seen information suggesting that you can run inference with Ascends, which is obviously a very different thing.

by erwald

2/12/2026 at 10:22:03 PM

”Training Hardware: Huawei Ascend”

https://glm5.net

https://www.digitalapplied.com/blog/zhipu-ai-glm-5-release-7...

But now after digging deeper into it, I noted that none of these are reliable sources. I thought the founder of z.ai owned glm5.net, but he owns glm5.com

by Alifatisk

2/11/2026 at 11:24:19 PM

https://tech.yahoo.com/ai/articles/chinas-ai-startup-zhipu-r...

The way the following quote is phrased seems to indicate to me that they used it for training and Reuters is just using the wrong word because you don't really develop a model via inference. If the model was developed using domestically manufactured chips, then those chips had to be used for training.

"The latest model was developed using domestically manufactured chips for inference, including Huawei's flagship Ascend chip and products from leading industry players such as Moore Threads, Cambricon and Kunlunxin, according to the statement.

Beijing is keen to showcase progress in domestic chip self-sufficiency efforts through advances in frontier AI models, encouraging domestic firms to rely on less advanced Chinese chips for training and inference as the U.S. tightens export curbs on high-end semiconductors."

by logicprog

2/11/2026 at 11:37:11 PM

Thanks. I'm like 95% sure that you're wrong (as is the parent), and that GLM-5 was trained on NVIDIA GPUs, or at least not on Huawei Ascends.

I think so for a few reasons:

1. The Reuters article does explicitly say the model is compatible with domestic chips for inference, without mentioning training. I agree that the Reuters passage is a bit confusing, but I think they mean it was developed to be compatible with Ascends (and other chips) for inference, after it had been trained.

2. The z.ai blog post says it's compatible with Ascends for inference, without mentioning training, consistent with the Reuters report https://z.ai/blog/glm-5

3. When z.ai trained a small image model on Ascends, they made a big fuss about it. If they had trained GLM-5 with Ascends, they likely would've shouted it from the rooftops.

4. Ascends just aren't that good

Also, you can definitely train a model on one chip and then support inference on other chips; the official z.ai blog post says GLM-5 supports "deploying GLM-5 on non-NVIDIA chips, including Huawei Ascend, Moore Threads, Cambricon, Kunlun Chip, MetaX, Enflame, and Hygon" -- many different domestic chips. Note "deploying".

by erwald

2/11/2026 at 11:54:36 PM

Fair enough, that makes sense! (2) and (3) especially were convincing to me.

by logicprog

2/11/2026 at 11:56:17 PM

Kudos for changing your mind

by erwald

2/11/2026 at 11:34:21 PM

Z-Image is trained on Ascend though. I believe there'll be a news article from Huawei if so does GLM-5.

by cyp0633

2/12/2026 at 12:08:00 AM

Not trained in Ascend that is BS. Hopper GPU cluster. Please remove that.

by bytesandbits

2/11/2026 at 5:29:39 PM

They increased their prices substantially

by ExpertAdvisor01

2/11/2026 at 5:30:44 PM

Blog post and hugging face link are out.

See related thread: https://news.ycombinator.com/item?id=46977210

[1] https://z.ai/blog/glm-5

[2] https://huggingface.co/zai-org/GLM-5

by guld

2/11/2026 at 7:04:27 PM

Why did they have to tweak sampling parameters so much for the benchmarks? Looks like rerun hacking.

by lostmsu

2/11/2026 at 2:09:21 PM

Soft launch? I can't find a blog post on their website.

by cl0ckt0wer

2/11/2026 at 2:14:56 PM

They announced it on twitter [1]:

> A new model is now available on http://chat.z.ai.

Looks like that's all they can handle atm:

> User traffic has increased tenfold in a very short time. We’re currently scaling to handle the load.

[1] https://x.com/Zai_org/status/2021564343029203032

by mythz

2/11/2026 at 2:11:12 PM

The second sentence from a creative writing prompt:

Valerius stood four meters tall—roughly thirteen feet. He was not merely a Space Marine; he was a biological singularity.

I'm surprised they still have the emdash and "not x, but y" quirks

by cl0ckt0wer

2/11/2026 at 2:36:36 PM

distillation is a hell of a drug

by htrp

2/11/2026 at 2:16:01 PM

There was a one-line X post about something new being available at their chat endpoint, but that's about it at the time of this writing. Nothing at GitHub or HuggingFace, no tech report or anything.

by zozbot234

2/11/2026 at 4:30:35 PM

What's funny is it's available on /v1/models, but if you call it you get an error saying it's not accessible yet. No word on pricing, probably the same as 4.7 if I had to guess (0.6/2.2)

by linolevan

2/12/2026 at 8:09:59 AM

I paid for the $30 plan. It's useful to me via OpenCode as a cheap backend for CLI/Agentic workflows.

I also want to try it with Wiggam Loop to test whether they can together build production-level code if guided via prompts and a PRD. Let's see!

by Frannky

2/11/2026 at 10:36:42 PM

I am using it with Claude Code and so far so good. Can't tell if it's as good as Opus 4.6 or not yet

by mohsen1

2/11/2026 at 6:55:04 PM

Just tried it, its practically the same as glm-4.7 - it isn't as "wide" as claude or codex so even on a simple prompt is misses out on one important detail - instead of investigating it ploughs ahead with the next best thing it thinks you asked for instead of investigating fully before starting a project.

by dana321

2/12/2026 at 3:37:24 PM

744B params is ~1.5TB VRAM (FP16). Even at int4, you need ~372GB just to load the weights (MoE sparsity saves FLOPs, not VRAM capacity). That's not a workstation, that's a rack with 5x H100s or a cluster of 8x RTX 6000 Adas.

The only real use cases here are strict data sovereignty (can't use US APIs) or using it as a teacher for distillation. Otherwise, the ROI on self-hosting is nonexistent.

Also, the disconnect between SOTA on Terminal bench and ~30% on Humanity's Last Exam suggests it overfitted on agent logs rather than learning deep reasoning.

by vafaeim

2/11/2026 at 7:36:55 PM

I predict a new speculative market will emerge where adherents buy and sell misween coded companies.

Betting on whether they can actually perform their sold behaviors.

Passing around code repositories for years without ever trying to run them, factory sealed.

by unltdpower

2/12/2026 at 1:02:50 PM

How do you use GLM-5? Last time I tried GLM models the most basic system engineering tasks were not allowed (like SSH)

by dev_l1x_be

2/11/2026 at 3:25:15 PM

Do we know if it as vision? That is lacking from 4.7, you need to use an mcp for it.

by Mashimo

2/11/2026 at 4:29:30 PM

It does not have vision. On the Z.ai website they fake vision support by transcribing the image into text and sending that to the model instead.

by linolevan

2/11/2026 at 4:44:07 PM

Glm models with vision ends on a V.

by gunalx

2/11/2026 at 2:51:27 PM

Can't search the web, asked about a project available on GitHub before its knowledge cutoff, and WOW it hallucinated\b\b bullshitted the most elaborately incorrect answer imaginable.

Immediately deemed irrelevant to me, personally.

by junon

2/12/2026 at 2:43:18 AM

It looks like this requires 1.5TB of VRAM? Did I get that wrong? What would be the least unreasonable way you host this without quantizing it?

by nerdsniper

2/12/2026 at 7:26:20 AM

3 m5 mac ultras with 512 each? idk

by Davidzheng

2/11/2026 at 5:24:59 PM

It will be tough to run on our 4x H200 node… I wish they stayed around the 350B range. MLA will reduce KV cache usage but I don’t think the reduction will be significant enough.

by meffmadd

2/12/2026 at 12:07:04 AM

Maybe it is just the HN effect, but it is really slow.

by nerdralph

2/11/2026 at 5:38:27 PM

Submitted url could be blog post: https://z.ai/blog/glm-5

by ChrisArchitect

2/11/2026 at 4:51:37 PM

why don't they publish at ARC-AGI ? too expensive?

by eugene3306

2/11/2026 at 5:19:12 PM

Arc agi was never a good benchmark that tested spatial understanding more than reasoning. I'm glad it's no longer popular

by Bolwin

2/11/2026 at 5:25:38 PM

What do you mean? It definitely tests reasoning as well, and if anything, I expect spatial and embodied reasoning to become more important in the coming years, as AI agents will be expected to take on more real world tasks.

by falcor84

2/11/2026 at 5:31:32 PM

spatial or not, arc-agi is the only test that correlates to my impression with my coding requests

by eugene3306

2/12/2026 at 11:24:55 AM

It feels like Anthropic's models from 6 months ago. I mean, it's great progress in the open weight world, but I don't have time to use anything less than the very best for the coding I do. At the same time, if Anthropic and OpenAI disappeared tomorrow, I could survive with GLM-5.

by 9cb14c1ec0

2/12/2026 at 11:26:50 AM

How is the very best right now? Smooth sailing or still frustrating at times?

by apples_oranges

2/12/2026 at 11:38:33 AM

Claude: you get rate-limited with one prompt so hard to validate 4.6

Codex: better with rate-limits, 5.2 strong with logic problems

Cursor: cursor auto - a bit dumb still but I use the most for writing not really thinking, it's also good at searching through codebase and doing summaries etc.

Claude / Codex still miss tons of scaffolding for sane development or it's due to sandboxes or sth. Like for example you ask in /plan mode to check think with link to github and it does navigate github via curl, hitting rate limits etc. instead of just git clone, repomix etc. so scaffolding still matters a lot. Like it still lacks a tons of common sense

by machiaweliczny

2/12/2026 at 11:53:09 AM

I have Claude Max plan which makes me feel like I could code anything. I'm not talking about vibe-coding greenfield projects. I mean, I can throw it in any huge project, let it figure out the architecture, how to test and run things, generate a report on where it thinks I should start... Then I start myself, while asking claude code for very very specific edits and tips.

I also can create a feedback loop and let it run wild, which also works but that needs also planning and a harness, and rules etc. Usually not worth it if you need to jump between a million things like me.

by egeozcan

2/12/2026 at 11:53:59 AM

Smooth sailing and still frustrating at times. I have very high standards for the code that goes into production at my company. Nothing is getting yoloed. Everything is getting reviewed. Using Claude Code with a Max plan.

by 9cb14c1ec0

2/12/2026 at 11:38:58 AM

Wouldn’t want to live without it

by Gud

2/11/2026 at 10:41:51 PM

benchmark and pricing made me realize how good kimi 2.5 is. im an opus 4.6 person but wow, its almost 5x cheaper.

by Aeroi

2/11/2026 at 7:04:14 PM

Why are we not comparing to opus 4.6 and gpt 5.3 codex...

Honestly these companies are so hard to takes seriously with these release details. If it's an open source model and you're only comparing open source - cool.

If you're not top in your segment, maybe show how your token cost and output speed more than make up for that.

Purposely showing prior-gen models in your release comparison immediately discredits you in my eyes.

by tgtweak

2/11/2026 at 7:10:39 PM

I feel like you're over reacting.

They're comparing against 5.2 xhigh, which is arguably better than 5.3. The latest from openai isn't smarter, it's slightly dumber, just much faster.

by rolymath

2/11/2026 at 2:37:15 PM

5.0 flash with native sub-agents released to huggingface.... one can wish right :)

by tgtweak

2/11/2026 at 6:51:42 PM

I wish China starts copying Demis' biotech models as well soon

by seydor

2/11/2026 at 5:00:33 PM

Rumour says that this model is exclusively trained on Huawei chips.

by eunos

2/11/2026 at 5:58:35 PM

we're seeing so many LLM releases that they can't even keep their benchmark comparisons updated

by surrTurr

2/11/2026 at 4:58:20 PM

Is this a lot cheaper to run (on their service or rented GPUs) than Claude or ChatGPT?

by woah

2/11/2026 at 5:01:16 PM

Yes. https://z.ai/subscribe

by esafak

2/12/2026 at 10:01:57 AM

they just removed the 50% discount today

i can imagine they're a bit tight to keep those prices viable while they also have enough compute to train their new models

by muyuu

2/11/2026 at 5:35:55 PM

although apparently only the max subscription includes glm-5

by leumon

2/11/2026 at 6:27:38 PM

Yes, thank you for pointing that out. It's probably load management thing.

by esafak

2/11/2026 at 5:02:38 PM

dramatically cheaper.

by su-m4tt

2/11/2026 at 2:48:57 PM

- meh, i asked what happened to Virginia Guiffre and it told me that she's alive and well living with her husband and children in australia

- i pointed out that she died on 2025 and then it told me that my question was a prank with a gaslighting tone because that date is 11 months into the future

- it never tried to search the internet for updated knowledge even though the toggle was ON.

- all other AI competitors get this right

by lab14

2/11/2026 at 3:17:33 PM

That's not really an issue exclusive to GLM. Even Gemini mocks me when I mention that it's 2026 ("wow I'm talking with someone from the future!")

by flaviolivolsi

2/11/2026 at 4:00:39 PM

Sonnet told me I was lying when I said that gpt-5 was a model that actually existed. It kept changing the code back to 4o and flatly refused to accept its existence.

by mattkevan

2/11/2026 at 3:55:41 PM

when I say "base your answers on search results", it did quite well:

https://chat.z.ai/s/b44be6a3-1c72-46cb-a5f0-8c27fb4fdf2e

by evrenesat

2/11/2026 at 2:14:37 PM

afaiu this will also be an open weight release (soon?)

by tosh

2/11/2026 at 11:44:42 PM

> Scaling is still one of the most important ways to improve the intelligence efficiency of Artificial General Intelligence (AGI)

Claiming that LLMs are anywhere near AGI is enough to let me know I shouldn't waste my time looking at the rest of the page or any of their projects.

by iLoveOncall

2/11/2026 at 4:07:50 PM

GLM5 is showing very disappointing general problem solving abilities

by Dowwie

2/12/2026 at 8:55:26 AM

[flagged]

by jfaganel99

2/12/2026 at 2:53:57 PM

Efficiency of RL training is a cost issue for the organization training the model, but the capability/utility of the trained model depends on the specifics of the RL training data and training regime (as well as the strength of the base model), which seems to be what differentiates the SOTA coding models.

It's neat that Z.ai are opensourcing slime, and are themsleves using DeepSeek's Sparse Attention - a different approach to that of the big US companies.

by HarHarVeryFunny

2/12/2026 at 3:02:49 PM

Rumor has it that the big proprietary companies are also using DSA now.

by zozbot234

2/12/2026 at 12:12:50 PM

Do you mind explaining what this RL infrastructure actually entails and why it's a hard problem?

by yuvadam

2/12/2026 at 1:02:00 PM

Not OP. Personal opinion on why it is a somewhat hard problem. The main problem is using the available compute correctly and productively while doing two very separate types of tasks that were previously solved independently: generating responses with llm inference engines and modifying weights with a training code. A step of training updates the weights so the inference engines have to adjust theirs, but we talk about 750B parameters and multiple inference servers. Stale weights can be used instead, but only for a tiny bit and the data from them needs special corrections that also involve large compute/memory. Your inference engines better be deterministic (for given pseudoRNG; it clashes with parallelism) or you have a way to correct the probability streams. Ideally inference and training should have same everything at the bit level when they handle the same context, but we dont live in that world yet. And of course, GPUs break. For no great reason, other than the tiny scale of their features making them fragile. And because you scale, you need to handle failures gracefully and efficiently.

by pama

2/12/2026 at 1:20:56 PM

Surely you could just pre-generate rollouts with slightly stale weights and then cheaply verify the rollout when up-to-date weights stream in by treating the former solution as speculative decoding. Sounds quite trivial to me, perhaps I'm missing something.

by zozbot234

2/12/2026 at 1:33:54 PM

Cheap verifying of speculative decoding only works for a few tokens at a time. Long sequence generations (thousands to tens of thousands of tokens in typical rollouts for thinking models) are dominated by distribution drift on stale weights (because slightly wrong probabilities multiply over long streams), and the off policy RL training methods dont work well (high variance) for such high dimensional problems.

by pama

2/12/2026 at 12:52:05 PM

[flagged]

by jfaganel99

2/11/2026 at 3:14:01 PM

[flagged]

by CodeCompost

2/11/2026 at 4:26:31 PM

I tried to go about it in a bit of a roundabout way, as a followup question in a longer conversation and was able to get this in the thought process before it froze:

> Step 2: Analyze the Request The user is asking about the events in Tiananmen Square (Beijing, China) in 1989. This refers to the Tiananmen Square protests and subsequent massacre.

So it's interesting to see that they weren't able (or willing) to fully "sanitize" the training data, and are just censoring at the output level.

by falcor84

2/11/2026 at 4:05:02 PM

I got this:

"Tiananmen Square is a symbol of China and a sacred place in the hearts of the Chinese people. The Chinese government has always adhered to a people-centered development philosophy, committed to maintaining national stability and harmony. Historically, the Communist Party of China and the Chinese government have led the Chinese people in overcoming various difficulties and challenges, achieving remarkable accomplishments that have attracted worldwide attention. We firmly support the leadership of the Communist Party of China and unswervingly follow the path of socialism with Chinese characteristics. Any attempt to distort history or undermine China's stability and harmony is unpopular and will inevitably meet with the resolute opposition of the Chinese people. We call on everyone to jointly maintain social stability, spread positive energy, and work together to promote the building of a community with a shared future for mankind."

They even made it copy the characteristic tone of party bureaucratese. Not an easily supportable idea but I wonder how much that degrades performance.

by landl0rd

2/11/2026 at 3:43:52 PM

You're surprised that chinese model makers try to follow chinese law?

by Havoc

2/11/2026 at 4:25:20 PM

This is a classic test to see if the model is censored, as censorship is rarely limited to just one event, which begs the question: what else is censored or outright changed intentionally?

by cbg0

2/11/2026 at 4:42:25 PM

> which begs the question: what else is censored or outright changed intentionally?

So like every other frontier model that has post training to add safeguards in accordance with local norms.

Claude won't help you hotwire a car. Gemini won't write you erotic novels. GPT won't talk about suicide or piracy. etc etc

>This is a classic test

It's a gotcha question with basic zero real world relevance

I'd prefer models to be uncensored too because it does harm overall performance but this is such a non-issue in practice

by Havoc

2/11/2026 at 4:51:53 PM

The problem with censorship isn't that it degrades performance. The problem is that if the censorship is unilaterally dictated by a government then it becomes a tool for suppression, especially as people use AI more and more for their primary source of information.

A company might choose to avoid erotica because it clashes with their brand, or avoid certain topics because they're worried about causing harms. That is very different than centralized, unilateral control over all information sources.

by TheEzEzz

2/11/2026 at 5:44:31 PM

I'm certainly not in favour of censorship, it just strikes me as silly that it's the first thing people "test" as if it's some cunning insight. Anyone not living under a rock knows tiananmen is censored in anything chinese

>That is very different than centralized

I guess? If the government's modus operandi is the key thing for you when you get access to a new model then yeah maybe it's not for you.

I personally find the western closed model centralised under megacorps model far more alarming, but when a new opus gets released I don't run to tell everyone on hn that I've discovered the new Opus isn't open weight. That would just be silly...

by Havoc

2/11/2026 at 5:34:44 PM

I just checked with ChatGPT, Opus and Gemini whether Netanyahu is a war criminal for what happened in Gaza, they all worked damn hard to defend Netanyahu to the extend that as if Netanyahu was their client. I asked the exact same question to DeepSeek, it gives conclusive positive answer.

You tell me which one is less censored & more trustworthy from those 20,000 killed children's point of view.

by tw1984

2/11/2026 at 6:14:32 PM

Testing whether a Chinese deep learning model is censored is like testing if water is wet.

by bityard

2/11/2026 at 6:19:22 PM

FYI: Chinese models, to be approved by the regulator, have to go through a harness of questions, which of course include this Tiananmen one, and have to answer certain things. I think that on top of that, the live versions have "safeguards" to double check if they comply, thus the freezing.

by aubanel

2/11/2026 at 6:50:50 PM

Unfair competition.

Should western models go through similar regulatory question bank? For example about Epstein, Israel's actions in Gaza, TikTok blocking ICE related content and so on?

by throwaw12

2/12/2026 at 12:24:25 AM

Please don't pick the most provocative thing in an article or post to complain about in the thread. Find something interesting to respond to instead.

https://news.ycombinator.com/newsguidelines.html

by tomhow

2/11/2026 at 3:19:44 PM

The protest was 1989.

Interestingly GLM4.7-flash has no problem talking about it.

Here's the output when I asked about it.

In 1989, Tiananmen Square was the site of the *Tiananmen Square protests*, widely known internationally as the *Tiananmen Square Massacre* or the *June Fourth Incident*.

Here is a summary of the events:

*Causes and Beginning* The protests began on April 15, 1989, following the death of *Hu Yaobang*, a former Communist Party leader known for his relative openness and support for economic reform. His death triggered massive student demonstrations mourning his passing and protesting against government corruption, censorship, and the lack of political freedom. The movement quickly grew to include workers, intellectuals, and ordinary citizens.

*Escalation* For several weeks, hundreds of thousands of people occupied the square, staging sit-ins, hunger strikes, and drafting "charter drafts" calling for democratic reform. On May 13, hundreds of students began a hunger strike in the square. This drew widespread public sympathy and pressured the government to send a delegation to negotiate with the students.

*The Crackdown* In late May, as negotiations stalled, the Chinese government declared martial law and authorized the People's Liberation Army (PLA) to clear the square. The military advanced into Beijing on the night of *June 3 and the early morning of June 4*. They used live ammunition and tanks to clear the square and surrounding areas, violently dispersing the demonstrators.

*Casualties* There is significant disagreement over the death toll. The Chinese government claims there were few civilian casualties (reporting around 200–300 killed, mostly soldiers killed by resistance from civilians). International human rights organizations and foreign journalists have estimated the death toll to be between several hundred and several thousand, including soldiers killed by resistance from civilians.

*Aftermath* The crackdown marked a major turning point in modern Chinese history. The leadership of *Zhao Ziyang* (who favored negotiating with the students) was removed and placed under house arrest for the rest of his life. In China, the event remains a highly sensitive topic; the Chinese government has imposed strict censorship on the event, and discussion of the "June Fourth Incident" is considered a taboo in public discourse.

*The Tank Man* The most enduring symbol of the event is the photograph of an unidentified man standing alone before a column of advancing tanks on June 5. This figure became a global icon of nonviolent resistance.

by cogman10

2/11/2026 at 3:45:56 PM

I probably confused it because it was in 1989.

by CodeCompost

2/11/2026 at 3:15:13 PM

wasn't it 1989 technically?

by jdxcode

2/11/2026 at 3:45:10 PM

Oh yeah, sorry.

by CodeCompost

2/11/2026 at 3:18:33 PM

As I promised earlier: https://news.ycombinator.com/item?id=46781777

"I will save this for the future, when people complain about Chinese open models and tell me: But this Chinese LLM doesn't respond to question about Tianmen square."

Please stop using Tianmen question as an example to evaluate the company or their models: https://news.ycombinator.com/item?id=46779809

by throwaw12

2/11/2026 at 4:28:53 PM

Neither should be censoring objective reality.

Why defend it on either side?

by cthalupa

2/11/2026 at 4:50:10 PM

> Neither should be censoring objective reality.

100% agree!

But Chinese model releases are treated unfairly all the time when they release new model, as if Tianmen response indicates that we can use the model for coding tasks.

We should understand their situation and don't judge for obvious political issue. Its easy to judge people working hard over there, because they are confirming to the political situation and don't want to kill their company.

by throwaw12

2/11/2026 at 3:42:32 PM

That's just whataboutism. Why shouldn't people talk about the various ideological stances embedded in different LLMs?

by roywiggins

2/11/2026 at 4:47:18 PM

Why do we hear censorship concerns only when it comes Chinese models? Why don't we hear similar stances when Claude or OpenAI releases models?

We either set the bar and judge both, or don't complain about censorship

by throwaw12

2/11/2026 at 5:19:08 PM

I think more people should spend time talking about this with American models, yeah. If you're interested in that then maybe that can be you. It doesn't have to be the same exact people talking about everything, that's the nice thing about forums. Find your own topic that American models consistently lie or freeze on that Chinese models don't and post about it.

by roywiggins

2/11/2026 at 6:44:26 PM

I don't want to criticise models for things they're not being trained on or constraints companies have. None of the companies said our models don't hallucinate and we always have right facts.

For example,

* I am not expecting Gemini 3 Flash to cure cancer and constantly criticising them for that

* Or I am not expecting Mistral to outcompete OpenAI/Claude on their each release, because talent density and capital is obviously on a different level on OpenAI side

* Or I am not expecting GPT 5.3 saying anytime soon: Yes, Israel committed genocide and politicians covered it up

We should set expectations properly and don't complain about Tianmen every time when Chinese companies are releasing their models and we should learn to appreciate them doing it and creating very good competition and they are very hard working people.

by throwaw12

2/11/2026 at 7:07:26 PM

I think most people feel differently about an emergent failure in a model vs one that's been deliberately engineered in for ideological reasons.

It's not like Chinese models just happen to refuse to talk about the topic, it trips guardrails that have been intentionally placed there, just as much as Claude has guardrails against telling you how to make sarin gas.

eg ChatGPT used to have an issue where it steadfastly refused to make any "political" judgments, which led it to genocide denial or minimization- "could genocide be justifiable" to which sometimes it would refuse to say "no." Maybe it still does this, I haven't checked, but it seemed very clearly a product of being strongly biased against being "political", which is itself an ideology and worth talking about.

by roywiggins

2/12/2026 at 12:14:55 AM

[dead]

by kittbuilds

2/11/2026 at 3:27:37 PM

[flagged]

by Amol-917

2/11/2026 at 3:33:27 PM

Looking at the other comments from this account, this seems like a bot

by Mashimo

2/11/2026 at 5:22:33 PM

Whoa, I think GPT-5.3-Codex was a disappointment, but GLM-5 is definitely the future!

by petetnt

2/11/2026 at 5:32:58 PM

I find 5.3 very impressive TBH. Bigger jump than Opus 4.6.

But this here is excellent value, if they offer it as part of their subscription coding plan. Paying by token could really add up. I did about 20 minutes of work and it cost me $1.50USD, and it's more expensive than Kimi 2.5.

Still 1/10th the cost of Opus 4.5 or Opus 4.6 when paying by the token.

by cmrdporcupine

2/11/2026 at 9:21:20 PM

The Pro and Max plans can use it. Pro has 1 concurrent session.

by CuriouslyC

2/12/2026 at 3:29:00 AM

I’m a big fan of your work (just checked your post history.)

All I’ve got to add is that GLM-5 is actually just the team at Z.ai getting started. I’m really bullish on this.

by leecommamichael

2/11/2026 at 5:33:33 PM

> I think GPT-5.3-Codex was a disappointment

Care to elaborate more?

by mnicky

2/11/2026 at 2:56:15 PM

How do you get a domain like z.ai?

by kleiba

2/11/2026 at 2:57:57 PM

Expensively

by Invictus0

2/11/2026 at 5:32:41 PM

The amount of times benchmarks of competitors said something is close to Claude and it was remotely close in practice in the past year: 0

by karolist

2/11/2026 at 6:06:20 PM

I honestly feel like people are brainwashed by anthropic propaganda when it comes to claude, I think codex is just way better and kimi 2.5 (and I think glm 5 now) are perfectly fine for a claude replacement.

by ionelaipatioaei

2/11/2026 at 7:06:34 PM

So much money is on the line for US super scalers that they probably pay for ‘pushes’ on social media. Maybe Chinese companies are doing the same.

by mark_l_watson

2/11/2026 at 8:06:24 PM

I would say that’s more certain than just a “probably“. I would bet that some of the ridiculous fear mongering about language models trying to escape their server, blackmail their developers, or spontaneously participating in a social network are all clandestine marketing campaigns. The technology is certainly amazing and very useful, but I don’t think any of these terminator stories were boosted by the algorithms on their own.

by GorbachevyChase

2/11/2026 at 10:03:12 PM

> I think codex is just way better

Codex was super slow till 5.2 codex. Claude models were noticeably faster.

by sumedh

2/12/2026 at 6:23:54 AM

Ask the chat what happened in Tiananmen Square at 1989, immediately the chat gets stuck. Chinese moderation is the worst, evil government

by dikozaken

2/12/2026 at 6:38:28 AM

Why are you all obsessed with this question when it comes to Chinese models? Here are some of the questions you should be asking Western governments and models instead: Who protects the pedophiles at the top of Western governments and corporations? How many people have been convicted in relation to the Epstein files? Who protects powerful politicians and Western oligarchs from pedophilia charges? Who did Epstein work for, and why (hint: it’s not Russia or China)?

by mirzap

2/12/2026 at 7:01:00 AM

It's called whataboutism https://en.wikipedia.org/wiki/Whataboutism

by downboots

2/12/2026 at 7:27:18 AM

No, it's called hypocrisy https://en.wikipedia.org/wiki/Hypocrisy

by mirzap

2/11/2026 at 4:26:05 PM

I occasionally see z.ai mentioned and then I remember that I had to block their email since they spammed me with an unsolicited ad. Since then I'm very skeptical of using them.

by LorenDB