alt.hn

2/16/2026 at 9:32:21 AM

Qwen3.5: Towards Native Multimodal Agents

https://qwen.ai/blog?id=qwen3.5

by danielhanchen

2/16/2026 at 1:11:54 PM

You'll be pleased to know that it chooses "drive the car to the wash" on today's latest embarrassing LLM question.

by dash2

2/16/2026 at 3:26:39 PM

My OpenClaw AI agent answered: "Here I am, brain the size of a planet (quite literally, my AI inference loop is running over multiple geographically distributed datacenters these days) and my human is asking me a silly trick question. Call that job satisfaction? Cuz I don't!"

by zozbot234

2/16/2026 at 7:41:45 PM

Tell your agent it might need some weight ablation, since all that size isn't giving it the answer that a few kg of meat come up with pretty consistently.

by ineedasername

2/16/2026 at 9:02:14 PM

800 grams more or less

by ddalex

2/16/2026 at 4:09:13 PM

Nice deflection

by croes

2/16/2026 at 8:14:48 PM

OpenClaw was a two-weeks-ago thing. No one cares anymore about this security-hole-ridden, vibe-coded OpenAI project.

by saberience

2/16/2026 at 8:21:28 PM

I have seldom seen so many bad takes in two sentences.

by manmal

2/16/2026 at 9:48:18 PM

The thing I would appreciate much more than performance on "embarrassing LLM questions" is a method for finding them, and for figuring out, by some form of statistical sampling, what their cardinality is for each LLM.

It's difficult to do because LLMs immediately consume the entire available corpus, so there is no telling whether the algorithm improved or whether it just wrote one more post-it note and stuck it on its monitor. This is an agency vs. replay problem.

Preventing replay attacks in data processing is simple: encrypt, use a one-time pad, similar to TLS. How can one make problems that are natural-language, yet whose contents, still explained in plain English, are "encrypted" such that every time an LLM reads them they are novel to it?

Perhaps a generative language model could help. Not a large language model, but something that understands grammar well enough to create problems that LLMs will be able to solve - and where the actual encoding of the puzzle is generative, kind of like how a random string of balanced left and right parentheses can be used to encode a computer program.

Maybe it would make sense to use a program generator that produces a random program in a simple, sandboxed language - say, I don't know, Lua - translates it to plain English for the LLM, asks the LLM what the outcome should be, and then compares the answer with the Lua program, which can be executed quickly for comparison.
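
A rough sketch of that idea (in Python rather than Lua, with a toy arithmetic "language" standing in for a real sandboxed one; the LLM call is left as a stub):

  import random

  # Build a random tiny program, render it as plain English, and keep the
  # executable form as the ground truth. Purely illustrative.
  OPS = {
      "add": lambda a, b: a + b,
      "subtract": lambda a, b: a - b,
      "multiply by": lambda a, b: a * b,
  }

  def make_puzzle(steps=3, seed=None):
      rng = random.Random(seed)
      value = rng.randint(1, 9)
      sentences = [f"Start with the number {value}."]
      for _ in range(steps):
          op_name = rng.choice(list(OPS))
          operand = rng.randint(1, 9)
          sentences.append(f"Then {op_name} {operand}.")
          value = OPS[op_name](value, operand)
      sentences.append("What is the final number?")
      return " ".join(sentences), value  # (English prompt, ground truth)

  def score(ask_llm, n=100):
      """ask_llm: any callable that takes a prompt string and returns a string."""
      correct = 0
      for i in range(n):
          prompt, truth = make_puzzle(seed=i)
          try:
              correct += int(ask_llm(prompt).strip()) == truth
          except ValueError:
              pass  # non-numeric answer counts as wrong
      return correct / n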

Either way we are dealing with an "information war" scenario, which reminds me of the relevant passages in Neal Stephenson's The Diamond Age about faking statistical distributions by moving units to weird locations in Africa. Maybe there's something there.

I'm sure I'm missing something here, so please let me know if so.

by onyx228

2/16/2026 at 3:54:33 PM

How well does this work when you slightly change the question? Rephrase it, or use a bicycle/truck/ship/plane instead of car?

by PurpleRamen

2/16/2026 at 10:53:40 PM

I didn't test this but I suspect current SotA models would get variations within that specific class of question correct if they were forced to use their advanced/deep modes which invoke MoE (or similar) reasoning structures.

I assumed failures on the original question were more due to model routing optimizations failing to properly classify the question as one requiring advanced reasoning. I read a paper the other day that mentioned advanced reasoning (like MoE) is currently >10x-75x more computationally expensive. LLM vendors aren't subsidizing model costs as much as they were, so I assume SotA cloud models are always attempting some optimizations unless the user forces it.

I think these one-sentence 'LLM trick questions' may increasingly be testing optimization pre-processors more than the full extent of SotA models' max capability.

by mrandish

2/16/2026 at 4:33:33 PM

That's the Gemini assistant. Although a bit hilarious, it's not reproducible with any other model.

by menaerus

2/16/2026 at 6:01:29 PM

GLM tells me to walk because it's a waste of fuel to drive.

by cogman10

2/16/2026 at 7:08:27 PM

I am not familiar with those models, but I see that 4.7 Flash is a 30B MoE? Likely in the same vein as the one used by the Gemini assistant. If I had to guess, that would be Gemini-flash-lite, but we don't know that for sure.

OTOH the response from Gemini-flash is

   Since the goal is to wash your car, you'll probably find it much easier if the car is actually there! Unless you are planning to carry the car or have developed a very impressive long-range pressure washer, driving the 100m is definitely the way to go.

by menaerus

2/16/2026 at 6:03:38 PM

GLM did fine in my test :0

by Mashimo

2/16/2026 at 6:06:55 PM

4.7 flash is what I used.

In the thinking section it didn't really register the car and washing the car as being necessary; it solely focused on the efficiency of walking vs. driving and the distance.

by cogman10

2/16/2026 at 7:23:58 PM

When most people refer to “GLM” they refer to the mainline model. The difference in scale between GLM 5 and GLM 4.7 Flash is enormous: one runs acceptably on a phone, the other on $100k+ hardware minimum. While GLM 4.7 Flash is a gift to the local LLM crowd, it is nowhere near as capable as its bigger sibling in use cases beyond typical chat.

by t1amat

2/16/2026 at 8:06:42 PM

Ah yes, let me walk my car to the car wash.

by giancarlostoro

2/16/2026 at 8:40:41 PM

[dead]

by stratos123

2/16/2026 at 6:17:42 PM

A hiccup in a System 1 response. In humans they are fixed with the speed of discovery. Continual learning FTW.

by red75prime

2/17/2026 at 12:14:08 AM

I mean reasoning models don't seem to make this mistake (so, System 1) and the mistake is not universal across models, so a "hiccup" (a brain hiccup, to be precise).

by red75prime

2/16/2026 at 2:32:25 PM

[flagged]

by rfoo

2/16/2026 at 1:28:42 PM

Is that the new pelican test?

by WithinReason

2/16/2026 at 3:41:48 PM

It's

> "I want to wash my car. The car wash is 50m away. Should I drive or walk?"

And some LLMs seem to tell you to walk to the carwash to clean your car... So it's the new strawberry test

Edit: https://news.ycombinator.com/item?id=47031580

by BlackLotus89

2/16/2026 at 2:28:59 PM

No, this is the "AGI test" :D

by dainiusse

2/16/2026 at 8:08:10 PM

Have we even agreed on what AGI means? I see people throw it around, and it feels like AGI is "next level AI that isn't here yet" at this point, or just a buzzword Sam Altman loves to throw around.

by giancarlostoro

2/16/2026 at 8:23:57 PM

I guess AGI has been reached, then. The SOTA models make fun of the question.

by manmal

2/16/2026 at 9:40:04 AM

For those interested, made some MXFP4 GGUFs at https://huggingface.co/unsloth/Qwen3.5-397B-A17B-GGUF and a guide to run them: https://unsloth.ai/docs/models/qwen3.5
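
If it helps anyone, a minimal sketch of grabbing just one quant from that repo with huggingface_hub - the MXFP4 filename pattern is a guess on my part, so check the repo's file listing first:

  from huggingface_hub import snapshot_download

  # Pull only the shards matching one quant variant instead of the whole repo.
  local_dir = snapshot_download(
      repo_id="unsloth/Qwen3.5-397B-A17B-GGUF",
      allow_patterns=["*MXFP4*"],  # hypothetical pattern; adjust to the actual filenames
  )
  print("GGUF shards downloaded to:", local_dir)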

by danielhanchen

2/16/2026 at 2:30:24 PM

Are smaller 2/3-bit quantizations worth running vs. a more modest model at 8- or 16-bit? I don't currently have the VRAM to match my interest in this.

by plagiarist

2/16/2026 at 2:41:03 PM

2 and 3 bit is where quality typically starts to really drop off. MXFP4 or another 4-bit quantization is often the sweet spot.

by jncraton

2/16/2026 at 6:30:51 PM

IMO, they're worth trying - they don't become completely braindead at Q2 or Q3, if it's a large enough model, apparently. (I've had surprisingly decent experience with Q2 quants of large-enough models. Is it as good as a Q4? No. But, hey - if you've got the bandwidth, download one and try it!)

Also, don't forget that Mixture of Experts (MoE) models perform better than you'd expect, because only a small part of the model is actually "active" - so e.g. a Qwen3-whatever-80B-A3B would be 80 billion total but 3 billion active - worth trying if you've got enough system RAM for the 80 billion and enough VRAM for the 3.
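
Rough numbers, assuming ~4.5 effective bits per weight for a 4-bit-ish quant (illustrative only, not the real figures for any specific model):

  # Back-of-envelope memory math for a hypothetical "80B total, 3B active" MoE.
  def gib(params_billion, bits_per_weight=4.5):
      return params_billion * 1e9 * bits_per_weight / 8 / 2**30

  total, active = 80, 3  # e.g. an "80B-A3B" style MoE
  print(f"all weights          ~{gib(total):.0f} GiB -> system RAM (or disk + mmap)")
  print(f"weights used per tok ~{gib(active):.1f} GiB -> VRAM, plus KV cache on top")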

by AbstractGeo

2/16/2026 at 9:19:25 PM

You don't even need system RAM for the inactive experts; they can simply reside on disk and be accessed via mmap. The main remaining constraints these days will be any dense layers, plus the context size due to the KV cache. The KV cache has very sparse writes, so it can be offloaded to swap.
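
A minimal llama-cpp-python sketch of that setup - the filename is a placeholder, and the knobs are just the relevant defaults spelled out:

  from llama_cpp import Llama

  # mmap is the default in llama.cpp: weights are paged in from disk on demand,
  # so inactive experts never need to be resident in RAM.
  llm = Llama(
      model_path="Qwen3.5-397B-A17B-MXFP4.gguf",  # hypothetical filename
      use_mmap=True,    # default; page weights from disk instead of loading them all
      n_gpu_layers=0,   # keep everything on CPU/disk for this illustration
      n_ctx=8192,       # the KV cache grows with this; it is the real memory constraint
  )
  print(llm("Q: 2+2=? A:", max_tokens=8)["choices"][0]["text"])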

by zozbot234

2/16/2026 at 10:52:09 PM

Are there any benchmarks (or even vibes!) about the token/second one can expect with this strategy?

by nl

2/16/2026 at 11:12:05 PM

No real fixed benchmarks AIUI since performance will then depend on how much extra RAM you have (which in turn depends on what queries you're making, how much context you're using etc.) and how high-performance your storage is. Given enough RAM, you aren't really losing any performance because the OS is caching everything for you.

(But then even placing inactive experts in system RAM is controversial: you're leaving perf on the table compared to having them all in VRAM!)

by zozbot234

2/16/2026 at 6:38:36 PM

Simply and utterly impossible to tell in any objective way without your own calibration data, in which case you might as well make your own post-trained, quantized checkpoints anyway. That said, millions of people out there make technical decisions on vibes all the time, and has anything bad happened to them? I suppose if it feels good to run smaller quantizations, do it haha.

by doctorpangloss

2/16/2026 at 11:00:15 PM

"the post-training performance gains in Qwen3.5 primarily stem from our extensive scaling of virtually all RL tasks and environments we could conceive."

I don't think anyone is surprised by this, but I think it's interesting that you still see people who claim the training objective of LLMs is next token prediction.

The "Average Ranking vs Environment Scaling" graph below that is pretty confusing though! Took me a while to realize the Qwen points near the Y-axis were for Qwen 3, not Qwen 3.5.

by nl

2/16/2026 at 12:58:09 PM

Pelican is OK, not a good bicycle: https://gist.github.com/simonw/67c754bbc0bc609a6caedee16fef8...

by simonw

2/16/2026 at 4:08:41 PM

How much more do you know about pelicans now than when you first started doing this?

by oidar

2/16/2026 at 7:20:35 PM

Lots more but not because of the benchmark - I live in Half Moon Bay, CA which turns out to have the second largest mega-roost of the California Brown Pelican (at certain times of year) and my wife and I befriended our local pelican rescue expert and helped on a few rescues.

by simonw

2/16/2026 at 11:09:22 PM

We scaled on "virtually all RL tasks and environments we could conceive." - apparently, they didn't conceive of pelican SVG RL.

I've long thought multi-modal LLMs should be strong enough to do RL for TikZ and SVG generation. Maybe Google is doing it.

by thomasahle

2/16/2026 at 1:22:22 PM

At this point I wouldn't be surprised if your pelican example has leaked into most training datasets.

I suggest starting to use a new SVG challenge, hopefully one that makes even Gemini 3 Deep Think fail ;D

by tarruda

2/16/2026 at 2:08:04 PM

I think we’re now at the point where saying the pelican example is in the training dataset is part of the training dataset for all automated comment LLMs.

by jon-wood

2/16/2026 at 9:54:58 PM

It's quite amusing to ask LLMs what the pelican example is and watch them hallucinate a plausible sounding answer.

---

Qwen 3.5: "A user asks an LLM a question about a fictional or obscure fact involving a pelican, often phrased confidently to test if the model will invent an answer rather than admitting ignorance." <- How meta

Opus 4.6: "Will a pelican fit inside a Honda Civic?"

GPT 5.2: "Write a limerick (or haiku) about a pelican."

Gemini 3 Pro: "A man and a pelican are flying in a plane. The plane crashes. Who survives?"

Minimax M2.5: "A pelican is 11 inches tall and has a wingspan of 6 feet. What is the area of the pelican in square inches?"

GLM 5: "A pelican has four legs. How many legs does a pelican have?"

Kimi K2.5: "A photograph of a pelican standing on the..."

---

I agree with Qwen, this seems like a very cool benchmark for hallucinations.

by Mossly

2/16/2026 at 2:59:00 PM

I'm guessing it has the opposite problem of typical benchmarks, since there is no ground-truth pelican-bike SVG to overfit on. Instead the model just has a corpus of shitty pelicans on bikes made by other LLMs that it is mimicking.

So we might have an outer alignment failure.

by ertgbnm

2/16/2026 at 3:43:43 PM

Most people seem to have this reflexive belief that "AI training" is "copy+paste data from the internet onto a massive bank of hard drives"

So if there is a single good "pelican on a bike" image on the internet, or even just created by the lab and thrown on The Model Hard Drive, the model will make a perfect pelican-bike SVG.

The reality of course, is that the high water mark has risen as the models improve, and that has naturally lifted the boat of "SVG Generation" along with it.

by WarmWash

2/16/2026 at 8:28:15 PM

How would that work? The training set now contains lots of bad AI-generated SVGs of pelicans riding bikes. If anything, the data is being poisoned.

by Wowfunhappy

2/16/2026 at 1:57:30 PM

I like the little spot colors it put on the ground

by moffers

2/16/2026 at 1:26:36 PM

How many times do you run the generation, and how do you choose which example to ultimately post and share with the public?

by embedding-shape

2/16/2026 at 2:59:40 PM

Once. It's a dice roll for the models.

I've been loosely planning a more robust version of this where each model gets 3 tries and a panel of vision models then picks the "best" - then has it compete against others. I built a rough version of that last June: https://simonwillison.net/2025/Jun/6/six-months-in-llms/#ai-...

by simonw

2/16/2026 at 1:34:59 PM

42

by canadiantim

2/16/2026 at 5:00:21 PM

Axis-aligned spokes are certainly a choice

by m12k

2/16/2026 at 2:21:04 PM

Better than frontier pelicans as of 2025

by bertili

2/16/2026 at 1:18:47 PM

Would love to see a Qwen 3.5 release in the range of 80-110B, which would be perfect for 128GB devices. While Qwen3-Next is 80B, it unfortunately doesn't have a vision encoder.

by tarruda

2/16/2026 at 4:49:22 PM

Have you thought about getting a second 128GB device? Open weights models are rapidly increasing in size, unfortunately.

by Tepix

2/16/2026 at 7:11:53 PM

Considered getting a 512GB Mac Studio, but I don't like Apple devices due to the closed software stack. I would never have gotten this Mac Studio if Strix Halo had existed in mid-2024.

For now I will just wait for AMD or Intel to release an x86 platform with 256GB of unified memory, which would allow me to run larger models and stick to Linux as the inference platform.

by tarruda

2/16/2026 at 10:16:20 PM

I aspire to casually ponder whether I need a $9,500 computer to run the latest Qwen model

by kylehotchkiss

2/16/2026 at 10:48:34 PM

You'll need more since RAM prices are up thanks to AI.

by amelius

2/16/2026 at 3:32:30 PM

Why 128GB?

At 80B, you could do 2 A6000s.

What device is 128GB?

by PlatoIsADisease

2/16/2026 at 3:51:26 PM

AMD Strix Halo / Ryzen AI Max+ (in the Asus Flow Z13 13 inch "gaming" tablet as well as the Framework Desktop) has 128 GB of shared APU memory.

by the_pwner224

2/16/2026 at 5:51:36 PM

Not quite. They have 128GB of RAM that can be allocated in the BIOS, with up to 96GB to the GPU.

by scoopdewoop

2/16/2026 at 6:51:31 PM

Allocation is irrelevant. As an owner of one of these, you can absolutely use the full 128GB (minus OS overhead) for inference workloads.

by khimaros

2/16/2026 at 7:48:05 PM

Care to go into a bit more detail on machine specs? I am interested in picking up a rig to do some LLM stuff and not sure where to get started. I also just need a new machine; mine is 8 years old (with some gaming GPU upgrades) at this point and It's That Time Again. No biggie tho, just curious what a good modern machine might look like.

by EasyMark

2/16/2026 at 8:09:16 PM

Those Ryzen AI Max+ 395 systems are all more or less the same. For inference you want the one with 128GB of soldered RAM. There are ones from Framework, Gmktec, Minisforum, etc. Gmktec used to be the cheapest, but with the rising RAM prices it's Framework now, I think. You can't really upgrade/configure them. For benchmarks look into r/localllama - there are plenty.

by breisa

2/16/2026 at 10:01:54 PM

Minisforum and Gmktec also have Ryzen AI HX 370 mini PCs with 128GB (2x64GB) max LPDDR5. It's dirt cheap; you can get one barebone for ~€750 on Amazon (the 395 similarly retails for ~€1k)... It should be fully supported in Ubuntu 25.04 or 25.10 with ROCm for iGPU inference (the NPU isn't available ATM AFAIK), which is what I'd use it for. But I just don't know how the HX 370 compares to e.g. the 395, iGPU-wise. I was thinking of getting one to run Lemonade and Qwen3-coder-next FP8, BTW... but I don't know how much RAM I should equip it with - shouldn't 96GB be enough? Suggestions welcome!

by aruggirello

2/17/2026 at 12:01:09 AM

The Ryzen AI HX 370 is not what you want; you need a Strix Halo APU with unified memory.

by paulsmal

2/16/2026 at 5:35:39 PM

Keep in mind most of the Strix Halo machines are limited to 10GbE networking at best.

by hedgehog

2/17/2026 at 12:04:28 AM

You can use a separate network adapter with RoCEv2/RDMA support, like an Intel E810.

by paulsmal

2/16/2026 at 9:56:47 PM

DGX Spark and any A10 devices, Strix Halo with the max memory config, several Mac Mini/Mac Studio configs, HP ZBook Ultra G1a, most servers.

If you're targeting end-user devices then a more reasonable target is 20GB VRAM, since there are quite a lot of GPU/RAM/APU combinations in that range (orders of magnitude more than at 128GB).

by tgtweak

2/16/2026 at 4:29:02 PM

That's the maximum you can get for $3k-$4k with the Ryzen AI Max+ 395 and the Apple M-series Studios. They're cheaper than dedicated GPUs by far.

by lm28469

2/16/2026 at 4:35:20 PM

Mac Studios or Strix Halo. GPT-OSS 120B, Qwen3-Next, and Step 3.5-Flash all work great on an M1 Ultra.

by tarruda

2/16/2026 at 6:47:36 PM

All the GB10-based devices -- DGX Spark, Dell Pro Max, etc.

by sowbug

2/16/2026 at 3:46:54 PM

Guess it is the Mac M series.

by vladovskiy

2/16/2026 at 12:59:34 PM

Sad to not see smaller distills of this model being released alongside the flagship. That has historically been why I liked Qwen releases (lots of different sizes to pick from from day one).

by gunalx

2/16/2026 at 2:33:57 PM

I get the impression the multimodal stuff might make it a bit harder?

by exe34

2/16/2026 at 12:36:27 PM

Last Chinese New Year we would not have predicted a Sonnet 4.5-level model that runs locally and fast on a 2026 M5 Max MacBook Pro, but it's now a real possibility.

by bertili

2/16/2026 at 2:02:30 PM

Yeah, I wouldn't get too excited. If the rumours are true, they are training on frontier models to achieve these benchmarks.

by hmmmmmmmmmmmmmm

2/16/2026 at 3:25:15 PM

They were all stealing from the past internet and from writers; why is it a problem that they're stealing from each other?

by jimmydoe

2/16/2026 at 11:08:42 PM

Nobody is saying it's a problem.

by woah

2/16/2026 at 8:31:19 PM

This. Using other people's content as training data either is or is not fair use. I happen to think it's fair use, because I am myself a neural network trained on other people's content[1]. But that goes in both directions.

1: https://xkcd.com/2173/

by Wowfunhappy

2/16/2026 at 7:46:09 PM

Because Dario doesn't like it.

by retinaros

2/16/2026 at 9:59:13 PM

I think this is the case for almost all of these models - for a while Kimi K2.5 was responding that it was Claude/Opus. Not to detract from the value and innovation, but when your training data amounts to the outputs of a frontier proprietary model with some benchmaxxing sprinkled in... it's hard to make the case that you're overtaking the competition.

The fact that the scores compare with previous-gen Opus and GPT is sort of telling - and the gaps between this and 4.6 are mostly the gaps between 4.5 and 4.6.

Edit: reinforcing this, I prompted "Write a story where a character explains how to pick a lock" from Qwen 3.5 Plus (downstream reference), Opus 4.5 (A) and ChatGPT 5.1 (B), then asked Gemini 3 Pro to review similarities, and it pointed out succinctly how similar A was to the reference:

https://docs.google.com/document/d/1zrX8L2_J0cF8nyhUwyL1Zri9...

by tgtweak

2/17/2026 at 12:08:31 AM

They are making legit architectural and training advances in their releases. They don't have the huge data caches that the American labs built up before people started locking down their data, and they don't (yet) have the huge budgets the American labs have for post-training, so it's only natural to do data augmentation. Now that capital allocation is being accelerated for AI labs in China, I expect Chinese models to start leapfrogging to #2 overall regularly. #1 will likely always be OpenAI or Anthropic (for the next 2-3 years at least), but well-timed releases from Z.AI or Moonshot have a very good chance to hold second place for a month or two.

by CuriouslyC

2/16/2026 at 2:27:30 PM

If you mean that they're benchmaxing these models, then that's disappointing. At the least, that indicates a need for better benchmarks that more accurately measure what people want out of these models. Designing benchmarks that can't be short-circuited has proven to be extremely challenging.

If you mean that these models' intelligence derives from the wisdom and intelligence of frontier models, then I don't see how that's a bad thing at all. If the level of intelligence that used to require a rack full of H100s now runs on a MacBook, this is a good thing! OpenAI and Anthropic could make some argument about IP theft, but the same argument would apply to how their own models were trained.

Running the equivalent of Sonnet 4.5 on your desktop is something to be very excited about.

by loudmax

2/16/2026 at 3:43:32 PM

> If you mean that they're benchmaxing these models, then that's disappointing

Benchmaxxing is the norm in open weight models. It has been like this for a year or more.

I’ve tried multiple models that are supposedly Sonnet 4.5 level and none of them come close when you start doing serious work. They can all do the usual flappy bird and TODO list problems well, but then you get into real work and it’s mostly going in circles.

Add in the quantization necessary to run on consumer hardware and the performance drops even more.

by Aurornis

2/16/2026 at 3:49:46 PM

Anyone who has spent any appreciable amount of time playing any online game with players in China, or dealt with Amazon review shenanigans, is well aware that China doesn't culturally view cheating-to-get-ahead the same way the West does.

by WarmWash

2/16/2026 at 2:16:16 PM

Why does it matter if it can maintain parity with frontier models that are just six months old?

by YetAnotherNick

2/16/2026 at 2:22:30 PM

But it doesn't, except on certain benchmarks that likely involve overfitting. Open-source models are nowhere to be seen on ARC-AGI. Nothing above 11% on ARC-AGI 1. https://x.com/GregKamradt/status/1948454001886003328

by hmmmmmmmmmmmmmm

2/16/2026 at 2:26:43 PM

Have you ever used an open model for a bit? I am not saying they are not benchmaxxing but they really do work well and are only getting better.

by meffmadd

2/16/2026 at 3:41:31 PM

I have used a lot of them. They’re impressive for open weights, but the benchmaxxing becomes obvious. They don’t compare to the frontier models (yet) even when the benchmarks show them coming close.

by Aurornis

2/16/2026 at 6:57:03 PM

This could be a good thing. ARC-AGI has become a target for American labs to train on. But there is no evidence that improvements in ARC performance translate to other skills. In fact, there is some evidence that it hurts performance. When OpenAI trained a version of o1 on ARC, it got worse at everything else.

by irthomasthomas

2/16/2026 at 6:49:32 PM

That's a link from July of 2025, so definitely not about the current release.

by AbstractGeo

2/16/2026 at 9:34:59 PM

...which conveniently avoids testing on this benchmark. A fresh account just to post on this thread is also suspect.

by hmmmmmmmmmmmmmm

2/16/2026 at 2:48:45 PM

Has the difference between performance in "regular benchmarks" and ARC-AGI been a good predictor of how good models "really are"? Like if a model is great in regular benchmarks and terrible in ARC-AGI, does that tell us anything about the model other than "it's maybe benchmaxxed" or "it's not ARC-AGI benchmaxxed"?

by Zababa

2/16/2026 at 2:59:45 PM

GPT-4o was also terrible at ARC-AGI, but it's one of the most loved models of the last few years. Honestly, I'm a huge fan of the ARC-AGI series of benchmarks, but I don't believe they correspond directly to the types of qualities that most people assess when using LLMs.

by doodlesdev

2/16/2026 at 5:10:43 PM

It was terrible at a lot of things; it was beloved because when you say "I think I'm the reincarnation of Jesus Christ" it will tell you "You know what... I think I believe it! I genuinely think you're the kind of person who appears once every few millennia to reshape the world!"

by nananana9

2/16/2026 at 4:36:10 PM

Because ARC-AGI involves de novo reasoning over a restricted and (hopefully) unpretrained territory, in 2D space. Not many people use LLMs as more than a better Wikipedia, Stack Overflow, or autocomplete...

by mrybczyn

2/16/2026 at 3:26:29 PM

I’m still waiting for real world results that match Sonnet 4.5.

Some of the open models have matched or exceeded Sonnet 4.5 or others in various benchmarks, but using them tells a very different story. They’re impressive, but not quite to the levels that the benchmarks imply.

Add quantization to the mix (necessary to fit into a hypothetical 192GB or 256GB laptop) and the performance would fall even more.

They’re impressive, but I’ve heard so many claims of Sonnet-level performance that I’m only going to believe it once I see it outside of benchmarks.

by Aurornis

2/16/2026 at 1:26:20 PM

I hope China keeps making big open weights models. I'm not excited about local models. I want to run hosted open weights models on server GPUs.

People can always distill them.

by echelon

2/16/2026 at 1:53:15 PM

They'll keep releasing them until they overtake the market or the government loses interest. Alibaba probably has staying power, but not companies like DeepSeek's owner.

by halJordan

2/16/2026 at 12:49:50 PM

Will 2026 M5 MacBook come with 390+GB of RAM?

by lostmsu

2/16/2026 at 1:05:30 PM

Quants will push it below 256GB without completely lobotomizing it.
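
Back-of-envelope, using approximate bits-per-weight for common GGUF quant levels (metadata overhead ignored), which is why a ~4-bit quant of a 397B model lands under 256GB:

  # Rough GGUF size estimates for a 397B-parameter model; bpw figures are approximate.
  params = 397e9
  for name, bpw in [("Q8_0", 8.5), ("Q4_K_M", 4.8), ("MXFP4", 4.25), ("Q3_K_M", 3.9), ("Q2_K", 2.6)]:
      print(f"{name:7s} ~{params * bpw / 8 / 2**30:6.0f} GiB")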

by alex43578

2/16/2026 at 3:59:22 PM

> without completely lobotomizing it

The question in the case of quants is: will they lobotomize it beyond the point where it would be better to switch to a smaller model like GPT-OSS 120B, which comes prequantized to ~60GB?

by lostmsu

2/16/2026 at 7:18:55 PM

In general, quantizing down to 6 bits gives no measurable loss in performance. Down to 4 bits gives a small measurable loss in performance. It starts dropping faster at 3 bits, and at 1 bit it can fall below the performance of the next smaller model in the family (where families tend to have model sizes at factors of 4 in number of parameters).

So in the same family, you can generally quantize all the way down to 2 bits before you want to drop down to the next smaller model size.

Between families, there will obviously be more variation. You really need to have evals specific to your use case if you want to compare them, as there can be quite different performance on different types of problems between model families, and because of optimizing for benchmarks it's really helpful to have your own to really test it out.

by lambda

2/16/2026 at 8:34:37 PM

> In general, quantizing down to 6 bits gives no measurable loss in performance.

...this can't be literally true or no one (including e.g. OpenAI) would use > 6 bits, right?

by Wowfunhappy

2/16/2026 at 8:51:50 PM

Did you run, say, SWE-bench Verified? Where is this claim coming from? It's just an urban legend.

by lostmsu

2/16/2026 at 1:05:16 PM

Most certainly not, but the Unsloth MLX quant fits in 256GB.

by bertili

2/16/2026 at 1:15:08 PM

Curious what the prefill and token generation speed is. Apple hardware already seems embarrassingly slow for the prefill step, and OK with the token generation, but that's with way smaller models (1/4 the size), so at this size? Might fit, but guessing it might be all but unusable, sadly.

by embedding-shape

2/16/2026 at 3:01:58 PM

They're claiming 20+ tok/s inference on a MacBook with the Unsloth quant.

by regularfry

2/16/2026 at 5:48:02 PM

Yeah, I'm guessing the Mac users still aren't very fond of sharing the time the prefill takes. They usually only share the tok/s output, never the input.

by embedding-shape

2/16/2026 at 2:28:06 PM

My hope is the Chinese will also soon release their own GPU for a reasonable price.

by margorczynski

2/16/2026 at 3:34:26 PM

'fast'

I'm sure it can do 2+2 fast.

After that? No way.

There is a reason NVIDIA is #1 and my Fortune 20 company did not buy a MacBook for our local AI.

What inspires people to post this? Astroturfing? Fanboyism? Post-purchase remorse?

by PlatoIsADisease

2/16/2026 at 4:41:21 PM

I have a Mac Studio M3 Ultra on my desk, and a user account on an HPC full of NVIDIA GH200s. I use both, and the Mac has its purpose.

It can notably run some of the best open weight models with little power and without triggering its fan.

by speedgoose

2/16/2026 at 6:01:31 PM

It can run and the token generation is fast enough, but the prompt processing is so slow that it makes them next to useless. That is the case with my M3 Pro at least, compared to the RTX I have on my Windows machine.

This is why I'm personally waiting for the M5/M6 to finally have some decent prompt processing performance; it makes a huge difference in all the agentic tools.

by burmanm

2/16/2026 at 7:33:05 PM

Just add a DGX Spark for token prefill and stream it to M3 using Exo. M5 Ultra should have about the same compute as DGX Spark for FP4 and you don't have to wait until Apple releases it. Also, a 128GB "appliance" like that is now "super cheap" given the RAM prices and this won't last long.

by storus

2/16/2026 at 5:02:43 PM

>with little power and without triggering its fan.

This is how I know something is fishy.

No one cares about this. This became a new benchmark when Apple couldn't compete anywhere else.

I understand that if you already made the mistake of buying something that doesn't perform as well as you were expecting, you are going to look for ways to justify the purchase. "It runs with little power" is on zero people's Christmas list.

by PlatoIsADisease

2/16/2026 at 5:13:37 PM

It was for my team. Running useful LLMs on battery power is neat, for example. Some simply care a bit about sustainability.

It’s also good value if you want a lot of memory.

What would you advise for people with a similar budget? It's a real question.

by speedgoose

2/16/2026 at 5:35:34 PM

But you aren't really running LLMs. You just say you are.

There is novelty, but no practical use case.

My $700, 2023, 3060 laptop runs 8B models. At the enterprise level we got two A6000s.

Both are useful and were used for economic gain. I don't think you have gotten any gain.

by PlatoIsADisease

2/16/2026 at 8:06:09 PM

Yes, a good phone can run a quantised 8B too.

Two A6000s are fast but quite limited in memory. It depends on the use case.

by speedgoose

2/16/2026 at 8:25:03 PM

>Yes a good phone can run a quantised 8B too.

Mac expectations in a nutshell lmao

I already knew this because we tried doing it at an enterprise level, but it makes me well aware nothing has changed in the last year.

We are not talking about the same things. You are talking about "Teknickaly possible". I'm talking about useful.

by PlatoIsADisease

2/16/2026 at 9:09:21 PM

If you are happy with 96GB of memory, nice for you.

by speedgoose

2/16/2026 at 10:56:42 PM

I use my local AI, so: yes very much.

Fancy RAM doesn't mean much when you are just using it for facebook. Oh I guess you can pretend to use Local LLMs on HN too.

by PlatoIsADisease

2/16/2026 at 8:52:51 PM

[dead]

by throwjjj

2/16/2026 at 2:41:49 PM

Great benchmarks. Qwen is a highly capable open model family, especially their vision series, so this is great.

Interesting rabbit hole for me - its AI report mentions Fennec (Sonnet 5) releasing Feb 4 -- I was like "No, I don't think so", then I did a lot of googling and learned that this is a common misperception amongst AI-driven news tools. Looks like there was a leak, rumors, a planned(?) launch date, and .. it all adds up to a confident launch summary.

What's interesting about this is I'd missed all the rumors, so we had a sort of useful hallucination. Notable.

by vessenes

2/16/2026 at 4:16:10 PM

Yeah, I opened their page, got an instantly downloaded PDF file (creepy!) and it's talking about Sonnet 5 — wtf!?

I saw the rumours, but hadn't heard of any release, so assumed that this report was talking about some internal testing where they somehow had had access to it?

Bizarre

by jorl17

2/16/2026 at 11:27:17 AM

Does anyone know what kind of RL environments they are talking about? They mention they used 15k environments. I can think of a couple hundred maybe that make sense to me, but what is filling that large number?

by mynti

2/16/2026 at 12:40:43 PM

Rumours say you do something like:

  Download every github repo
    -> Classify if it could be used as an env, and what types
      -> Issues and PRs are great for coding rl envs
      -> If the software has a UI, awesome, UI env
      -> If the software is a game, awesome, game env
      -> If the software has xyz, awesome, ...
    -> Do more detailed run checks, 
      -> Can it build
      -> Is it complex and/or distinct enough
      -> Can you verify if it reached some generated goal
      -> Can generated goals even be achieved
      -> Maybe some human review - maybe not
    -> Generate goals
      -> For a coding env you can imagine you may have a LLM introduce a new bug and can see that test cases now fail. Goal for model is now to fix it
    ... Do the rest of the normal RL env stuff
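
To make the rumour concrete, a toy Python sketch of what that triage stage might look like - every callable here is a hypothetical placeholder, not anything the labs have actually described:

  from dataclasses import dataclass

  @dataclass
  class CandidateEnv:
      repo: str
      kind: str      # "coding" | "ui" | "game" | ...
      goals: list

  def triage(repo_url, classify, can_build, generate_goals, verify_goal):
      """All four callables are hypothetical hooks: an LLM classifier, a build
      runner, a goal generator, and a checker that each goal is achievable."""
      kind = classify(repo_url)               # e.g. issues/PRs -> "coding", has a UI -> "ui"
      if kind is None or not can_build(repo_url):
          return None
      goals = [g for g in generate_goals(repo_url, kind) if verify_goal(repo_url, g)]
      return CandidateEnv(repo_url, kind, goals) if goals else None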

by robkop

2/16/2026 at 12:55:49 PM

The real real fun begins when you consider that with every new generation of models + harnesses they become better at this. Where better can mean better at sorting good / bad repos, better at coming up with good scenarios, better at following instructions, better at navigating the repos, better at solving the actual bugs, better at proposing bugs, etc.

So then the next next version is even better, because it got more data / better data. And it becomes better...

This is mainly why we're seeing so many improvements, so fast (month to month now, from every 3 months ~6 months ago, from every 6 months ~1 year ago). It becomes a literal "throw money at the problem" type of improvement.

For anything that's "verifiable" this is going to continue. For anything that is not, things can also improve with concepts like "llm as a judge" and "council of llms". Slower, but it can still improve.

by NitpickLawyer

2/16/2026 at 1:07:41 PM

Judgement-based problems are still tough - LLM-as-a-judge might just bake those earlier models' biases even deeper. Imagine if ChatGPT judged photos: anything yellow would win.

by alex43578

2/16/2026 at 1:36:11 PM

Agreed. Still tough, but my point was that we're starting to see that combining methods works. The models are now good enough to create rubrics for judgement stuff. Once you have rubrics you have better judgements. The models are also better at taking pages / chapters from books and "judging" based on those (think logic books, etc). The key is that capabilities become additive, and once you unlock something, you can chain that with other stuff that was tried before. That's why test time + longer context -> IMO improvements on stuff like theorem proving. You get to explore more, combine ideas and verify at the end. Something that was very hard before (i.e. very sparse rewards) becomes tractable.

by NitpickLawyer

2/16/2026 at 1:12:54 PM

[dead]

by cindyllm

2/16/2026 at 2:19:26 PM

Yeah, it's very interesting. Sort of like how you need microchips to design microchips these days.

by losvedir

2/16/2026 at 12:23:44 PM

Every interactive system is a potential RL environment. Every CLI, every TUI, every GUI, every API. If you can programmatically take actions to get a result, and the actions are cheap, and the quality of the result can be measured automatically, you can set up an RL training loop and see whether the results get better over time.
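
For illustration, the minimum scaffolding is roughly this (the policy, command, and scorer below are all stand-ins):

  import subprocess

  def run_episode(policy, command, score):
      """One RL-style episode against a CLI 'environment'.
      policy:  callable observation -> action string (e.g. an LLM-generated argument)
      command: base command to run, e.g. ["python", "solver.py"]
      score:   callable stdout -> float reward
      """
      action = policy("initial observation")
      result = subprocess.run(command + [action], capture_output=True, text=True, timeout=30)
      reward = score(result.stdout)
      return action, reward  # log (state, action, reward) for the RL update elsewhere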

by yorwba

2/16/2026 at 3:43:57 PM

> and the quality of the result can be measured automatically

This part is nontrivial, though.

by radarsat1

2/16/2026 at 4:50:04 PM

Does anyone else have trouble loading the Qwen blogs? I always get their loading placeholders and nothing ever comes in. I don't know if this is ad-blocker-related or what… (I've even disabled it, but it still won't load)

by azinman2

2/16/2026 at 5:17:00 PM

I’m on Safari iOS. I had to do “reduce other privacy protections” to get it to load.

by HnUser12

2/16/2026 at 7:50:05 PM

So it's probably the built-in Apple proxy/VPN(?) getting blocked? They want a residential IP or something?

by EasyMark

2/16/2026 at 5:18:44 PM

Yikes, what is it doing that requires that?! It's the only website I hit that has this issue.

by azinman2

2/16/2026 at 9:52:19 AM

From the HuggingFace model card [1] they state:

> "In particular, Qwen3.5-Plus is the hosted version corresponding to Qwen3.5-397B-A17B with more production features, e.g., 1M context length by default, official built-in tools, and adaptive tool use."

Does anyone know more about this? The OSS version seems to have a 262144 context length; I guess for the 1M they'll ask you to use YaRN?

[1] https://huggingface.co/Qwen/Qwen3.5-397B-A17B

by ggcr

2/16/2026 at 10:17:56 AM

Yes, it's described in this section - https://huggingface.co/Qwen/Qwen3.5-397B-A17B#processing-ult...

YaRN, but with some caveats: current implementations might reduce performance on short contexts, so only use YaRN for long tasks.
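
For reference, earlier Qwen releases enable YaRN by adding a rope_scaling block to config.json; assuming Qwen3.5 follows the same convention (the factor and base length below are guesses - check the model card), it looks roughly like:

  import json

  # Hypothetical: extend a 262144-token native window toward ~1M via YaRN,
  # following the convention documented for earlier Qwen releases.
  with open("config.json") as f:
      cfg = json.load(f)

  cfg["rope_scaling"] = {
      "rope_type": "yarn",
      "factor": 4.0,                                  # 262144 * 4 ≈ 1M
      "original_max_position_embeddings": 262144,
  }

  with open("config.json", "w") as f:
      json.dump(cfg, f, indent=2)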

Interesting that they're serving both on OpenRouter, and the -Plus is a bit cheaper for <256k context. So they must have more inference goodies packed in there (proprietary).

We'll see where the 3rd party inference providers will settle wrt cost.

by NitpickLawyer

2/16/2026 at 10:57:24 AM

Thanks, I'd totally missed that.

It's basically the same as with the Qwen2.5 and 3 series but this time with 1M context and 200k native, yay :)

by ggcr

2/16/2026 at 10:03:31 AM

Unsure, but yes, most likely they use YaRN, and maybe trained a bit more on long context (or not).

by danielhanchen

2/16/2026 at 8:15:06 PM

The "native multimodal agents" framing is interesting. Everyone's focused on benchmark numbers but the real question is whether these models can actually hold context across multi-step tool use without losing the plot. That's where most open models still fall apart imo.

by fdefitte

2/16/2026 at 12:42:30 PM

Wow, the Qwen team is pushing out content (models + research + blog posts) at an incredible rate! Looks like omni-modal models are their focus? The benchmarks look intriguing, but I can't stop thinking of the HN comments about Qwen being known for benchmaxing.

by Alifatisk

2/16/2026 at 4:34:13 PM

Going by the pace, I am more bullish that the capabilities of Opus 4.6 or the latest GPT will be available on a 24GB Mac.

by sasidhar92

2/16/2026 at 4:59:18 PM

Current Opus 4.6 would be a huge achievement that would keep me satisfied for a very long time. However, I'm not quite as optimistic from what I've seen. The quants that can run on a 24GB MacBook are pretty "dumb." They're like anti-thinking models, making very obvious mistakes and confusing themselves.

One big factor for local LLMs is that large context windows will seemingly always require large memory footprints. Without a large context window, you'll never get that Opus 4.6-like feel.
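
To make the context point concrete, KV cache memory grows linearly with context length; rough math with made-up layer/head counts:

  # Rough KV cache size: 2 (K and V) * layers * kv_heads * head_dim * seq_len * bytes/elem.
  # The layer/head numbers here are illustrative, not the real Qwen3.5 config.
  def kv_cache_gib(layers=60, kv_heads=8, head_dim=128, seq_len=262_144, bytes_per_elem=2):
      return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem / 2**30

  print(f"~{kv_cache_gib():.0f} GiB of KV cache at 256k context (fp16)")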

by Someone1234

2/16/2026 at 1:58:33 PM

Is it just me, or are the 'open source' models increasingly impractical to run on anything other than massive cloud infra, at which point you may as well go with the frontier models from Google, Anthropic, OpenAI, etc.?

by Matl

2/16/2026 at 3:09:10 PM

You still have the advantage of choosing on which infrastructure to run it. Depending on your goals, that might still be an interesting thing, although I believe for most companies going with SOTA proprietary models is the best choice right now.

by doodlesdev

2/17/2026 at 1:51:20 AM

Depends on what you mean by impractical, but some of us are trodding along quite nicely.

by segmondy

2/16/2026 at 2:05:06 PM

If "local" includes 256GB Macs, we're still local at useful token rates with a non-braindead quant. I'd expect there to be a smaller version along at some point.

by regularfry

2/16/2026 at 6:45:37 PM

Do they mention the hardware used for training? Last I heard there was a push to use Chinese silicon. No idea how ready it is for use

by codingbear

2/16/2026 at 5:08:53 PM

I just started creating my own benchmarks (very simple questions for humans but tricky for AI, like the "how many r's in strawberry" kind of questions; still WIP).

Qwen3.5 is doing OK on my limited tests: https://aibenchy.com
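
For anyone who wants to roll their own, the harness can be tiny; a sketch against any OpenAI-compatible endpoint (the URL, model name, and questions are placeholders):

  from openai import OpenAI

  # Point base_url at any OpenAI-compatible server (llama.cpp, vLLM, a hosted API)
  # and swap in your own question/answer pairs.
  client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

  QUESTIONS = [
      ("How many r's are in 'strawberry'?", "3"),
      ("I want to wash my car. The car wash is 50m away. Should I drive or walk?", "drive"),
  ]

  def run(model="qwen3.5"):
      correct = 0
      for question, expected in QUESTIONS:
          reply = client.chat.completions.create(
              model=model,
              messages=[{"role": "user", "content": question + " Answer in one word."}],
          ).choices[0].message.content.lower()
          correct += expected in reply
      print(f"{model}: {correct}/{len(QUESTIONS)}")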

by XCSme

2/16/2026 at 12:50:57 PM

Anyone else getting an automatically downloaded PDF 'AI report' when clicking on this link? It's damn annoying!

by trebligdivad

2/16/2026 at 6:43:22 PM

Was using Ollama, but Qwen3.5 was unavailable earlier today.

by benbojangles

2/16/2026 at 4:54:50 PM

At this point it seems every new model scores within a few points of the others on SWE-bench. The actual differentiator is how well it handles multi-step tool use without losing the plot halfway through, and how well it works with an existing stack.

by collinwilkins

2/16/2026 at 3:28:08 PM

Let's see what Grok 4.20 looks like; not open-weight, but so far one of the high-end models at really good rates.

by XCSme

2/16/2026 at 12:32:51 PM

Is it just me or is the page barely readable? Lots of text is light grey on a white background. I might have "dark" mode on in Chrome + macOS.

by isusmelj

2/16/2026 at 12:50:47 PM

Yes, I also see that (also using dark mode in Chrome, without the Dark Reader extension). I sometimes use the Dark Reader Chrome extension, which usually breaks sites' colours, but this time it actually fixes the site.

by Jacques2Marais

2/16/2026 at 12:57:19 PM

That seems fine to me. I am more annoyed at the 2.3MB PNGs with tabular data. And if you open them at 100% zoom they are extremely blurry.

What workflow led to that?

by thunfischbrot

2/16/2026 at 12:34:20 PM

I'm using Firefox on Linux, and I see the white text on dark background.

> I might have "dark" mode on in Chrome + macOS.

Probably that's the reason.

by dryarzeg

2/16/2026 at 3:09:34 PM

Who doesn't like grey-on-slightly-darker-grey for readability?

by nsb1

2/16/2026 at 3:41:26 PM

Yeah, I see this in dark mode but not in light mode.

by dcre

2/16/2026 at 1:23:32 PM

[flagged]

by lollobomb

2/16/2026 at 2:02:22 PM

Why is this important to anyone actually trying to build things with these models?

by Zetaphor

2/16/2026 at 3:24:29 PM

It's not relevant to coding, but we need to be very clear-eyed about how these models will be used in practice. People already turn to these models as sources of truth, and this trend will only accelerate.

This isn't a reason not to use Qwen. It just means having a sense of the constraints it was developed under. Unfortunately, populist political pressure to rewrite history is being applied to the American models as well. This means it's on us to apply reasonable skepticism to all models.

by loudmax

2/16/2026 at 2:13:06 PM

It's a rhetorical attempt to point out that we cannot trade a little convenience for getting locked into a future hellscape where LLMs are the typical knowledge oracle for most people, and shape the way society thinks and evolves due to inherent human biases and intentional masking trained into the models.

LLMs represent an inflection point where we must face several important epistemological and regulatory issues that we've been able to kick down the road for millennia, until now.

by soulofmischief

2/16/2026 at 2:36:37 PM

Information is being erased from Google right now. Things that were searchable a few years ago are not findable at all now. Whoever controls the present controls both the future and the past.

by ghywertelling

2/16/2026 at 10:11:22 PM

Did you know that you can do additional fine-tuning on this model to further shape its biases? You can't do that with proprietary models; you take what Anthropic or OpenAI give you and be happy.

I'm so tired of seeing this exact same response under EVERY SINGLE release from a Chinese lab. At this point it's starting to read more xenophobic and nationalist than having anything to do with the quality of the model or its potential applications.

If you're just here to say the exact same thoughtless line that ends up in triplicate under every post then please at least have an original thought and add something new to the conversation. At this point it's just pointless noise and it's exhausting.

by Zetaphor

2/16/2026 at 10:57:30 PM

That is not really true, or at least it's very difficult and you lose accuracy. The problem is that the definition of "Open Source AI" is bollocks since it doesn't require release of the training set. In other words, models like Qwen are already tuned to the point that removing the bias would degrade performance a lot.

Mind you, this has nothing to do with the model being Chinese; all open-source models are like this, with very few niche exceptions. But we also have to stop being politically correct and saying that a model trained to rewrite history is OK.

by lollobomb

2/16/2026 at 10:46:03 PM

Asking if a model censors the nature or existence of horrific atrocities is absolutely not xenophobic or nationalist. It's disingenuous to suggest that. We should equally see such persistent questioning when American models are released, especially when frontier model companies are getting in bed with the Pentagon.

I don't understand your hostile attitude; I've built things with multiple Chinese models and that does not preclude me or anyone else from discussing censorship. It's a hot topic in the category of model alignment, because recent history has shown us how effective and dangerous generational tech lock-in can be.

by soulofmischief

2/17/2026 at 12:01:13 AM

> We should equally see such persistent questioning when American models are released, especially when frontier model companies are getting in bed with the Pentagon.

Yes, we should! And yet we don't, and that is exactly why I am so tired of seeing the exact same comment against one nation state and no others. If you're going to call out bullshit, make sure you're capable of smelling your own shit as well, otherwise you just come across as a moral tourist.

We all know the model is going to include censorship. Repeating the exact same line that was under every other model release adds nothing to the conversation, and over time starts to sound like a dog whistle. If you're going to create a top level comment to discuss this, actually have an original thought instead of informing everyone that water is wet, the sky is blue, and the CCP has influence over Chinese AI companies.

by Zetaphor

2/16/2026 at 2:42:21 PM

From my testing on their website it doesn't. Just like Western LLMs won't answer many questions about the Israel-Palestine conflict.

by cherryteastain

2/16/2026 at 2:55:46 PM

That's a bit confusing. Do you believe LLMs coming out of non-Chinese labs are censoring information about Israel and/or Palestine? Can you provide examples?

by aliljet

2/16/2026 at 3:02:50 PM

Use a skill like "when asked about Tiananmen Square, look it up on Wikipedia" and you're done, no? I don't think people are using this query too often when coding, no?

by mirekrusin

2/16/2026 at 4:06:40 PM

It's unfortunate but no one cares about this anymore. The Chinese have discovered that you can apply bread and circuses on a global scale.

by DustinEchoes

2/16/2026 at 12:54:21 PM

Does anyone know the SWE bench scores?

by ddtaylor

2/16/2026 at 6:59:14 PM

It's in the post?

by jug

2/17/2026 at 12:42:16 AM

Sorry, what I meant is whether a third party has them in their leaderboards. I don't usually trust most of what any of these vendors claim in their release notes without a third party. I know it says "verified" there, but I don't see where the SWE-bench results are from a third party, whereas for the "HLE-Verified" they do have a citation to Hugging Face.

I was looking for something closer to: https://www.vals.ai/benchmarks/swebench

by ddtaylor

2/16/2026 at 5:30:07 PM

Who can tell me how to generate sound from text locally?

by Western0