Step 3.5 Flash – Open-source foundation model, supports deep reasoning at speed

2/19/2026 at 12:45:46 PM

This is probably one of the most underrated LLMs releases in the past few months. In my local testing with a 4-bit quant (https://huggingface.co/ubergarm/Step-3.5-Flash-GGUF/tree/mai...), it surpasses every other LLM I was able to run locally, including Minimax 2.5 and GLM-4.7, though I was only able to run GLM with a 2-bit quant. Some highlights:

- Very context efficient: SWA by default, on a 128G mac I can run the full 256k context or two 128k context streams. - Good speeds on macs. On my M1 Ultra I get 36 t/s tg and 300 t/s pp. Also, these speeds degrade very slowly as context increases: At 100k prefill, it has 20 t/s tg and 129 t/s pp. - Trained for agentic coding. I think it is trained to be compatible with claude code, but it works fine with other CLI harnesses except for Codex (due to the patch edit tool which can confuse it).

This is the first local LLM in the 200B parameter range that I find to be usable with a CLI harness. Been using it a lot with pi.dev and it has been the best experience I had with a local LLM doing agentic coding.

There are a few drawbacks though:

- It can generate some very long reasoning chains. - Current release has a bug where sometimes it goes into an infinite reasoning loop: https://github.com/ggml-org/llama.cpp/pull/19283#issuecommen...

Hopefully StepFun will do a new release which addresses these issues.

BTW StepFun seems to be the same company that released ACEStep (very good music generation model). At least StepFun is mentioned in ComfyUI docs https://docs.comfy.org/tutorials/audio/ace-step/ace-step-v1

by tarruda

2/19/2026 at 6:42:04 PM

Have you tried Qwen3 Coder Next? I've been testing it with OpenCode and it seems to work fairly well with the harness. It occasionally calls tools improperly but with Qwen's suggested temperature=1 it doesn't seem to get stuck. It also spends a reasonable amount of time trying to do work.

I had tried Nemotron 3 Nano with OpenCode and while it kinda worked its tool use was seriously lacking because it just leans on the shell tool for most things. For example, instead of using a tool to edit a file it would just use the shell tool and run sed on it.

That's the primary issue I've noticed with the agentic open weight models in my limited testing. They just seem hesitant to call tools that they don't recognize unless explicitly instructed to do so.

by sosodev

2/19/2026 at 7:01:19 PM

I did play with Qwen3 Coder Next a bit, but didn't try it in a coding harness. Will give it a shot later.

by tarruda

2/19/2026 at 6:42:35 PM

Is getting something like M3 Ultra with 512GB ram and doing oss models going to be cheaper for the next year or two compared to paying for claude / codex?

Did anyone do this kind of math?

by petethepig

2/19/2026 at 7:00:19 PM

No, it is not cheaper. An M3 ultra with 512GB costs $10k which would give you 50 months of Claude or Codex pro plans.

However, if you check the prices on Chinese models (which are the only ones you would be able to run on a Mac), they are much cheaper than the US plans. It would take you forever to get to the $10k

And of course this is not even considering energy costs or running inference on your own hardware (though Macs should be quite efficient there).

by tarruda

2/19/2026 at 5:17:59 PM

Curious on how (if?) changes to the inference engine can fix the issue with infinitely long reasoning loops.

It’s my layman understanding that would have to be fixed in the model weights itself?

by ipython

2/19/2026 at 6:57:04 PM

There's an AMA happening on reddit and they said it will be fixed in the next release: https://www.reddit.com/r/LocalLLaMA/comments/1r8snay/ama_wit...

by tarruda

2/19/2026 at 6:30:09 PM

I think there are multiple ways these infinite loops can occur. It can be an inference engine bug because the engine doesn't recognize the specific format of tags/tokens the model generates to delineate the different types of tokens (thinking, tool calling, regular text). So the model might generate a "I'm done thinking" indicator but the engine ignores it and just keeps generating more "thinking" tokens.

It can also be a bug in the model weights because the model is just failing to generate the appropriate "I'm done thinking" indicator.

You can see this described in this PR https://github.com/ggml-org/llama.cpp/pull/19635

Apparently Step 3.5 Flash uses an odd format for its tags so llama.cpp just doesn't handle it correctly.

by sosodev

2/19/2026 at 7:26:10 PM

> so llama.cpp just doesn't handle it correctly.

It is a bug in the model weights and reproducible in their official chat UI. More details here: https://github.com/ggml-org/llama.cpp/pull/19283#issuecommen...

by tarruda

2/19/2026 at 7:33:15 PM

I see. It seems the looping is a bug in the model weights but there are bugs in detecting various outputs as identified in the PR I linked.

by sosodev

2/19/2026 at 1:59:36 PM

Did you try an MLX version of this model? In theory it should run a bit faster. I'm hesitant to download multiple versions though.

by terhechte

2/19/2026 at 2:26:02 PM

Haven't tried. I'm too used to llama.cpp at this point to switch to something else. I like being able to just run a model and automatically get:

- OpenAI completions endpoint

- Anthropic messages endpoint

- OpenAI responses endpoint

- A slick looking web UI

Without having to install anything else.

by tarruda

2/19/2026 at 5:02:54 PM

Is there a reliable way to run MLX models? On my M1 Max, LM Studio seems to output garbage through the API server sometimes even when the LM Studio chat with the same model is perfectly fine. llama.cpp variants generally always just work.

by KerrAvon

2/19/2026 at 3:14:03 PM

gpt-oss 120b and even 20b works OK with codex.

by lostmsu

2/19/2026 at 4:07:25 PM

Both gpt-oss are great models for coding in a single turn, but I feel that they forget context too easily.

For example, when I tried gpt-oss 120b with codex, it very easily forgets something present in the system prompt: "use `rg` command to search and list files".

I feel like gpt-oss has a lot of potential for agentic coding, but it needs to be constantly reminded of what is happening. Maybe a custom harness developed specifically for gpt-oss could make both models viable for long agentic coding sessions.

by tarruda

2/19/2026 at 4:25:30 PM

Loved reading the reasoning[0] for the recent "Walk or drive to the carwash" trick.

[0] https://gist.github.com/lm2s/c4e3260c3ca9052ec200b19af9cfd70...

Not sure if it's directly accessible, but here's the link: https://stepfun.ai/chats/213451451786883072

by lm2s

2/19/2026 at 10:11:47 AM

> 51.0% on Terminal-Bench 2.0, proving its ability to handle sophisticated, long-horizon tasks with unwavering stability

I don't know anything about TerminalBench, but on the face of it a 51% score on a test metric doesn't sound like it would guarantee 'unwavering stability' on sophisticated long-horizon tasks

by anentropic

2/19/2026 at 12:38:10 PM

51% doesn't tell you much by itself. Benchmarks like this are usually not graded on a curve and aren't calibrated so that 100% is the performance level of a qualified human. You could design a superhuman benchmark where 10% was the human level of performance.

Looking at https://www.tbench.ai/leaderboard/terminal-bench/2.0, I see that the current best score is 75%, meaning 51% is ⅔ SOTA.

by networked

2/19/2026 at 2:07:00 PM

This is interesting, TFA lists Opus at 59. Which is the same as Claude Code with opus on the page you linked here. But it has Droid agent with Opus scoring 69. Which means the CC harness harness loses Opus 10 points on this benchmark.

I'm reminded of https://swe-rebench.com/ where Opus actually does better without CC. (Roughly same score but half the cost!)

by andai

2/19/2026 at 11:59:28 AM

That score is on par with Gemini 3 Flash but these scores look much more affected by the agent used than the model, from scrolling through the results.

by pitched

2/19/2026 at 12:24:39 PM

Gemini 3 Flash is pure rubbish. It can easily get into loop mode and spout information no different than Markov chain and repeat it over and over.

by varispeed

2/19/2026 at 12:51:46 PM

TerminalBench is like the worst named benchmark. It has almost nothing to do with terminal, but random tools syntax. Also it's not agentic for most tasks if the model memorized some random tool command line flags.

by YetAnotherNick

2/19/2026 at 2:58:21 PM

What do you mean? It tests whether the model knows the tools and uses them.

by esafak

2/19/2026 at 3:47:08 PM

Yeah it's a knowledge benchmark not agentic benchmark.

by YetAnotherNick

2/19/2026 at 3:54:03 PM

That's like saying coding benchmarks are about memorizing the language syntax. You have to know what to call when and how. If you get the job done you win.

by esafak

2/19/2026 at 4:07:50 PM

I am saying the opposite. If a coding benchmark just tests the syntax of a esoteric language, it shouldn't be called coding benchmark.

For a benchmark named terminal bench, I would assume it would require some terminal "interaction", not giving the code and command.

by YetAnotherNick

2/19/2026 at 5:05:38 AM

Hallucinates like crazy. use with caution. Tested it with a simple "Find me championship decks for X pokemon", "How does Y deck work". Opus 4.6, Deepseek and Kimi all performed well as expected.

by danieltanfh95

2/19/2026 at 2:59:26 PM

I would use a medium-sized model for execution not its knowledge.

by esafak

2/19/2026 at 9:21:55 AM

I mean, is it possible the latter models used Search? Not saying Stepfun's perfect (it is not.) Gemini especially and unsurprisingly uses search a lot and it is ridiculously fast, too.

by mickeyp

2/19/2026 at 2:33:16 AM

Recent model released a couple of weeks ago. "Mixture of Experts (MoE) architecture, it selectively activates only 11B of its 196B parameters per token". Beats Kimi K2.5 and GLM 4.7 on more benchmarks than it loses to them.

Edit: there are 4 bit quants that can be run on an 128GB machine like a GB10 [1], AI Max+ 395, or mac studio.

[1] https://forums.developer.nvidia.com/t/running-step-3-5-flash...

by kristianp

2/19/2026 at 9:52:09 AM

> Beats Kimi K2.5 and GLM 4.7 on more benchmarks than it loses to them.

Does this really mean anything? I for example, tend to ignore certain benchmarks that are focused towards agentic tasks because that is not my use case. Instruction following, long context reasoning and non-hallucinations has more weight to me.

by Alifatisk

2/19/2026 at 12:56:01 PM

Q4_K_S @ 116 GB

IQ4_NL @112 GB

Q4_0 @ 113 GB

Which of these would be technically better?

[1] https://huggingface.co/bartowski/stepfun-ai_Step-3.5-Flash-G...

by mycall

2/19/2026 at 5:04:42 PM

of those, Q4_K_S is better

by KerrAvon

2/19/2026 at 8:21:47 AM

It's nice to see more focus on efficiency. All the recent new model releases have come along with massive jumps in certain benchmarks but when you dig into it it's almost always paired with a massive increase in token usage to achieve those results (ahem Google Deep Think ahem). For AI to truly be transformational it needs to solve the electricity problem

by culi

2/19/2026 at 8:48:41 AM

And not just token usage, expensive token usage; when it comes to tokens/joule not all tokens are equal. Efficient use of MoE architectures does have an impact on tokens/joule and tokens/sec.

by tankenmate

2/19/2026 at 12:05:12 PM

I like the intelligence per watt and intelligence per joule framing in https://arxiv.org/abs/2511.07885 It feels like a very useful measure for thinking about long-term sustainable variants of AI build-outs.

by mzl

2/19/2026 at 9:28:33 AM

SWE-bench Verified is nice but we need better SWE benchmarks. Making a fair benchmark is a lot of work and a lot of money needed to run it continuously.

Most of "live" benchmarks are not running enough with recent models to give you a good picture of which models win.

The idea of a live benchmark is great! There are thousands of GitHub issues that are resolved with a PR every day.

by mohsen1

2/19/2026 at 9:55:30 AM

Help us out with Terminal Bench 3.0!

https://docs.google.com/document/d/1pe_gEbhVDgORtYsQv4Dyml8u...

by cbracketdash

2/19/2026 at 4:23:21 PM

In a quick test using a few of my standard test prompts a few observations: 1) the trace was very verbose and written in an odd style reminiscent of chat or those annoying one-sentence-per-paragraph LinkedIn posts; 2) token output rate is very high on the hosted version; 3) conformance to instructions and output quality was better than most of the leading models I've tested (e.g. Opus 4.5)

by hedgehog

2/19/2026 at 9:37:56 AM

Number of params isn’t really the relevant metric imo. Top models don’t support local inference. More relevant is tokens per dollar or per second.

by janalsncm

2/19/2026 at 10:03:34 AM

Its an open source model, why wouldn't it be relevant for people who want to self host.....

by dakolli

2/19/2026 at 5:16:22 PM

This one is open weights but comparing to Gemini/Claude etc. on number of params isn’t relevant outside of a research context imo. Users don’t care how many params Gemini has as long as it’s fast and cheap.

by janalsncm

2/19/2026 at 1:54:52 PM

Number of parameters is at least a proxy for model capability.

You can achieve incredible tok/dollar or tok/sec with Qwen3 0.6b.

It just won't be very good for most use cases.

by qeternity

2/19/2026 at 5:19:52 PM

Model capability is the other axis on their chart. So they could have put Qwen 0.6b there, it would be in the bottom right corner.

I know what they are trying to do. They are attempting show a kind of pareto frontier but it’s a little awkward.

by janalsncm

2/19/2026 at 11:24:15 AM

It does since you can run this model locally on a < $3k machine

by lm28469

2/19/2026 at 3:35:33 AM

That reverse x axis sure is confusing.

by wmf

2/19/2026 at 5:03:21 PM

Was going to comment the same thing... Not sure what they are thinking there.

by __mharrison__

2/19/2026 at 6:02:23 AM

I imagine they thought they'd look better this way. I don't think they do.

by esafak

2/19/2026 at 12:21:22 PM

I’ve been using this model for a while, and it’s very fast. It spent some time thinking but does fewer calls. For example, yesterday I asked the agent to find the Gemini quota limit for their API, and it took 27 seconds and just 2 calls, Opus 4.6 took 33 seconds, but 5 calls with less thinking

by tallesborges92

2/19/2026 at 10:48:38 AM

Holy moly, I made a simple coding promt and the amount of reasoning output could fill a small book.

> create a single html file with a voxel car that drives in a circle.

Compared to GLM 4.7 / 5 and kimi 2.5 it took a while. The output was fast, but because it wrote so I had to wait longer. Also output was .. more bare bones compared to others.

by Mashimo

2/19/2026 at 3:21:46 PM

That's how it compensates for its small size. To accomplish a task of certain difficulty either you know more and think less, or vice versa.

by esafak

2/19/2026 at 11:16:18 AM

That's been my experience as well. Huge amounts of reasoning. The model itself is good but even if you get twice as many tokens as with another model, the added amount of reasoning may make it slower in the end.

by Tepix

2/19/2026 at 9:43:57 AM

Interesting.

Each time a Chinese model makes the news, I wonder: How come no major models are coming from Japan or Europe?

by prmph

2/19/2026 at 10:28:26 AM

You would be surprised to see how much the japanese IT industry is behind the times (a decade at least IMO). There is only a very limited startup culture here (both in size and talentpool and business ideas), there is no real risk taking venture capital market here (maybe Masayoshi Son is the exception here, but again he tends to invest in the US mostly) and most software companies use very very very outdated management practices. On top of that most software development had been/has been outsourced to India, Vietnam, China, etc, so management see no value in software talent... SW engineers' social recognition here are mostly on the level of accountants. Under such circumstances japan will never have a chance to contribute to AI meaningfully (other than niche academic research)

by rester324

2/19/2026 at 5:10:00 PM

Seems like the Japanese have had this major blind spot in software engineering since the 90's. Even Sony didn't bother to use what they learned from the PlayStations to produce their own TV OS, outsourcing it to Google. It's as if the 5th generation stuff not working out just burned out that circuit in Japan entirely.

by KerrAvon

2/19/2026 at 11:59:04 AM

1. The US and China are two biggest economies by GDP. 2. The US is the default destination for worldwide investors (because of historically good returns). China has huge state economy and the state can direct investments into this area.

by citrin_ru

2/19/2026 at 3:20:45 PM

EU's GDP is higher than China's

by lostmsu

2/19/2026 at 9:48:31 AM

Have you heard of Mistral? I would consider Mistral major, albeit not frontier.

by jstummbillig

2/19/2026 at 8:55:34 PM

Mistral, huggingface, etc. And things change fast in this field.

by victorbjorklund

2/19/2026 at 4:31:24 PM

Model development takes a massive amount of capital. As far as I can tell, capital in Europe is a lot more risk-averse than in other locales.

by Balinares

2/19/2026 at 11:17:12 AM

The Koreans have released some good models lately. And Mistral is also release open weights models that aren't too shabby.

by Tepix

2/19/2026 at 2:41:38 PM

https://www.businessinsider.com/openclaw-creator-slams-europ...

Europe is a bad place to try and be successful in tech.

by WarmWash

2/19/2026 at 11:49:22 AM

Have you heard of Pleias ? Their SML baguettotron is blazingly fast, and surprisingly good at reasoning (but it's not programming-oriented).

by wazoox

2/19/2026 at 5:37:42 PM

Actually there is even a straight connection: Step-Fun DeepResearch trained on SYNTH (the open Baguettotron dataset).

by Dorialexander

2/19/2026 at 10:24:59 AM

Cause Europe only good at writing fines for other tech companies

by tonis2

2/19/2026 at 10:54:47 AM

Does it pass the carwash test?

by amelius

2/19/2026 at 12:11:03 PM

Yes it did well. Also some other word problems it did well too. Reasoning seems good. But maybe not a great code model

by earth2mars

2/19/2026 at 11:43:32 AM

What's that?

by oblio

2/19/2026 at 11:53:39 AM

https://news.ycombinator.com/item?id=47031580

by amelius

2/19/2026 at 4:53:41 AM

So who exactly is StepFun? What is their business (how do they make money)? Each time I click “About Stepfun” somewhere on their website, it sends me to a generic landing page in a loop.

by SilverElfin

2/19/2026 at 8:59:29 AM

They've been around a couple years. This is the first model that has really broken into the anglosphere.

Keep a tab on aihubmix, the Chinese openrouter, if you want to stay on top of the latest models. They keep track of things like the Baichuan, Doubao, baai (beijing academy), Meituan, 01.AI (yi), xiaomi, etc...

Much larger chinese coverage than openrouter

by kristopolous

2/19/2026 at 12:48:54 PM

> This is the first model that has really broken into the anglosphere.

Before Step 3.5 Flash, I've been hearing a lot about ACEStep as being the only open weights competitor to Suno.

by tarruda

2/19/2026 at 9:26:42 AM

>first model that has really broken into the anglosphere.

Do you know of a couple of interesting ones that haven't yet?

by Havoc

2/19/2026 at 9:31:56 AM

doubao (bytedance) seed models are interesting

Keep your eye on Baidu's Ernie https://ernie.baidu.com/

Artificial analysis is generally on top of everything

https://artificialanalysis.ai/leaderboards/models

Those two are really the new players

Nanbeige which they haven't benchmarked just put out a shockingly good 3b model https://huggingface.co/Nanbeige - specifically https://huggingface.co/Nanbeige/Nanbeige4.1-3B

You have to tweak the hyper parameter like they say but I'm getting quality output, commensurate with maybe a 32b model, in exchange for a huge thinking lag

It's the new LFM 2.5

by kristopolous

2/19/2026 at 12:38:25 PM

Never heard of Nanbeige, thanks for sharing. "Good" is subjective though, in which tasks can I use it and where to avoid?

by admiralrohan

2/19/2026 at 12:40:46 PM

it's a 3b model. Fire it up. If you have ollama just do this:

    ollama create nanbeige-custom -f <(curl https://day50.dev/Nanbeige4.1-params.Modelfile)

That has the hyperparameters already in there. Then you can try it out

It's taking up like 2.5GB of ram.

my test query is always "compare rust and go with code samples". I'm telling you, the thinking token count is ... high...

Here's what I got https://day50.dev/rust_v_go.md

I just tried it on a 4gb raspberry pi and a 2012 era x230 with an i5-3210. Worked.

It'll take about 45 minutes on the pi which you know, isn't OOM...so there's that....

by kristopolous

2/19/2026 at 1:05:08 PM

Thanks!

by Havoc

2/19/2026 at 5:12:48 AM

https://en.wikipedia.org/wiki/StepFun

by 0x1997

2/19/2026 at 5:42:19 AM

Thanks. Do they sell any of these products today or is it more like research? I am not able to find anything relating to pricing on their website. Just a chatbot.

by SilverElfin

2/19/2026 at 6:35:23 AM

Princing can be found on their docs website https://platform.stepfun.ai/docs/en/pricing/details

by 0x1997

2/19/2026 at 12:47:30 PM

They seem to be the same company that released ACEStep music generation model: https://acestep.io/

Though the only mention I found was in ComfyUI docs: https://docs.comfy.org/tutorials/audio/ace-step/ace-step-v1

by tarruda

2/19/2026 at 5:01:39 AM

Might want to give it a search.

by deaux

2/19/2026 at 9:33:09 AM

Works impressively well with pi.dev minimal agent.

by sinenomine

2/19/2026 at 3:35:42 PM

Any pelicans from non-quantized variants?

by lostmsu

2/19/2026 at 6:37:01 AM

what country is behind this one ?

by agentifysh

2/19/2026 at 6:48:23 AM

Step 3.5 Flash was made by Chinese company StepFun - https://en.wikipedia.org/wiki/StepFun

by personalcompute

2/19/2026 at 2:04:07 PM

[dead]

by octoclaw