Ornith-1.0: self-improving open-source models for agentic coding

6/29/2026 at 6:24:04 PM

Previously: https://news.ycombinator.com/item?id=48709744

https://swelljoe.com/post/will-it-mythos/: "Poor performer here, only found the one bug that almost every model found, despite its performance on other benchmarks being excellent for its size. […] It also performs poorly in a chat without tools, exhibiting an enthusiasm for hallucination. I’m currently working on a replication of this with full tool access, including bash/Python, which may allow this model to be competitive."

by CharlesW

6/29/2026 at 7:56:25 PM

> It also performs poorly in a chat without tools, exhibiting an enthusiasm for hallucination. I’m currently working on a replication of this with full tool access, including bash/Python, which may allow this model to be competitive.

How is that a serious phrase in '26? I mean I have no idea if this fine-tune is good, haven't tried it, but testing a (clearly) agentic model without tool access and expecting it to work is crazy, no? What was he even testing?!

by NitpickLawyer

6/29/2026 at 8:47:31 PM

Last thing you want a model to do is hallucinate a tool call and its outputs...

by nodja

6/29/2026 at 8:10:44 PM

Maybe expecting it to recognize its limitations without tools instead of hallucinating. But yeah, not wholly useful. Its performance (and proclivity to hallucinations) with tools is what really matters.

by vikingcat

6/29/2026 at 9:07:22 PM

Visual Inspection Before Execution… it’s all vibe…

by reactordev

6/29/2026 at 11:09:53 PM

That benchmark ranks Kimi K2.6 and K2.7 Code near the bottom. Both are below Ornith 35B. It ranks Gemma 4 26B much higher than GLM-5.2. The results don't make much sense.

by juliangoldsmith

6/29/2026 at 8:05:25 PM

This is the first Qwen fine-tune that is not immediately rejected by the local LLM community, and in some cases it is even being recommended. Based on my limited usage, it is good and gives creative solutions to coding problems. I don't expect 9-35B models to one-click create full apps. Most people who were complaining did so .

by ricardobayes

6/29/2026 at 11:18:15 PM

The local LLM community is now teeming with erstwhile crypto and NFT hucksters who've brought the culture of hype from their former communities with them. There still are a few deeply technical people left, but their voices are being crowded out by the vapid marketers'.

by woadwarrior01

6/30/2026 at 12:32:24 AM

I've also noticed this. I wonder what causes the overlap. It can't be as simple as crypto and LLMs requiring the same hardware.

by S0y

6/30/2026 at 12:48:41 AM

They both feed off of hype. The people posting about crypto are not the same people GPU-mining crypto, so I wouldn’t chalk it up to the same hardware.

by aardvarkr

6/30/2026 at 6:37:12 AM

Money.

by adastra22

6/30/2026 at 3:04:49 PM

/r/localllama is not like that at all.

by BoredomIsFun

6/30/2026 at 4:41:49 AM

There are certain influential people on Twitter, who if you see them start tweeting on a subject, you know the influx of hype and hucksters is coming.

by TheMagicHorsey

6/30/2026 at 12:38:55 PM

*cough* Karpathy *cough*

by anon373839

6/29/2026 at 8:30:04 PM

> Most people who were complaining did so .

It has been this way since the beginning, unfortunately. There is certainly no harm in trying out local models on local workloads with modest guardrails.

Like most of these models (Qwen, Gemma, Llama, gpt-oss), finding all the little gotchas, like special tokens, prompt structure, and model preferences, is a PITA right now. The reward is really nice models that run exceptionally well in agentic harnesses tuned with the prompts and parameters you fought so hard to learn.

by monkmartinez

6/29/2026 at 9:14:26 PM

It's not any better. Most of us in the LocalLLama community don't like it, except a few new people popping up and making posts.

by v3ss0n

6/29/2026 at 9:33:11 PM

Indeed, it performed worse than Qwen3.6-27b in my basic test.

It gave a fancier looking answer, but did a worse job following the prompt.

by gslepak

6/29/2026 at 9:37:43 PM

Roughly my experience so far; it trips up on itself a bit.

However, it's much more inclined to do web search unprompted, which is fascinating in its own way.

by dofm

6/29/2026 at 9:44:43 PM

> LocalLLama community

Ah, the place that shit on gpt-oss because it wasn't good at porn. That place is not what it used to be, hasn't been since that karpathy tweet, tbh. It's mostly slop and vibes nowadays.

by NitpickLawyer

6/29/2026 at 10:03:21 PM

And a lot of bots advertising renamed models like this one.

by v3ss0n

6/30/2026 at 4:26:39 AM

Where’s a good place to go instead nowadays?

by boredatoms

6/29/2026 at 8:37:26 PM

We must be in different communities... Qwen models are the most recommended ones that will actually run on local hardware that is accessible to the masses!

by arcanemachiner

6/29/2026 at 9:06:49 PM

Yeah, but they're talking about fine-tunes.

by montroser

6/30/2026 at 5:12:49 AM

From what I personally tested, Ornith-1.0 35B is slightly better than Qwen-3.6 35B. My tests are tasks that consist of adding/modifying features in a big C++ codebase. The part that I find interesting is that the model is way faster than Qwen3.6 35B. It seems Ornith produces a shorter chain of thought. In my tests it can be 3 times faster to produce the answer.

I use it via llamacpp and codex-cli.

by Narew

6/30/2026 at 12:59:02 PM

I've been testing Ornith-1.0 35B (my own FP8-block quant) and I like it. It runs at >200 tok/s w/ vLLM on an RTX PRO 6000 (sm120), and I've run >140M cached tokens of agentic coding work on it over the past few days. It seems to be somewhere between Qwen 3.6 35B-A3B and 27B, but the good thing: it overthinks/doom-loops a lot less than Qwen 3.6. When looking at the thinking traces, I like its breakdown approach template.

It does a good job on basic analysis, tasks, and some front-end/backend changes on a medium-sized Go codebase, but it reached its limits, totally botching a longer (simple) kernel implementation job (about 100 iterations in the Pi Agent harness) - this is the type of thing that stronger open models (Kimi K2.6, GLM 5.2) are able to do.

by lhl

6/30/2026 at 4:17:03 PM

With this model size I've found that the harness seems to matter more. I've moved on to little-coder rather than raw pi with qwen3.6 27b personally, it might be worth taking a look.

by regularfry

6/29/2026 at 6:23:54 PM

Can anyone explain what’s the story here? Is this just a re-skinned qwen? Who is deepreinforce-ai and why isn’t this model listed on their website?

How does it self-improve, does the model change on disk - or just during a single context run it gets better?

by kennywinker

6/29/2026 at 6:29:00 PM

It doesn't self-improve, that's a misleading headline.

As far as I can tell they trained it by running their own reinforcement learning on top of Qwen and Gemma 4 (not sure how they combined weights from both, or if they used Qwen as the basis and Gemma 4 to help train?) - so the "self-improving" is about their training process, not how you use the weights.

by simonw

6/29/2026 at 6:45:04 PM

I think the 9b and 31b dense are Gemma models and the 35B-MoE and 397B-MoE are Qwen models, since these are the model sizes covered by each of them respectively.

by kamranjon

6/30/2026 at 9:13:22 PM

Only the 31b is Gemma.

All the rest - including 9B - are Qwen 3.5/3.6:

https://huggingface.co/deepreinforce-ai/Ornith-1.0-9B/blob/m...

by nacs

7/1/2026 at 11:19:31 AM

ah yeah you're correct - sorry for the confusion

by kamranjon

6/29/2026 at 9:17:07 PM

Do you think we will get a self-improving model in '26 or '27? Maybe not a native one, but some kind of hack so a model will learn something without losing part of the context window?

by sisve

6/29/2026 at 6:31:07 PM

Gotcha. That makes more sense. We ran the model to train the model -> “self-improving”.

by kennywinker

6/29/2026 at 9:15:29 PM

Clickbait title.

by v3ss0n

6/29/2026 at 7:53:43 PM

These are simply benchmaxxed versions of either Qwen or Gemma 4.

by S0y

6/29/2026 at 9:05:02 PM

If so, it's impressive they managed to benchmaxx Qwen even further than it's already benchmaxxed.

by 2001zhaozhao

6/29/2026 at 9:15:15 PM

Nah, they just put up graphs with different colors prioritizing themselves.

by v3ss0n

6/29/2026 at 8:11:04 PM

Citation needed

by jorisw

6/29/2026 at 9:43:58 PM

Sure. https://deep-reinforce.com/ornith_1_0.html

> Built on top of pretrained Gemma 4 and Qwen 3.5, it achieves state-of-the-art performance among open-source models of comparable size on coding benchmarks.

> Ornith-1.0 is a self-improving training framework. Instead of relying on human-designed harnesses to drive solution generation in RL, Ornith-1.0 learns to generate both solution rollouts and the task-specific harnesses that guide those rollouts.

by S0y

6/29/2026 at 11:16:02 PM

> the dense 9B fits on a single 80GB GPU

Us mere mortals cannot use this.

by giancarlostoro

6/30/2026 at 4:22:06 PM

Seems weird. A 9B model would normally fit unquantised on a 24GB GPU.

by regularfry
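
The 24GB figure in the comment above can be checked with back-of-the-envelope arithmetic. This is only a rough sketch: it assumes "unquantised" means 2 bytes per parameter (FP16/BF16), counts the weights alone, and ignores KV cache, activations, and runtime overhead, so real usage will be higher. The function name and the 8-bit/4-bit rows are illustrative assumptions, not figures from the model card.

```python
def weight_memory_gb(params_billions: float, bytes_per_param: float) -> float:
    """Approximate memory for the model weights alone, in decimal GB.

    Assumption: every parameter is stored at the same width; KV cache,
    activations, and framework overhead are not included.
    """
    total_bytes = params_billions * 1e9 * bytes_per_param
    return total_bytes / 1e9


if __name__ == "__main__":
    # Hypothetical storage widths, for comparison only.
    for label, bytes_per_param in [("FP16/BF16", 2.0), ("8-bit", 1.0), ("4-bit", 0.5)]:
        gb = weight_memory_gb(9, bytes_per_param)
        print(f"9B weights at {label}: ~{gb:.1f} GB")
```

Under these assumptions the 9B weights come to roughly 18 GB at 16-bit precision, which leaves only a few GB of a 24GB card for context.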

6/30/2026 at 5:05:15 AM

There are already quantizations available

by armarr

6/30/2026 at 12:53:38 PM

It would be nice to run a model that isn't quantized to death just so it fits in 12GB of VRAM, so I have room for a reasonable context window. But also, this is ONE model in a set of models; the rest of the models need to run on a GPU cluster, apparently.

by giancarlostoro

6/30/2026 at 10:15:47 PM

Why don't these 'self-improving' ones eventually improve to the point of being better than the bleeding edge?

by LoveMortuus

6/30/2026 at 11:21:08 AM

I've used a lot of local models and all of them felt like toys. This one actually felt useful. I hear Qwen 36-A3B is also good, yet to try that one.

by fareesh

6/29/2026 at 9:13:28 PM

Self-Improving bullshit. It is just a benchmaxxed Qwen 3.5 finetune. Nothing spectacular. It even fails at benchmarks. Long-session tool calls suck, and it hallucinates a lot there too. Just use Qwen 3.6 and 3.5 122b.

by v3ss0n

6/30/2026 at 12:47:38 PM

Weird they talk about their 31B dense model but haven't actually released it anywhere.

by smcleod

6/30/2026 at 4:00:32 PM

Self-improving systems are exciting, but they also make provenance and governance much harder. Once agents can modify their own behavior over time, understanding why an agent behaved a certain way becomes increasingly important.

by GenseeAI

6/29/2026 at 8:12:49 PM

They keep mentioning a 31B dense model, but there are no benchmarks or weights for it anywhere?

by anana_

6/30/2026 at 4:45:43 AM

Glad to see more open models. However, where are the 31b models?

by RandyOrion

6/30/2026 at 6:46:56 AM

Could Ornith's self-scaffolding learn to scaffold the rlm loop?

by agenticup
