6/29/2026 at 6:24:04 PM
Previously: https://news.ycombinator.com/item?id=48709744
https://swelljoe.com/post/will-it-mythos/: "Poor performer here, only found the one bug that almost every model found, despite its performance on other benchmarks being excellent for its size. […] It also performs poorly in a chat without tools, exhibiting an enthusiasm for hallucination. I’m currently working on a replication of this with full tool access, including bash/Python, which may allow this model to be competitive."
by CharlesW
6/29/2026 at 7:56:25 PM
> It also performs poorly in a chat without tools, exhibiting an enthusiasm for hallucination. I’m currently working on a replication of this with full tool access, including bash/Python, which may allow this model to be competitive.
How is that a serious phrase in '26? I mean, I have no idea if this fine-tune is good, I haven't tried it, but testing a (clearly) agentic model without tool access and expecting it to work is crazy, no? What was he even testing?!
by NitpickLawyer
6/29/2026 at 8:47:31 PM
The last thing you want a model to do is hallucinate a tool call and its outputs...
by nodja
6/29/2026 at 8:10:44 PM
Maybe expecting it to recognize its limitations without tools instead of hallucinating. But yeah, not wholly useful. Its performance (and proclivity for hallucination) with tools is what really matters.
by vikingcat
6/29/2026 at 9:07:22 PM
Visual Inspection Before Execution… it’s all vibe…
by reactordev
6/29/2026 at 11:09:53 PM
That benchmark ranks Kimi K2.6 and K2.7 Code near the bottom. Both are below Ornith 35B. It ranks Gemma 4 26B much higher than GLM-5.2. The results don't make much sense.
by juliangoldsmith