6/26/2026 at 7:40:08 PM
I'm glad there are more attempts at solving model routing, as costs (at API rates) have really become an issue. Some feedback:
1. Reiterating the cache issue from other comments already here: there is a lot of optimisation in harnesses around caching, and a proxy model blows that up.
2. Coding agents are model aware - they already route code discovery to mini / flash models, planning to heavy models, workflow design to ultra, implementation to mid / high etc. They know when they're exploring, planning, implementing, or reviewing, which model class to select for each, and when an attempt fails.
With a proxy you're breaking this control loop and its feedback. It doesn't know, for example, that it just attempted with DeepSeek V4 and failed, so let's try Opus.
3. How are you going to RL improvements and prevent the router becoming stale? You only have access to your own internal prompts and ~thousands of samples.
This is RL'd on one org's codebase. There are going to be a lot of prompts you haven't seen before and have no insight into how to route correctly, and you have no insight into users' HF to improve your own model. Orgs aren't going to share their traces with you, so you need other sources to train on and improve.
There are also new model releases every week that you need to keep up with. What's the story going to be here?
4. Publish evals by running terminalbench / deepswe bench. Show us the performance / cost / time chart vs the other agent and model sets. If you can show gains there, you have a very simple value prop to sell where you can charge for a % of the saved costs
by nikcub
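To make the control loop from point 2 concrete, here is a minimal sketch in Python of an agent that remembers which model tier it already tried and escalates on failure. The tier names, the `run_attempt` callable, and the escalation order are all hypothetical and stand in for whatever a real harness does; the only point is that the failure history lives in the agent, which a proxy handling each request in isolation would not see.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

# Hypothetical model tiers, cheapest first. These are placeholders,
# not real model identifiers.
TIERS = ["cheap", "mid", "heavy"]


@dataclass
class AgentLoop:
    # run_attempt(model, task) -> True if the attempt succeeded.
    # This callable is an assumption standing in for a real model call.
    run_attempt: Callable[[str, str], bool]
    # The agent's own record of (model, succeeded) pairs.
    history: List[Tuple[str, bool]] = field(default_factory=list)

    def solve(self, task: str) -> Optional[str]:
        """Try each tier in order and return the first model that succeeds."""
        for model in TIERS:
            ok = self.run_attempt(model, task)
            self.history.append((model, ok))
            if ok:
                return model
        return None  # every tier failed
```

A harness would call `AgentLoop(run_attempt=...).solve(task)` and could inspect `history` afterwards to see which tiers failed before the one that worked.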
6/26/2026 at 8:07:22 PM
Really appreciate the thoughtful feedback!
1. Agree it's important. Fwiw the proxy model doesn't blow this up though - it only incurs a one-time cost when switching models, and we're aware of that when making routing decisions.
2. The agents are model aware, yes, but they are not incentivized to optimize too heavily here (in particular they don't use OS models even when they would be better). I think that's where this router comes in and brings genuine improvement.
3. Two parts here: 1 is continuing to grow our golden dataset over time, 2 is using reward signals from production traffic (on a per-customer basis or, if allowed, across all users)
4. Yes we have these internally, great callout that we should publish! Will do + will link from the repo soon. (Fwiw I think these benchmarks are useful but don't fully capture vibes - you should try it out yourself for that!)
by adchurch
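The one-time cost of switching models mentioned in point 1 of the reply can be sketched as a break-even check. This is not the router's actual logic, only a simplified illustration in Python: it assumes a fixed context size, ignores output tokens and context growth, and assumes the first turn on a new model pays the uncached input rate for the whole context while later turns pay the cached rate. The `Price` type, the function names, and every number in the usage note are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    # Hypothetical prices in dollars per million input tokens.
    input_per_mtok: float   # uncached input
    cached_per_mtok: float  # cache-hit input


def cost_stay(context_tokens: int, turns: int, price: Price) -> float:
    """Cost of staying on the current model: every turn hits the cache."""
    return turns * context_tokens * price.cached_per_mtok / 1e6


def cost_switch(context_tokens: int, turns: int, price: Price) -> float:
    """Cost of switching: the first turn re-reads the context uncached,
    the remaining turns hit the new model's cache."""
    first = context_tokens * price.input_per_mtok
    rest = (turns - 1) * context_tokens * price.cached_per_mtok
    return (first + rest) / 1e6


def switch_is_worth_it(context_tokens: int, turns: int,
                       current: Price, candidate: Price) -> bool:
    """True if switching is cheaper over the remaining turns."""
    return cost_switch(context_tokens, turns, candidate) < cost_stay(
        context_tokens, turns, current
    )
```

With made-up prices of `Price(10.0, 1.0)` for the current model and `Price(2.0, 0.2)` for the candidate, and a 100,000-token context, switching does not pay off for a single remaining turn but does for five, which is the kind of trade-off a router would have to weigh.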