6/24/2026 at 6:32:03 PM
I have written a couple of eval harnesses to see how well LLMs drive software I have written. Basically, I have data analysis software that I need LLMs to write code for. The code is complex, and I want to shape my APIs so that LLMs do a better job of quickly getting to the right answer. So I test different prompts and API surfaces. It's really easy to make quick gains this way and save your users from bugs. In this paradigm I'm explicitly not comparing models against each other, though I am very interested to see how lesser models do with my software. For this type of testing, open-weight models are also faster, cheaper, and more reliable to test against than frontier models, because I can trust that kimi-2.5-a-bunch-of-specs is going to behave more consistently than whatever tweaks Claude is making to Sonnet this week. API and prompting improvements seem to carry across the different models, at least for gross improvements.

I haven't looked that hard, but I can't find articles about this type of eval testing. Curious to hear if others have approached writing APIs in this way.
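To make the idea concrete, here is a minimal sketch of this kind of harness, not the actual one. It holds the model fixed and varies only the prompt/API description, then reports a pass rate per variant. Every name in it (`Variant`, `Task`, `pass_rate`, `compare`, `fake_generate`, the `summarize` helper) is hypothetical, and `fake_generate` is a stand-in for a real call to a pinned open-weight model so the example runs offline.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Signature assumed for the model call: (system_prompt, question) -> generated code.
Generate = Callable[[str, str], str]


@dataclass(frozen=True)
class Variant:
    """One prompt/API-surface candidate shown to the model (hypothetical)."""
    name: str
    system_prompt: str


@dataclass(frozen=True)
class Task:
    """A question plus a checker that decides whether the generated code is acceptable."""
    question: str
    check: Callable[[str], bool]


def pass_rate(generate: Generate, variant: Variant, tasks: List[Task], trials: int = 3) -> float:
    """Fraction of (task, trial) attempts whose output passes the task's checker."""
    passed = 0
    for task in tasks:
        for _ in range(trials):
            code = generate(variant.system_prompt, task.question)
            passed += int(task.check(code))
    return passed / (len(tasks) * trials)


def compare(generate: Generate, variants: List[Variant], tasks: List[Task], trials: int = 3) -> Dict[str, float]:
    """Score every variant against the same model and the same tasks."""
    return {v.name: pass_rate(generate, v, tasks, trials) for v in variants}


def fake_generate(system_prompt: str, question: str) -> str:
    """Stand-in for a real model call: it only uses the helper if the prompt documents it."""
    if "summarize(df, by=" in system_prompt:
        return "summarize(df, by='region')"
    return "df.groupby('region').agg()"


variants = [
    Variant("terse", "You write code for a data analysis library."),
    Variant("documented", "You write code for a data analysis library. "
                          "Prefer summarize(df, by=...) for grouped summaries."),
]
tasks = [Task("Summarize sales by region.", lambda code: "summarize(" in code)]

results = compare(fake_generate, variants, tasks)
for name, rate in results.items():
    print(f"{name}: {rate:.0%}")
```

In a real setup the checker would presumably execute the generated code against known data rather than match a string, and `fake_generate` would be replaced by a request to whichever pinned model is being tested.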
by paddy_m