Why eval startups fail (2025)

6/24/2026 at 6:32:03 PM

I have written a couple of eval harnesses to see how well LLMs drive software I have written. Basically I have data analysis software that I need LLMs to write code for. The code is complex, and I want to shape my APIs such that LLMs do a better job of quickly getting to the right answer. So I test different prompting and API surfaces; it's really easy to make quick gains this way and save your users from bugs. In this paradigm, I'm explicitly not testing different models, and I'm very interested to see how lesser models do with my software. Also, for this type of testing, using the open-weight models makes it faster, cheaper, and more reliable to test vs. frontier models, because I can trust that kimi-2.5-a-bunch-of-specs is going to behave more consistently than whatever tweaks Claude is making to Sonnet this week. API and prompting improvements seem to carry across the different models for gross improvements.
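A minimal sketch of that kind of harness, assuming a pinned model and a fixed task set. Everything named here (`API_DOCS`, `TASKS`, `fake_model`, `pass_rate`) is hypothetical, and `fake_model` is a seeded stand-in rather than a real LLM call: it is rigged to succeed more often when the API description includes a worked example, purely so the comparison has something to measure.

```python
import random

# Hypothetical API descriptions under test: same function, two ways of documenting it.
API_DOCS = {
    "terse": "agg(col, fn): aggregate a column.",
    "with_example": "agg(col, fn): aggregate a column. Example: df.agg('price', 'mean')",
}

# Hypothetical tasks, each with the call we expect the generated code to contain.
TASKS = [
    {"ask": "Compute the mean of price.", "expected": "df.agg('price', 'mean')"},
    {"ask": "Compute the max of qty.", "expected": "df.agg('qty', 'max')"},
]

def fake_model(api_doc, task, rng):
    """Stand-in for a pinned open-weight model (NOT a real LLM call).
    Assumption for illustration: a worked example in the docs raises the success odds."""
    p_correct = 0.9 if "Example:" in api_doc else 0.5
    return task["expected"] if rng.random() < p_correct else "df.aggregate()"

def pass_rate(api_doc, trials=200, seed=0):
    """Run every task `trials` times and return the fraction of passing generations."""
    rng = random.Random(seed)  # fixed seed so reruns are comparable
    outcomes = [
        task["expected"] in fake_model(api_doc, task, rng)
        for _ in range(trials)
        for task in TASKS
    ]
    return sum(outcomes) / len(outcomes)

# Hold the model fixed, vary only the API surface, and compare pass rates.
rates = {name: pass_rate(doc) for name, doc in API_DOCS.items()}
best = max(rates, key=rates.get)
```

The design point is that the model is held constant and only the API description varies, so a difference in pass rate can be attributed to the API surface rather than to model drift.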

I haven't looked that hard, but I can't find articles about this type of eval testing, curious to hear if others have approached writing APIs in this way.

by paddy_m

6/24/2026 at 10:50:41 AM

I built a simple (free) eval tool for my own uses (GitHub Gists + Model Outputs) after not being able to find a suitable one on the market.

The market's being split into:

1. Longitudinal LLM observability tooling

Most eval startups have gone down the route of something more like being an observability platform for LLM inference. They want to be in your stack and running the inference to collect data on performance of it.

They collect things like how often a model returns JSON that's out of spec or returns values that aren't expected as well as general timing and cost info.

2. Safety Limiting / Pentesting

Say you're doing something in the medical field or that's sensitive in some way and you want to figure out what model has the best outputs for your task that won't fly off the guardrails.

3. Simple cost + performance + quality swapping

This is what my tool does, basically lets you test if you _really_ need to be running that frontier model in a loop across a million records or if you'd be better with an older model or something else.
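The third category boils down to a small selection problem, sketched here under loud assumptions: the candidate names, prices, and quality scores below are invented for illustration and are not measurements of any real model, and `cheapest_meeting_bar` is a hypothetical helper, not part of any tool mentioned in this thread.

```python
# Hypothetical candidates: cost in USD per 1,000 records, quality as an eval score in [0, 1].
CANDIDATES = [
    {"name": "frontier-large", "cost_per_1k": 12.00, "quality": 0.95},
    {"name": "previous-gen", "cost_per_1k": 3.00, "quality": 0.93},
    {"name": "small-open-weight", "cost_per_1k": 0.40, "quality": 0.81},
]

def cheapest_meeting_bar(candidates, min_quality):
    """Return the lowest-cost candidate whose eval score clears the bar, or None."""
    eligible = [c for c in candidates if c["quality"] >= min_quality]
    return min(eligible, key=lambda c: c["cost_per_1k"], default=None)

choice = cheapest_meeting_bar(CANDIDATES, min_quality=0.90)

# Savings over a million records versus always using the most expensive candidate.
records = 1_000_000
most_expensive = max(CANDIDATES, key=lambda c: c["cost_per_1k"])
savings = (most_expensive["cost_per_1k"] - choice["cost_per_1k"]) * records / 1000
```

With these made-up numbers the older model clears a 0.90 bar at a quarter of the cost; the hard part in practice is producing a quality score you trust for your own task.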

https://evvl.ai/

Example eval: https://giyd8stidy.evvl.io

by michaelbuckbee

6/24/2026 at 1:07:08 PM

Cool project! I haven't seen that OpenRouter workflow yet (sign into OpenRouter and it creates an API key that your app can use), that looks like an interesting pattern to investigate.

My company recently built a tool that is closer to your first category, but it's an API so it doesn't have the security (supply chain) concern of being embedded in your application.

https://endpointevaluator.com

It's built to help people manage the risk of LLMs changing underneath them and drifting from their designed behavior. Traditional deterministic testing probably won't be sufficient for apps that provide nondeterministic output, like a chatbot backed by an LLM.

The point in the linked article about the challenge of selling developer tools to developers is a good one. I think the first reaction to coding agents is "let's build everything ourselves!" but the long tail of maintenance is still there and the pendulum will probably swing back to "let's stick to our knitting."

by gavinboston

6/24/2026 at 12:41:55 PM

[flagged]

by jimmypk

6/24/2026 at 6:25:09 PM

I can see where Goodhart's Law applies to psychology and economics, pretty much any man-made domain without IDLH (immediate-danger-to-life-and-health) outcomes. But I think it's going to be hard to Goodhart a lot of medical AI safety. Biology doesn't give a shit.

However, identifying the right metrics and having the necessary test sets will, at times, be challenging.

by 0xWTF

6/24/2026 at 8:49:52 AM

What's an eval?

by theteapot

6/24/2026 at 9:31:00 AM

(Author) It's short for "evaluation", a test for an AI model. Specifically, an AI evaluation comprises (1) a dataset of prompts (as questions / tasks / queries), (2) some way to score model performance on each prompt, like a set of correct answers or a grading rubric that you can use with an LLM autograder, and (3) a metric, such as accuracy¹. (If you're already familiar with the term "benchmark", it's the same thing; for some reason "eval" has become the term of art in the past few years.)

For example, a simple eval is a dataset of multiple-choice questions, which each have one correct answer, and scored by accuracy. An example of this kind of eval is the Massive Multitask Language Understanding benchmark (2020) (https://arxiv.org/abs/2009.03300).
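The three parts of that definition (dataset, per-prompt scorer, metric) fit in a few lines. This is a toy sketch, not MMLU itself: the questions are made up, and `always_a` is a hypothetical stand-in for a real model call.

```python
from typing import Callable

# (1) Dataset: hypothetical multiple-choice items, each with exactly one correct label.
DATASET = [
    {"prompt": "2 + 2 = ?", "choices": {"A": "3", "B": "4", "C": "5"}, "answer": "B"},
    {"prompt": "Capital of France?", "choices": {"A": "Paris", "B": "Rome", "C": "Oslo"}, "answer": "A"},
    {"prompt": "Largest planet?", "choices": {"A": "Mars", "B": "Venus", "C": "Jupiter"}, "answer": "C"},
]

def score(predicted: str, expected: str) -> int:
    """(2) Per-prompt scorer: 1 for an exact label match, else 0."""
    return int(predicted.strip().upper() == expected)

def accuracy(model: Callable[[str, dict], str], dataset: list) -> float:
    """(3) Metric: fraction of prompts answered correctly."""
    total = sum(score(model(item["prompt"], item["choices"]), item["answer"]) for item in dataset)
    return total / len(dataset)

def always_a(prompt: str, choices: dict) -> str:
    """Stand-in for a real model call; always picks choice A."""
    return "A"

result = accuracy(always_a, DATASET)  # 1 of 3 correct
```

Swapping `always_a` for a function that calls an actual model is the only change needed to turn this into a real, if tiny, eval.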

A more complex eval is FrontierCode (2026). Questions in FrontierCode represent coding tasks needed for real-world repos and are evaluated against rubrics scoring for correctness, code quality, cleanliness, and other factors. https://cognition.com/blog/frontier-code.

¹Note that this is slightly different from the definition we used in [0], which defined an eval as a fixed set of input-output correspondence pairs combined with a metric. What's different from 2021 is: models are now given more open-ended inputs (prompts like "find the bug" and a codebase rather than a simple question), have freeform generation (rather than choosing a fixed answer), and are graded in a more complex manner (e.g. beyond correctness, one might also want a coding eval to grade adherence to coding guidelines, test coverage, etc.).

[0] Liao, T., Taori, R., Raji, I. D., & Schmidt, L. (2021, January). Are we learning yet? a meta review of evaluation failures across machine learning. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2). https://thomasliao.com/are_we_learning_yet.pdf

by thomasliao

6/24/2026 at 10:21:35 AM

Would've started the article out alluding to this, or added a tooltip or something to this effect

by jorisw

6/24/2026 at 1:33:08 PM

So basically an "eval startup" is somewhere between a middleman or a Consumer Reports?

by gwbas1c

6/24/2026 at 12:21:21 PM

That sounds a bit weak as a startup idea. Hard to productize, hard to scale, etc. It sounds more like consulting.

by jillesvangurp

6/24/2026 at 7:33:27 PM

Yah - it's simultaneously the hardest and most vital part of machine learning work, and completely non-monetizable.

by sdenton4

6/24/2026 at 8:54:29 AM

Evaluations of different implementations of a tech. Kind of like a meta service layer on top of an industry, such as "Which frontier model is best?"

I do agree that the author does not do a good job of introducing the term.

by choult

6/24/2026 at 9:25:56 AM

"Which frontier model is best?"

What kind of stupid business is this. Though nothing can beat SEO in that spirit.

by wseqyrku

6/24/2026 at 9:34:05 AM

It's an important question! If you are paying a lot of money to use AI models, you care that you are using the best one for your task. And it turns out that figuring out which AI model is best for your task is not trivial and requires some expertise.

by thomasliao

6/24/2026 at 9:48:49 AM

That was too nice of a reply, I apologize. I just can't understand the thought process: what exactly are we optimizing for? If you are paying a lot of money to use AI models, you already have so much overhead that precise ranking in an eval is not gonna make much difference between equally "frontier" models. Especially since models are sensitive to the input. So the eval is just gonna evaluate the eval with very high accuracy. It might be equivalent to the illusion of safety thing applied to financial risk.

by wseqyrku

6/24/2026 at 7:32:39 PM

There is a larger question of "do I need a frontier model for this" - knowing the cost/benefit tradeoff using frontier vs. e.g. local models is extremely valuable and takes skill to do!

by jmalicki

6/24/2026 at 10:08:35 AM

>equally "frontier" models

A key point I want to make is that the notion of "frontier" is somewhat fictive in the sense that a model which dominates all others on a given eval is not guaranteed to be number one on another eval, even if both evals are ostensibly for the same task.

For example, the best publicly-available model (i.e. excluding Claude Mythos and Fable) on DeepSWE[0] is gpt-5.5-xhigh at 67%, which is soundly better than claude-opus-4.8-max at 59%. I would say an 8pp gap on a benchmark is quite large. But on FrontierCode[1], claude-opus-4.8-xhigh is the best, at a score of 13.4% compared to gpt-5.5-medium at 6.3%.

That's quite a significant reversal!
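The reversal can be checked mechanically from the four scores quoted above. This sketch uses only those numbers; `family` and `leader_and_gap` are hypothetical helpers, and collapsing each configuration to a model family is an assumption made here, since the effort settings (xhigh, max, medium) differ between the two benchmarks.

```python
# Scores quoted in the comment above, in percent.
SCORES = {
    "DeepSWE": {"gpt-5.5-xhigh": 67.0, "claude-opus-4.8-max": 59.0},
    "FrontierCode": {"claude-opus-4.8-xhigh": 13.4, "gpt-5.5-medium": 6.3},
}

def family(config_name):
    """Collapse 'gpt-5.5-xhigh' to 'gpt-5.5'.
    Assumes the last dash-separated token is the effort setting."""
    return config_name.rsplit("-", 1)[0]

def leader_and_gap(results):
    """Return (leading model family, gap to the runner-up in percentage points)."""
    ranked = sorted(results.items(), key=lambda kv: kv[1], reverse=True)
    (top, top_score), (_, runner_up_score) = ranked[0], ranked[1]
    return family(top), round(top_score - runner_up_score, 1)

summary = {bench: leader_and_gap(results) for bench, results in SCORES.items()}
# DeepSWE is led by gpt-5.5 by 8.0 pp; FrontierCode is led by claude-opus-4.8 by 7.1 pp.
reversed_ranking = summary["DeepSWE"][0] != summary["FrontierCode"][0]
```

Note that the gaps are in percentage points, not relative terms: on FrontierCode the leader's score is more than double the runner-up's even though the absolute gap is slightly smaller.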

Now, one might wish to claim that either DeepSWE or FrontierCode is poorly constructed and that the other is more accurate. But I think you'll find that the degree to which eval-design considerations in this case affect measurement is probably of no less magnitude than user-specific considerations affect measurement in general.

[0] https://deepswe.datacurve.ai/ [1] https://cognition.com/blog/frontier-code

by thomasliao

6/24/2026 at 11:08:07 AM

It's not just figuring out if a model is good at things, but is it good at the things I care about.

Using a targeted eval suite (like a test suite) tells us that.

by unchar1

6/24/2026 at 10:03:02 AM

It's not just for choice of model, you can use it for your prompting as well (basically anything to do with your setup). And yes, running evals is expensive and mostly of use to people with serious spend.

by moomin

6/24/2026 at 12:36:06 PM

They all change day to day and are non-deterministic by design. Your settled answer is only good for a moment.

by liveoneggs

6/24/2026 at 11:47:20 AM

But frontier models are constantly changing.

by lupire

6/24/2026 at 11:50:48 AM

To complement the excellent answers that I read in this thread: an eval is a test.

What makes it particular for the case of AI is:

- there are many situations where you can’t test using pattern matching

- you don’t only want to test for correct answers, but for voice and tone too (imagine a bank support LLM-based chatbot that answers using slang)

- evals can be used to compare the performance of different implementations; given the costs of LLMs, it’s very important

- running evals is more expensive than running standard tests, because you rely on the LLM calls under test, and often an LLM is also used as the judge. That means running them on every commit in your CI/CD is very expensive

- Knowing all the possible inputs for the LLM is impossible, so evals can also be run on runtime samples to detect anomalies
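The last two bullets suggest checking only a sample of runtime outputs, with cheap structural checks first. A rough sketch, assuming the application expects a JSON object with two required keys: `REQUIRED_KEYS`, the request ids, and the logged outputs are all hypothetical, and only the standard-library `json` and `hashlib` modules are used.

```python
import hashlib
import json

REQUIRED_KEYS = {"intent", "reply"}  # hypothetical output spec

def in_sample(request_id, rate):
    """Deterministically select roughly `rate` of requests by hashing the id,
    so the same request is always in or out of the sample."""
    digest = hashlib.sha256(request_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") < rate * 2**64

def out_of_spec(raw_output):
    """True if the output is not a JSON object containing every required key."""
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        return True
    return not (isinstance(parsed, dict) and REQUIRED_KEYS.issubset(parsed))

def anomaly_rate(logged, rate=1.0):
    """Fraction of sampled outputs that are out of spec, or None if nothing was sampled."""
    sampled = [output for request_id, output in logged if in_sample(request_id, rate)]
    if not sampled:
        return None
    return sum(out_of_spec(o) for o in sampled) / len(sampled)

# Hypothetical logged (request_id, raw model output) pairs.
LOGGED = [
    ("req-1", '{"intent": "balance", "reply": "Your balance is available in the app."}'),
    ("req-2", '{"intent": "balance"}'),
    ("req-3", "Sure! Here is the JSON you asked for"),
    ("req-4", '{"intent": "card", "reply": "I have frozen your card."}'),
]
rate_all = anomaly_rate(LOGGED, rate=1.0)  # two of four outputs are out of spec
```

A check like this costs nothing per call, so it can run on a large sample; the expensive LLM-as-judge grading for tone or correctness can then be reserved for a much smaller slice.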

by diegof79

6/24/2026 at 10:38:15 AM

IMHO - In an AI context an "eval" is answering the question - "Is this AI / LLM call helping me, or is it doing the right thing?"

AI is not deterministic like regular code, so imagine you use it for "search" (RAG) or for summarizing or for classifying emails etc. How do you know it is giving you the right results? In this context, AI evals are an important idea and very often neglected.

You can use an initial "dataset" to evaluate your prompt and AI calls + code (think test cases); this dataset will of course be curated by humans. But as the software is used, you want to incorporate real production data as well and run the evaluation pre- and post-launch. Sounds simple, but it can get complicated, especially since this area is new and, as the post mentioned, there are too many players and options out there (since everyone thought this was a money maker).

by rockyj

6/24/2026 at 6:18:01 PM

The current way benchmarks are done and are accepted by the community makes for really uninspired work. Until we're willing to break out of this rigid evaluation format prone to crazy overfitting and gaming, talent will move elsewhere. It is kind of a chicken and egg problem though.

by dippogriff

6/24/2026 at 1:59:24 PM

"eval startups have a hard time finding customers, because clients have to be technical developers who want to build with APIs, but also not technical enough to run their own evals"

To add to this, they have to be developers who aren't already using a full-stack observability solution, since it's fairly straightforward to add the eval startup feature set to an existing observability solution, and easier (plus more cost-effective) to just keep it all in one place.

by dbish

6/24/2026 at 7:22:21 PM

Interesting. As someone who has written a fair amount of evals to test and benchmark my skills and tools, I am curious what observability solutions you might be referring to? /me heads off to ddg…

by steinnes

6/24/2026 at 4:13:54 PM

The way "eval startup" is defined here is very specific and doesn't cover successful eval framework/SaaS vendors like Arize, Promptfoo, deepeval, etc.

The author does have a point around generic benchmarks not being super valuable for companies. But evals should be seen as verifying design/behaviour constraints and can greatly aid product building, golden dataset creations and good software practices.

It's just that the aim should be "how to generate your own good evals, even if it's hard" and not so much "here's some generic evals about models".

by alexhans

6/24/2026 at 12:40:12 PM

Unfortunately, model quality is not the only criterion for users, and often not even the most important one. Adoption is also driven by marketing, UX, integrations, pricing, ecosystem, and a lot of other non-benchmark factors.

Also, model providers are not interested in having their models compared head-to-head under identical conditions. And “Model A is better than Model B” is almost meaningless by itself. Better for what task? With what prompt? What inputs? What budget? What failure tolerance?

It would be nice to have a place where users could run their own benchmarks, define evaluation criteria for their actual use cases, and make those runs verifiable by others.

by PashaGo

6/24/2026 at 1:43:36 PM

I’m convinced the only way to make a startup work, with a few exceptions, is to give away your product or sell below cost.

For years upon years until you get bought out. Then it’s someone else’s problem. Or you IPO and bring in new management to figure out how to make money.

VCs don’t see 20x exits happening for eval companies, so they have trouble with the losing-money-for-years step.

by 999900000999

6/24/2026 at 10:24:50 AM

I think there's gonna be (or perhaps already is) a huge demand for evaling individual systems. Many countries are starting to adopt some criteria for LLM usage for public use, and I doubt govs are gonna develop in-house knowhow for this. These will likely form some kind of "independent auditor" model, as the system provider has too strong a conflict of interest.

It's probably not gonna be exactly glorious work, but designing expert eval settings and collecting and crunching the data for quality assurance and control is going to be needed.

by jampekka

6/24/2026 at 8:59:26 AM

The problem with evals is that the information doesn't change fast enough to make you want the latest model performance benchmarks. Bloomberg succeeded because it sells info that expires in the next hour.

by GL26

6/24/2026 at 10:14:11 AM

Imo it's very simple - AI is a big function inverter. If you have a better cost function than frontier labs, as in, you are better at judging model output quality, then you can use that cost function to RL the next generation of models.

Therefore your knowledge is better used in training than letting users be slightly better at the token casino. Which is mentioned in this post as well, eval startup people either go to work at frontier labs or finetune startups.

by torginus

6/24/2026 at 11:22:00 AM

Maybe it's not that valuable? No snark, but how much confidence do these evals provide?

by nilirl

6/24/2026 at 11:38:03 AM

Exactly this. I find most eval companies get torn in multiple directions and do not end up putting out useful data. Probably genuine value as a B2B/consulting style service but that quickly falls out of being a pure eval company.

by alansaber

6/24/2026 at 9:02:40 AM

If you look at the history of software engineering, the ones that made the most money were usually not the companies that built the applications themselves, but the ones that built the tools to verify, deploy, and build them, such as CI/CD, static analysis tools, and testing frameworks.

Personally, I agree with the Goodhart problem, but isn't the reason Eval startups fail because they try to sell an 'evaluation service' rather than a 'verification toolchain'? The problem, it seems, is that AI verification toolchains require a model in the end, because they internalize AI and sell it under the name of a 'harness.'

So an AI verification (eval) toolchain would have to be structurally different. Verifying AI code isn't about whether it compiles. AI code can always be made to compile. The issue involves various semantic criticisms, such as overfitting to existing designs and tests. To catch those issues, you ultimately need to build an AI. But building that AI is expensive. So in the end, AI verification companies depend on external model providers for the core components of their verification engine. I think this is a bad business decision.

by jdw64

6/24/2026 at 9:07:59 AM

> made the most money

> built the tools to verify, deploy, and build them, such as CI/CD, static analysis tools, and testing frameworks.

Curious. Which company made money with testing frameworks?

by whinvik

6/24/2026 at 9:12:08 AM

I thought about mentioning Atlassian (Jira) and JetBrains, but come to think of it, they aren't really testing frameworks. They cover the entire development workflow overall. I guess I was thinking too short.

by jdw64

6/24/2026 at 10:23:33 AM

The "shovels for gold miners" analogy is generally a good one. It applies to Nvidia, for example. It doesn't generally apply to developers though. Developer tooling is notoriously difficult to monetize. Developers themselves are a shovel.

by noelwelsh

6/24/2026 at 12:52:57 PM

Devs are hard to market and sell to, I've heard. It's likely because they can build a lot of the stuff out there themselves when pressed. They have the most app exposure, so they are opinionated. It's why most devs take the open source spoils while everyone else avoids GitHub in general. Although AI has made it easy to set up locally, many still don't see the value of fully controlling their software or AI agents the way devs do.

by brandensilva

6/24/2026 at 10:48:11 AM

Worked or tried to work for a few places that ended eval work in the 2010s for previous-gen systems. Most didn’t pay for it, thanks to all the ones that didn’t I didn’t dare try selling it to the one that would have.

by PaulHoule

6/24/2026 at 10:52:49 AM

evals are glorified integration tests, would you invest in an integration test startup? absolutely not. I don't get why we are making all of this fuss around evals

by h1fra

6/24/2026 at 11:11:54 AM

Because what people actually want is a simple harness to test their use cases against all the frontier models and see which is the cheapest/best for the job.

It's simple to say but hard to master doing well, and the important thing is that no matter what tool you have the evals don't write themselves.

by hilariously

6/24/2026 at 12:30:30 PM

There are a number of integration test startups. None of them do a great job but they do exist.

by pydry

6/24/2026 at 8:56:05 AM

Everything eventually fails. Nothing is constant, not even evals.

by bitlad

6/24/2026 at 9:24:26 AM

Except regex, no matter how technologically advanced your company, somewhere someone is slapping regex on something that has no business being regexed.

by Etheryte

6/24/2026 at 9:26:52 AM

And llms seeing this keep on repeating that mistake, like trying to parse html with regexp.

by Asmod4n

6/24/2026 at 9:27:45 AM

You're in a business, and you think, to improve things I'm going to slap a regex on this. Now you're in two businesses.

by bryanrasmussen

6/24/2026 at 9:24:43 AM

> Not enough eval customers

Aha.

by wseqyrku

6/24/2026 at 10:24:56 AM

Because they operate on untrusted input

by coldtea

6/24/2026 at 2:32:32 PM

I found this pretty hard to read, as the author has a very specific understanding of what an eval startup means, but it is only implied rather than explicitly described. I would have thought that they were referring to the companies that provide a technology platform to enable you to do evals in an AI application context, for example companies like Comet/Opik and Braintrust.

But it sounds like the author does not mean those companies at all since those are actually important in enabling the very Venn diagram he/she describes.

Based on what I assume the author's referring to, they mean something more like a public benchmark report provider... I would say yes, that's a relatively small total addressable market no matter how you look at it.

by redwood

6/24/2026 at 4:07:50 PM

Funnily enough, this made immediate sense to me, and I think that comes from being in a situation where you need high reliability from a process, e.g. I need a bot which has a 99.99% guarantee to not go out of bounds or say something incorrect.

by intended

6/24/2026 at 11:01:16 AM

[dead]

by gunaclksy