Without benchmarking LLMs, you're likely overpaying

1/20/2026 at 9:14:10 PM

Anecdotal tip on LLM-as-judge scoring - Skip the 1-10 scale, use boolean criteria instead, then weight manually e.g.

- Did it cite the 30-day return policy? Y/N
- Tone professional and empathetic? Y/N
- Offered clear next steps? Y/N

Then: 0.5 * accuracy + 0.3 * tone + 0.2 * next_steps

Why: Reduces volatility of responses while still maintaining creativeness (temperature) needed for good intuition
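
A minimal sketch of the weighting above, assuming the judge has already returned one Y/N verdict per criterion. The criterion names and the `weighted_score` helper are illustrative, not taken from any library:

```python
# Weights from the example above; the keys are hypothetical criterion names.
WEIGHTS = {"accuracy": 0.5, "tone": 0.3, "next_steps": 0.2}


def weighted_score(checks: dict[str, bool], weights: dict[str, float] = WEIGHTS) -> float:
    """Combine Y/N judge verdicts into a single score between 0 and 1."""
    missing = set(weights) - set(checks)
    if missing:
        raise ValueError(f"missing verdicts for: {sorted(missing)}")
    # True counts as 1.0 and False as 0.0, so a failed criterion simply
    # contributes nothing to the total.
    return sum(weight * float(checks[name]) for name, weight in weights.items())


# Example: the reply was accurate and well-toned but offered no next steps.
verdicts = {"accuracy": True, "tone": True, "next_steps": False}
score = weighted_score(verdicts)
```

Keeping the weights in plain code rather than in the judge prompt means they can be tuned later without re-running the judge.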

by hamiltont

1/20/2026 at 9:21:12 PM

I use this approach for a ticket based customer support agent. There are a bunch of boolean checks that the LLM must pass before its response is allowed through. Some are hard fails, others, like you brought up, are just a weighted ding to the response's final score.

Failures are fed back to the LLM so it can regenerate taking that feedback into account. People are much happier with it than I could have imagined, though it's definitely not cheap (but the cost difference is very OK for the tradeoff).
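
Roughly, the gate-and-regenerate loop described above could look like the sketch below. This is an assumption about the shape of such a system, not the actual implementation: `generate` and `judge` are hypothetical callables standing in for the model call and the boolean checks, and the threshold and attempt limit are placeholders.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Check:
    name: str
    weight: float = 0.0  # score penalty when a soft check fails
    hard: bool = False   # a failed hard check blocks the response outright


def evaluate(verdicts: dict[str, bool], checks: list[Check]) -> tuple[bool, float, list[str]]:
    """Return (blocked, score, names of the failed checks)."""
    # A check with no verdict is treated as failed.
    failed = [c for c in checks if not verdicts.get(c.name, False)]
    blocked = any(c.hard for c in failed)
    score = 1.0 - sum(c.weight for c in failed if not c.hard)
    return blocked, score, [c.name for c in failed]


def respond(
    ticket: str,
    generate: Callable[[str, list[str]], str],
    judge: Callable[[str, str], dict[str, bool]],
    checks: list[Check],
    min_score: float = 0.7,
    max_attempts: int = 3,
) -> Optional[str]:
    """Regenerate until the draft passes, feeding failed checks back in."""
    feedback: list[str] = []
    for _ in range(max_attempts):
        draft = generate(ticket, feedback)
        blocked, score, failed = evaluate(judge(ticket, draft), checks)
        if not blocked and score >= min_score:
            return draft
        feedback = failed  # the next attempt sees what went wrong
    return None  # nothing passed; hand the ticket to a human
```

Each retry is another generation plus another judging pass, which is where the extra cost comes from.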

by pocketarc

1/21/2026 at 6:22:25 PM

Funny, this move is exactly what YouTube did to their system of human-as-judge video scoring, which was a 1-5 scale before they made it thumbs up/thumbs down in 2010.

by tomjakubowski

1/21/2026 at 6:59:30 PM

I hate thumbs up/down. 2 values is too little. I understand that 5 was maybe too much, but thumbs up/down systems need an explicit third "eh, it's okay" value for things I don't hate, don't want to save to my library, but I would like the system to know I have an opinion on.

I know that consuming something and not thumbing it up/down sort-of does that, but it's a vague enough signal (that could also mean "not close enough to keyboard / remote to thumbs up/down") that recommendation systems can't count it as an explicit choice.

by jorvi

1/21/2026 at 7:24:04 PM

Here's the discussion from back in the day when this changed: https://news.ycombinator.com/item?id=837698

In practice, people generally didn't even vote with two options, they voted with one!

IIRC youtube did even get rid of downvotes for a while, as they were mostly used for brigading.

by steveklabnik

1/21/2026 at 9:05:24 PM

> IIRC youtube did even get rid of downvotes for a while, as they were mostly used for brigading.

No, they got rid of them most likely because advertisers complained that when they dropped some flop they got negative press from media going "lmao 90% dislike rate on new trailer of <X>".

Stuff disliked to oblivion was usually just straight-out bad or wrong (in the case of bad tutorials/info); brigading was a very tiny percentage of it.

by PunchyHamster

1/22/2026 at 12:22:40 AM

Oh, didn't they remove the dislike count after people absolutely annihilated one of their yearly rewinds with dislikes?

by rednafi

1/22/2026 at 6:26:55 PM

It was removed after some presidential speeches attracted heavy dislikes.

by direwolf20

1/22/2026 at 11:59:43 AM

The original sin is argued to be the Youtube Rewind 2018. But it took them until 2021 to roll it out.

by machomaster

1/22/2026 at 4:16:01 PM

Well, people annihilated every one of their rewinds with dislikes. But yeah, that might've contributed.

by PunchyHamster

1/22/2026 at 12:17:35 AM

YouTube never got rid of downvotes; they just hid the count. Channel admins can still see it and it still affects the algorithm.

by UltraSane

1/22/2026 at 1:34:21 AM

Youtube always kept downvotes and the 'dislike' button, the change (which still applies today) was that they stopped displaying the downvote count to users - the button never went away though.

Visit a youtube video today, you can still upvote and downvote with the exact same thumbs up or down, the site however only displays to you the count of upvotes. The channel owners/admins can still see the downvote count and the downvotes presumably still inform YouTube's algorithms.

by giobox

1/22/2026 at 12:02:24 PM

There is also an independent "Return Youtube Dislike" browser extension that shows the dislike numbers. It's very convenient.

by machomaster

1/22/2026 at 1:04:37 PM

That doesn't show the real number, only "a combination of scraped dislike stats and estimates extrapolated from extension user data."

by steveklabnik

1/22/2026 at 5:30:21 PM

I think that just the absence in official app and the existence of this tool makes this point largely irrelevant. Company in question could easily reverse this decision overnight as the data exist, but absent that people adjust to an available proxy estimate. It is interesting though, because it shows clear intent of "we don't want to show actual sentiment".

by iugtmkbdfil834

1/23/2026 at 12:55:43 AM

The official youtube stats (views, comments, upvotes) are not real/real-time either. But that's the best we have. And dislike numbers are in the same universe of credibility and closeness to reality. It's definitely good enough.

If you want downvote data to be more precise, do your part and install the extension! :-)

by machomaster

1/20/2026 at 10:47:19 PM

How come accuracy has only 50% weight?

“You’re absolutely right! Nice catch how I absolutely fooled you”

by piskov

1/20/2026 at 9:27:31 PM

Yes, absolutely. This aligns with what we found. It seems to be necessary to be very clear on scoring (at least for Opus 4.5).

by lorey

1/20/2026 at 9:22:54 PM

This actually seems like really good advice. I'm interested in how you might adapt this to things like programming language benchmarks.

By having independent tests and checking whether the output passes each one (yes or no), then weighting the more complicated tasks more heavily? Or how exactly?

by Imustaskforhelp

1/20/2026 at 9:35:49 PM

Not sure I'm fully following your question, but maybe this helps:

IME deep thinking has moved from upfront architecture to post-prototype analysis.

Pre-LLM: Think hard → design carefully → write deterministic code → minor debugging

With LLMs: Prototype fast → evaluate failures → think hard about prompts/task decomposition → iterate

When your system logic is probabilistic, you can't fully architect in advance—you need empirical feedback. So I spend most time analyzing failure cases: "this prompt generated X which failed because Y, how do I clarify requirements?" Often I use an LLM to help debug the LLM.

The shift: from "design away problems" to "evaluate into solutions."

by hamiltont

1/20/2026 at 9:33:49 PM

Isn’t this just rubrics?

by 46493168

1/20/2026 at 11:28:28 PM

It's a weighted decision matrix.

by 8note

1/20/2026 at 9:05:41 PM

Depends on what you’re doing. Using the smaller / cheaper LLMs will generally make it way more fragile. The article appears to focus on creating a benchmark dataset with real examples. For lots of applications, especially if you’re worried about people messing with it, about weird behavior on edge cases, about stability, you’d have to do a bunch of robustness testing as well, and bigger models will be better.

Another big problem is that it’s hard to set objectives in many cases; for example, maybe your customer service chat still passes but comes across worse with a smaller model.

I’d be careful, is all.

by andy99

1/20/2026 at 9:09:01 PM

One point in favor of smaller/self-hosted LLMs: more consistent performance, and you control your upgrade cadence, not the model providers.

I'd push everyone to self-host models (even if it's on a shared compute arrangement), as no enterprise I've worked with is prepared for the churn of keeping up with the hosted model release/deprecation cadence.

by candiddevmike

1/21/2026 at 8:01:12 PM

Where can I find success stories about self-hosting models? All of it seems like throwing tens of thousands away on compute for it to work worse than the standard providers. The self-hosted models seem to get out of date, too. Or there end up being good reasons (improved performance) to replace them.

by blharr

1/20/2026 at 9:22:59 PM

How much you value control is one part of the optimization problem. Obviously self-hosting gives you more control, but it costs more. And re: evals, I trust GPT, Gemini, and Claude a lot more than some smaller thing I self-host, and would end up wanting to do way more evals if I self-hosted a smaller model.

(Potentially interesting aside: I’d say I trust new GLM models similarly to the big 3, but they’re too big for most people to self host)

by andy99

1/20/2026 at 9:30:18 PM

You may also be getting a worse result for higher cost.

For a medical use case, we tested multiple Anthropic and OpenAI models as well as MedGemma. Pleasantly surprised when the LLM as Judge scored gpt5-mini as the clear winner. I don't think I would have considered using it for the specific use cases - assuming higher reasoning was necessary.

Still waiting on human evaluation to confirm the LLM Judge was correct.

by jmathai

1/20/2026 at 10:16:54 PM

That's interesting. Similarly, we found out that for very simple tasks the older Haiku models are interesting as they're cheaper than the latest Haiku models and often perform equally well.

by lorey

1/20/2026 at 9:52:19 PM

You obviously know what you’re looking for better than me, but personally I’d want to see a narrative that made sense before accepting that a smaller model somehow just performs better, even if the benchmarks say so. There may be such an explanation, but it feels very dicey without one.

by andy99

1/23/2026 at 10:38:23 AM

You just need a robust benchmark. As long as you understand your benchmark, you can trust the results.

We have a hard OCR problem.

It's very easy to make high-confidence benchmarks for OCR problems (just type out the ground truth by hand), so it's easy to trust the benchmark. Think accuracy and token F1. I'm talking about highly complex OCR that requires a heavyweight model.
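
For reference, a sketch of one common way to compute token F1 against hand-typed ground truth. The details here are assumptions rather than the metric actually used above: tokens are split on whitespace and matched as a bag of words, SQuAD-style, with no case or punctuation normalization.

```python
from collections import Counter


def token_f1(prediction: str, truth: str) -> float:
    """Bag-of-tokens F1 between model output and the ground-truth text."""
    pred_tokens = prediction.split()
    true_tokens = truth.split()
    if not pred_tokens or not true_tokens:
        # Both empty counts as a perfect match; one empty is a total miss.
        return float(pred_tokens == true_tokens)
    # Multiset intersection: each token counts at most as often as it
    # appears in both texts.
    overlap = sum((Counter(pred_tokens) & Counter(true_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(true_tokens)
    return 2 * precision * recall / (precision + recall)
```

Because this ignores word order, pairing it with exact-match accuracy catches outputs that contain the right tokens in the wrong arrangement.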

Scout (Meta), a very small/weak model, is outperforming Gemini Flash. This is highly unexpected and a huge cost savings.

Some problems aren't so easily benchmarked.

by vercaemert

1/21/2026 at 8:12:18 PM

Volume and statistical significance? I'm not sure what kind of narrative I would trust beyond the actual data.

It's the hard part of using LLMs and a mistake I think many people make. The only way to really understand or know is to have repeatable and consistent frameworks to validate your hypothesis (or in my case, have my hypothesis be proved wrong).

You can't get to 100% confidence with LLMs.
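
One way to put a number on "statistical significance" when two models are run on the same examples is an exact sign test on the cases where they disagree (the exact form of McNemar's test). The sketch below assumes paired pass/fail results per example; the function name is made up for illustration.

```python
from math import comb


def sign_test_p_value(model_a: list[bool], model_b: list[bool]) -> float:
    """Two-sided exact sign test on paired pass/fail results."""
    if len(model_a) != len(model_b):
        raise ValueError("both models must be scored on the same examples")
    # Only examples where exactly one model passed carry information.
    a_only = sum(a and not b for a, b in zip(model_a, model_b))
    b_only = sum(b and not a for a, b in zip(model_a, model_b))
    n = a_only + b_only
    if n == 0:
        return 1.0  # the models never disagree
    k = min(a_only, b_only)
    # Probability of a split at least this lopsided if each disagreement
    # were a fair coin flip, doubled for the two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2 * tail)
```

A small p-value only says the difference is unlikely to be noise on this dataset; it says nothing about whether the dataset resembles production traffic.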

by jmathai

1/20/2026 at 9:30:54 PM

You're right. We did a few use cases and I have to admit that while customer service is easiest to explain, it's where I'd also not choose the cheapest model for said reasons.

by lorey

1/20/2026 at 8:46:54 PM

I'd second this wholeheartedly

Since building a custom agent setup to replace copilot, adopting/adjusting Claude Code prompts, and giving it basic tools, gemini-3-flash is my go-to model unless I know it's a big and involved task. The model is really good at 1/10 the cost of pro, super fast by comparison, and some basic a/b testing shows little to no difference in output on the majority of tasks I used

Cut all my subs, spend less money, don't get rate limited

by verdverm

1/20/2026 at 8:56:44 PM

Yeah, on one of my first projects, one of my buddies asked "Why aren't you using [ChatGPT 4.0] nano? It's 99% the effectiveness with 10% the price."

I've been using the smaller models ever since. Nano/mini, flash, etc.

by dpoloncsak

1/20/2026 at 9:12:51 PM

Yup.

I have found out recently that Grok-4.1-fast has similar pricing (in cents) but 10x larger context window (2M tokens instead of ~128-200k of gpt-4-1-nano). And ~4% hallucination, lowest in blind tests in LLM arena.

by sixtyj

1/20/2026 at 10:02:24 PM

You use stuff from xAi and Elmo?

I'm unwilling to look past Musk's politics, immorality, and manipulation on a global scale

by verdverm

1/20/2026 at 10:18:09 PM

Grok is the best general purpose LLM in my experience. Only Gemini is comparable. It would be silly to ignore it, and xAI is less evil than Google these days.

by rudhdb773b

1/22/2026 at 4:11:18 AM

When's the last time Sundar Pichai did a Hitler salute or had his creation calling itself "Mecha Hitler"?

by naught0

1/22/2026 at 3:25:04 PM

In the big picture, those events are insignificant compared to the negative impacts on society from Google's trillion dollar advertising business and the associated destruction of privacy.

by rudhdb773b

1/23/2026 at 3:26:23 AM

fair points, but we'll have to see now that grok is in the pentagon. sky's the limit

by naught0

1/20/2026 at 10:22:31 PM

[flagged]

by verdverm

1/20/2026 at 9:03:50 PM

I have been benchmarking many of my use cases, and the GPT Nano models have fallen completely flat on every single one except for very short summaries. I would call them 25% effectiveness at best.

by phainopepla2

1/20/2026 at 10:23:34 PM

Flash is not a small model, it's still over 1T parameters. It's a hyper MoE aiui

I have yet to go back to small models, waiting for the upstream feature / GPU provider has been seeing capacity issues, so I am sticking with the gemini family for now

by verdverm

1/20/2026 at 9:04:59 PM

Flash Lite 2.5 is an unbelievably good model for the price

by walthamstow

1/20/2026 at 8:50:55 PM

Plus I've found that overall with "thinking" models, it's more like for memory, not even actual perf boost, it might even be worse because if it goes even slightly wrong on the "thinking" part, it'll then commit to that for the actual response

by r_lee

1/20/2026 at 10:03:51 PM

for sure, the difference in the most recent model generations makes them far more useful for many daily tasks. This is the first gen with thinking as a significant mid-training focus and it shows

gemini-3-flash stands well above gemini-2.5-pro

by verdverm

1/21/2026 at 9:06:52 PM

The LLM bubble will burst the second investors figure out how much a well-managed local model can do.

by PunchyHamster

1/22/2026 at 4:54:57 AM

Except that

1. There is still night and day difference

2. Local is slow af

3. The vast majority of people will not run their own models

4. I would have to spend more than $200 a month on frontier AI to come close to the same price it would cost for any decent AI-at-home rig. Why would I not use frontier models at this point?

by verdverm

1/20/2026 at 8:52:57 PM

[dead]

by dingnuts

1/22/2026 at 8:19:33 AM

All true, but from what I see in the field it is most often an "ain't nobody got time for that" as teams rush into adoption, costs be damned for now. We'll deal with it only if cost becomes a major issue.

by PeterStuer

1/22/2026 at 1:00:00 PM

Haha, very true. Exactly as described in the article.

by lorey

1/20/2026 at 9:10:48 PM

Wow, this was some slick long form sales work. I hope your SaaS goes well. Nice one!

by gridspy

1/21/2026 at 6:08:59 PM

I love the user experience for your product. You're giving a free demo with results within 5 minutes and then encourage the customer to "sign in" for more than 10 prompts.

Presumably that'll be some sort of funnel for a paid upload of prompts.

by iFire

1/23/2026 at 8:48:12 AM

Wow - interesting how strong the differences are!

What seems to be missing: I cannot see the answers from the different models. One has to rely on the "correctness" score.

Another minor thing: the scoring seems hardcoded to 50% correctness, 30% cost, 20% latency, which is OK, but in my case I care more about correctness and don't care about latency.

Wow! This was my testprompt:

  You are an expert linguist and translator engine.  
  Task: Translate the input text from English into the languages listed below.  
  Output Format: Return ONLY a valid, raw JSON object.  
  Do not use Markdown formatting (no ```json code blocks).  
  Do not add any conversational text.
  
  Keys: Use the specified ISO 639-1 codes as keys.
  
  Target Languages and Codes:  
  - English: "en" (Keep original or refine slightly)  
  - Mandarin Chinese (Simplified): "zh"  
  - Hindi: "hi"  
  - Spanish: "es"  
  - French: "fr"  
  - Arabic: "ar"  
  - Bengali: "bn"  
  - Portuguese: "pt"  
  - Russian: "ru"  
  - German: "de"  
  - Urdu: "ur"
  
  Input text to translate:  
  "A smiling boy holds a cup as three colorful lorikeets perch on his arms and shoulder in an outdoor aviary."
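
A prompt like this lends itself to cheap boolean checks before any judge model gets involved. The sketch below covers only the format rules (raw JSON, exactly the listed keys, non-empty strings), not translation quality; `check_translation_output` is a made-up helper name.

```python
import json

# ISO 639-1 codes listed in the prompt above.
EXPECTED_KEYS = {"en", "zh", "hi", "es", "fr", "ar", "bn", "pt", "ru", "de", "ur"}


def check_translation_output(raw: str) -> list[str]:
    """Return a list of format problems; an empty list means the output passed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        # Markdown fences or conversational text around the JSON end up here.
        return ["not valid raw JSON"]
    if not isinstance(data, dict):
        return ["top-level value is not a JSON object"]
    problems = []
    keys = set(data)
    missing = sorted(EXPECTED_KEYS - keys)
    extra = sorted(keys - EXPECTED_KEYS)
    if missing:
        problems.append(f"missing keys: {missing}")
    if extra:
        problems.append(f"unexpected keys: {extra}")
    empty = sorted(
        key for key in EXPECTED_KEYS & keys
        if not isinstance(data[key], str) or not data[key].strip()
    )
    if empty:
        problems.append(f"empty or non-string values: {empty}")
    return problems
```

Checks like these are deterministic and free to run, so they can act as the hard-fail layer in front of a more expensive correctness score.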

by gforce_de

1/21/2026 at 6:34:06 PM

https://evalry.com/question-benchmarks/game-engine-assistant...

Here's a bug report, by switching the model group the api hangs in private mode.

by iFire

1/21/2026 at 6:44:35 PM

Headsup I think I broke the site.

by iFire

1/21/2026 at 6:48:39 PM

It's not you, it's the HN hug of death. There's so much load on the server, I'm barely able to download the redis image I need for caching...

by lorey

1/21/2026 at 6:44:17 PM

Thanks. Will take a look.

by lorey

1/21/2026 at 6:36:54 PM

I’m also collecting the data on my side in the hope of using it to fine-tune a tiny model later. Unsure whether it’ll work, but if I’m using APIs anyway I may as well gather it and try to bottle some of that magic of using bigger models.

by Havoc

1/22/2026 at 11:56:14 AM

This is useful when selecting a model for an initial application. The main issue I'm concerned about though is ongoing testing. At work we have devs slinging prompt changes left and right into prod, after "it works on my machine" local testing. It's like saying the word "AI" is sufficient to get rid of all engineering knowledge.

Where is TDD for prompt engineering? Does it exist already?
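
For what it's worth, the closest thing to a unit test here is probably a small golden set with a pass-rate threshold that runs before a prompt change ships. A rough sketch, with everything in it hypothetical: the cases, the 0.9 threshold, and `fake_model`, which stands in for the real prompt-plus-model call so the example runs offline.

```python
from typing import Callable

# Hypothetical golden cases: an input plus a predicate its output must satisfy.
GOLDEN_CASES: list[tuple[str, Callable[[str], bool]]] = [
    ("Can I return shoes after 10 days?", lambda out: "30-day" in out),
    ("Where is my order?", lambda out: "tracking" in out.lower()),
]


def pass_rate(
    run_prompt: Callable[[str], str],
    cases: list[tuple[str, Callable[[str], bool]]],
) -> float:
    """Fraction of golden cases whose output satisfies its predicate."""
    if not cases:
        raise ValueError("need at least one golden case")
    passed = sum(1 for text, ok in cases if ok(run_prompt(text)))
    return passed / len(cases)


def fake_model(text: str) -> str:
    """Stand-in for the real prompt + model call."""
    if "return" in text:
        return "Yes, our 30-day return policy covers that."
    return "Here is your tracking link."


def test_prompt_meets_threshold() -> None:
    # Collected by pytest if this lives in a test file; can also be called directly.
    assert pass_rate(fake_model, GOLDEN_CASES) >= 0.9
```

Since model output is not deterministic, a threshold over many cases tends to be more workable than demanding that every single case pass on every run.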

by xmcqdpt2

1/22/2026 at 12:55:01 PM

This is a very good point. When I came in, the founder did a lot of evaluation based on a few prompts and with manual evaluation, exactly as described. Showing the results helped me underline the fact that "works for me" (tm) does not match the actual data in many cases.

by lorey

1/22/2026 at 11:58:33 AM

Evals have always existed, and not using them when building systems is relying on superstition.

by cap11235

1/22/2026 at 12:58:59 PM

This is true with one caveat.

In most cases, e.g. with regular ML, evals are easy and not doing them results in inferior performance. With LLMs, especially frontier LLMs, this has flipped. Not doing them will likely give you alright performance, and at the same time proper benchmarks are tricky to implement.

by lorey

1/21/2026 at 6:32:05 PM

I paid a total of 13 US Dollars for all my llm usage in about 3 years. Should I analyze my providers and see if there's room for improvement?

by dizhn

1/21/2026 at 8:28:44 PM

How? All LLM-as-a-Service offerings are prohibitively expensive for me. $13 over 3 years sounds too-good-to-be-true.

by regenschutz

1/21/2026 at 9:53:52 PM

All local CLIs with free to use models. CLIs are opencode, iflow, qwen, gemini.

What I did splurge on was brief openai access for some subtitle translator program and when I used the deepseek api. Actually I think that $13 includes some as yet unused credits. :D

I'd be happy to provide details if CLIs are an option and you don't mind some sweatshop agent. :)

(I am just now noticing I meant to type 2 years not 3 above. Sorry about that.)

by dizhn

1/21/2026 at 6:39:19 PM

Depends on your remaining budget ;)

by lorey

1/21/2026 at 10:14:06 PM

That is absolutely right. :)

by dizhn

1/21/2026 at 8:52:07 PM

I'm consistently amazed at how much some individuals spend on LLMs.

I get a good amount of non-agentic use out of them, and pay literally less than $1/month for GLM-4.7 on deepinfra.

I can imagine my costs might rise to $20-ish/month if I used that model for agentic tasks... still a very far cry from the $1000-$1500 some spend.

by wolttam

1/21/2026 at 9:35:56 PM

Doesn't this depend a lot on private vs company usage? There's no way I could spend more than a few hundred on my own, but when you run prompts on 1M entities in some corporate use case, this will incur costs, no matter how cheap the model usage.

by lorey

1/20/2026 at 10:37:25 PM

I do not disagree with the post, but I am surprised that a post that is basically explaining very basic dataset construction is so high up here. But I guess most people just read the headline?

by matusp

1/21/2026 at 6:01:17 PM

> it's the default: You have the API already

Sorry, this just makes no sense to start off with. What do you mean?

by tantalor

1/21/2026 at 6:06:42 PM

Fixed, thanks. Not a native speaker.

by lorey

1/20/2026 at 9:18:16 PM

This is just evaluation, not “benchmarking”. If you haven’t set up evaluation on something you’re putting into production, then what are you even doing?

Stop prompt engineering, put down the crayons. Statistical model outputs need to be evaluated.

by deepsquirrelnet

1/20/2026 at 9:24:07 PM

What does that look like in your opinion, what do you use?

by andy99

1/20/2026 at 9:34:06 PM

This went straight to prod, even earlier than I'd opted for. What do you mean?

by lorey

1/20/2026 at 10:35:23 PM

I’m totally in alignment with your blog post (other than terminology). I meant it more as a plea to all these projects that are trying to go into production without any measures of performance behind them.

It’s shocking to me how often it happens. Aside from just the necessity to be able to prove something works, there are so many other benefits.

Cost and model commoditization are part of it like you point out. There’s also the potential for degraded performance because off-the-shelf benchmarks aren’t generalizing the way you expect. Add to that an inability to migrate to newer models as they come out, potentially leaving performance on the table. There’s like 95 serverless models in bedrock now, and as soon as you can evaluate them on your task they immediately become a commodity.

But fundamentally you can’t even justify any time spent on prompt engineering if you don’t have a framework to evaluate changes.

Evaluation has been a critical practice in machine learning for years. IMO it is no less imperative when building with LLMs.

by deepsquirrelnet

1/21/2026 at 4:11:45 PM

Aren't you supposed to customize the prompts to the specific models?

by ebla

1/21/2026 at 6:24:08 PM

I've skipped that in the article, but absolutely!

by lorey

1/20/2026 at 9:42:33 PM

You don't need a fancy UI to try the mini model first.

by OutOfHere

1/22/2026 at 1:00:40 PM

That is not what the article argues.

by lorey

1/20/2026 at 8:45:16 PM

> He's a non-technical founder building an AI-powered business.

It sounds like he's building some kind of ai support chat bot.

I despise these things.

by petcat

1/20/2026 at 9:06:56 PM

The whole post is just an advert for this person's startup. Their "friend" doesn't exist...

by montroser

1/20/2026 at 9:36:40 PM

Totally agree with your point. While I can't say specifically, it's a traditional (German) business he's doing vertically integrated with AI. Customer support is really bad in this traditional niche and by leveraging AI on top of doing the support himself 24/7, he was able to make it his competitive edge.

by lorey

1/20/2026 at 8:52:59 PM

And the whole article is about promoting his benchmarking service, of course.

by r_lee

1/20/2026 at 8:48:27 PM

[flagged]

by njhnjh

1/20/2026 at 8:56:10 PM

It's perfectly possible it's someone with deep domain experience, or someone who has product design or management skills. Regardless, dismissing these people out of pocket is not likely the best choice.

by sullivanmatt

1/20/2026 at 9:27:50 PM

ah yes... nothing like using another nondeterministic black box of nonsense to judge / rate the output of another.. then charge others for it.. lol

by nickphx

1/20/2026 at 9:46:46 PM

Amazon Bedrock Guardrails uses a purpose-built model to look for safety issues in the model inputs/outputs. While you won't get any specific guarantees from AWS, they will point you at datasets that you can use to evaluate the product and then determine if it's fit for purpose according to your risk tolerance.

by coredog64

1/20/2026 at 9:07:11 PM

The author of this post should benchmark his own blog for accessibility metrics, text contrast is dreadful..

On the other hand, this would be interesting for measuring agents in coding tasks, but there's quite a lot of context to provide here, both input and output would be massive.

by epolanski

1/20/2026 at 10:02:40 PM

Pushed a fix. Could you check, please?

Any resources you can recommend to properly tackle this going forward?

by lorey

1/20/2026 at 9:32:15 PM

Appreciate the feedback, will work on that.

by lorey

1/20/2026 at 11:10:23 PM

Do you have any insights on the platform evaluation for coding tasks?

by epolanski

1/20/2026 at 9:41:51 PM

One more vote on fixing contrast from me.

by faeyanpiraat

1/20/2026 at 9:43:48 PM

Will fix, thanks :)

by lorey

1/20/2026 at 10:02:19 PM

Tried Evalry, it's a really nice concept, thanks for sharing it!

by faeyanpiraat