1/20/2026 at 9:14:10 PM
Anecdotal tip on LLM-as-judge scoring: skip the 1-10 scale and use boolean criteria instead, then weight them manually, e.g.
- Did it cite the 30-day return policy? Y/N
- Tone professional and empathetic? Y/N
- Offered clear next steps? Y/N
Then: 0.5 * accuracy + 0.3 * tone + 0.2 * next_steps
Why: It reduces the volatility of responses while keeping the creativity (temperature) needed for good intuition.
by hamiltont
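A minimal sketch of the weighted boolean scoring described in the comment above. The weights come from the formula in that comment; the function name, the dict layout, and the example verdicts are assumptions made for illustration, and the Y/N verdicts themselves would come from whatever judge model you use.

```python
# Weights from the comment's formula: 0.5 * accuracy + 0.3 * tone + 0.2 * next_steps
WEIGHTS = {"accuracy": 0.5, "tone": 0.3, "next_steps": 0.2}


def weighted_score(checks: dict[str, bool], weights: dict[str, float] = WEIGHTS) -> float:
    """Combine Y/N judge verdicts into one score between 0 and 1."""
    missing = set(weights) - set(checks)
    if missing:
        # Refuse to score silently when the judge skipped a criterion.
        raise ValueError(f"missing verdicts for: {sorted(missing)}")
    # A passed check contributes its full weight, a failed one contributes nothing.
    return sum(weight for name, weight in weights.items() if checks[name])


# Hypothetical judge output: accurate, wrong tone, clear next steps.
verdicts = {"accuracy": True, "tone": False, "next_steps": True}
print(f"{weighted_score(verdicts):.2f}")  # prints 0.70
```

Keeping the weights outside the judge prompt means the judge only answers yes/no questions, and the weighting can be tuned later without re-running any evaluations.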
1/20/2026 at 9:21:12 PM
I use this approach for a ticket-based customer support agent. There are a bunch of boolean checks that the LLM must pass before its response is allowed through. Some are hard fails; others, like you brought up, are just a weighted ding to the response's final score. Failures are fed back to the LLM so it can regenerate taking that feedback into account. People are much happier with it than I could have imagined, though it's definitely not cheap (but the cost difference is very OK for the tradeoff).
by pocketarc
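A rough sketch of the gate described in the comment above: hard-fail checks, weighted dings for soft checks, and regeneration with the failures fed back. Everything here is an assumption for illustration, not the commenter's actual system: the `Check` fields, the 0.7 threshold, the three-attempt limit, and the `fake_generate` / `fake_judge` stand-ins that replace real LLM calls.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Check:
    name: str
    passed: bool
    hard: bool = False    # a failed hard check blocks the response outright
    weight: float = 0.0   # a failed soft check subtracts this from the score
    feedback: str = ""    # message handed back to the generator on failure


def evaluate(checks: list[Check], threshold: float = 0.7) -> tuple[bool, float, list[str]]:
    """Return (accepted, score, feedback messages for every failed check)."""
    failures = [c for c in checks if not c.passed]
    score = 1.0 - sum(c.weight for c in failures if not c.hard)
    accepted = not any(c.hard for c in failures) and score >= threshold
    return accepted, score, [c.feedback or c.name for c in failures]


def respond(
    generate: Callable[[list[str]], str],
    judge: Callable[[str], list[Check]],
    max_attempts: int = 3,
) -> Optional[str]:
    """Regenerate until the checks pass, feeding failures back each time."""
    feedback: list[str] = []
    for _ in range(max_attempts):
        draft = generate(feedback)
        accepted, _score, feedback = evaluate(judge(draft))
        if accepted:
            return draft
    return None  # out of attempts; the caller decides what happens next


def fake_generate(feedback: list[str]) -> str:
    # Stand-in for an LLM call that improves once it receives feedback.
    return "You can return it within 30 days." if feedback else "Sorry about that."


def fake_judge(draft: str) -> list[Check]:
    # Stand-in for LLM-as-judge boolean checks.
    return [
        Check("cites_policy", "30 days" in draft, hard=True,
              feedback="Cite the 30-day return policy."),
        Check("next_steps", "return" in draft, weight=0.2,
              feedback="Offer clear next steps."),
    ]


print(respond(fake_generate, fake_judge))  # prints: You can return it within 30 days.
```

Each retry costs one more generation plus one more round of judging, which fits the comment's note that the approach is not cheap.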
1/21/2026 at 6:22:25 PM
Funny, this move is exactly what YouTube did to their system of human-as-judge video scoring, which was a 1-5 scale before they made it thumbs up/thumbs down in 2010.
by tomjakubowski
1/21/2026 at 6:59:30 PM
I hate thumbs up/down. Two values are too few. I understand that 5 was maybe too many, but thumbs up/down systems need an explicit third "eh, it's okay" value for things I don't hate and don't want to save to my library, but that I would like the system to know I have an opinion on.
I know that consuming something and not thumbing it up/down sort of does that, but it's a vague enough signal (it could also mean "not close enough to the keyboard/remote to thumb up/down") that recommendation systems can't count it as an explicit choice.
by jorvi
1/21/2026 at 7:24:04 PM
Here's the discussion from back in the day when this changed: https://news.ycombinator.com/item?id=837698
In practice, people generally didn't even vote with two options, they voted with one!
IIRC youtube did even get rid of downvotes for a while, as they were mostly used for brigading.
by steveklabnik
1/21/2026 at 9:05:24 PM
> IIRC youtube did even get rid of downvotes for a while, as they were mostly used for brigading.
No, they got rid of them most likely because advertisers complained that when they dropped some flop, they got negative press from media going "lmao 90% dislike rate on new trailer of <X>".
Stuff disliked to oblivion was mostly either just straight-out bad or wrong (in the case of bad tutorials/info); brigading was a very tiny percentage of it.
by PunchyHamster
1/22/2026 at 12:22:40 AM
Oh, didn't they remove the dislike count after people absolutely annihilated one of their yearly rewinds with dislikes?
by rednafi
1/22/2026 at 6:26:55 PM
It was removed after some presidential speeches attracted heavy dislikes.
by direwolf20
1/22/2026 at 11:59:43 AM
The original sin is argued to be YouTube Rewind 2018. But it took them until 2021 to roll out the change.
by machomaster
1/22/2026 at 4:16:01 PM
Well, people annihilated every one of their rewinds with dislikes. But yeah, that might've contributed.
by PunchyHamster
1/22/2026 at 12:17:35 AM
YouTube never got rid of downvotes; they just hid the count. Channel admins can still see it and it still affects the algorithm.
by UltraSane
1/22/2026 at 1:34:21 AM
YouTube always kept downvotes and the 'dislike' button. The change (which still applies today) was that they stopped displaying the downvote count to users; the button never went away, though.
Visit a YouTube video today and you can still upvote and downvote with the exact same thumbs up or down. The site, however, only displays the count of upvotes to you. The channel owners/admins can still see the downvote count and the downvotes presumably still inform YouTube's algorithms.
by giobox
1/22/2026 at 12:02:24 PM
There is also an independent "Return Youtube Dislike" browser extension that shows the dislike numbers. It's very convenient.
by machomaster
1/22/2026 at 1:04:37 PM
That doesn't show the real number, only "a combination of scraped dislike stats and estimates extrapolated from extension user data."
by steveklabnik
1/22/2026 at 5:30:21 PM
I think the absence of the count in the official app, combined with the existence of this tool, makes this point largely irrelevant. The company in question could easily reverse this decision overnight since the data exists, but until then people adjust to an available proxy estimate. It is interesting though, because it shows a clear intent of "we don't want to show actual sentiment".
by iugtmkbdfil834
1/23/2026 at 12:55:43 AM
The official YouTube stats (views, comments, upvotes) are not real/real-time either. But that's the best we have. And the dislike numbers are in the same universe of credibility and closeness to reality. It's definitely good enough.
If you want the downvote data to be more precise, do your part and install the extension! :-)
by machomaster
1/20/2026 at 10:47:19 PM
How come accuracy has only 50% weight?
“You’re absolutely right! Nice catch how I absolutely fooled you”
by piskov
1/20/2026 at 9:27:31 PM
Yes, absolutely. This aligns with what we found. It seems to be necessary to be very clear on scoring (at least for Opus 4.5).
by lorey
1/20/2026 at 9:22:54 PM
This actually seems like really good advice. I'm interested in how you might adapt this to things like programming language benchmarks.
Would you have independent tests, check whether each one passes (yes or no), and then weight some (more complicated) tasks more heavily than others, or how exactly?
by Imustaskforhelp
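One possible reading of the question above, as a hedged sketch: each independent test is a yes/no check, and harder tasks carry a larger weight. The function name, the (passed, weight) layout, and the example weights are all assumptions for illustration; nothing in the thread specifies them.

```python
def weighted_pass_rate(results: list[tuple[bool, float]]) -> float:
    """Share of total difficulty weight earned by passing tests.

    Each entry is (passed, weight), where weight reflects how hard the task is.
    """
    total = sum(weight for _, weight in results)
    if total <= 0:
        raise ValueError("total weight must be positive")
    earned = sum(weight for passed, weight in results if passed)
    return earned / total


# Hypothetical run: two easy tasks pass, one hard task (weight 3) fails.
results = [(True, 1), (True, 1), (False, 3)]
print(f"{weighted_pass_rate(results):.2f}")  # prints 0.40
```

An unweighted pass rate would report 2 of 3 for the same run, so the weighting is what makes the harder task count for more.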
1/20/2026 at 9:35:49 PM
Not sure I'm fully following your question, but maybe this helps:
IME deep thinking has moved from upfront architecture to post-prototype analysis.
Pre-LLM: Think hard → design carefully → write deterministic code → minor debugging
With LLMs: Prototype fast → evaluate failures → think hard about prompts/task decomposition → iterate
When your system logic is probabilistic, you can't fully architect in advance—you need empirical feedback. So I spend most time analyzing failure cases: "this prompt generated X which failed because Y, how do I clarify requirements?" Often I use an LLM to help debug the LLM.
The shift: from "design away problems" to "evaluate into solutions."
by hamiltont
1/20/2026 at 9:33:49 PM
Isn’t this just rubrics?
by 46493168
1/20/2026 at 11:28:28 PM
It's a weighted decision matrix.
by 8note