Show HN: Agent-skills-eval – Test whether Agent Skills improve outputs

5/7/2026 at 3:19:02 PM

I'm skeptical skills will outperform training given that Opus 4.7 already ignores a 720-byte CLAUDE.md telling it to use tidewave (a Rails MCP server with 6 tools) for db queries. When I asked a new claude session about a record it called

> Bash(DATABASE_URL=$(grep -E '^DATABASE_URL=' .env 2>/dev/null | head -1) echo "ok")

even though I have in CLAUDE.md:

> For database queries, use tidewave first.

I then prompted:

> use tidewave as per CLAUDE.md. also diagnose why you failed to heed that

> ● Diagnosis first: I defaulted to shell habits (env grep → psql) instead of pausing to recall the CLAUDE.md rule that tidewave is the first-line DB tool. The trigger was "look at this record" — I should have read that as "run a SQL query" and reached for tidewave immediately.

If Opus 4.7 doesn't follow simple CLAUDE.md instructions, I'm not sure what benefits other markdown files could bring. I don't trust Opus's own explanation, but it could point to the fact that the system prompt for bash is much longer than CLAUDE.md with tidewave.

While LLM judging could be helpful, I think the tool-call assertions (https://github.com/darkrishabh/agent-skills-eval#what-you-ge...) may be the most useful thing in agent-skills-eval given that it's the only objective measure of compliance.
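To make "objective measure" concrete, here is a minimal sketch of a tool-call assertion. It is not agent-skills-eval's actual API: the transcript shape (dicts with `tool` and `input` keys), the function `first_db_tool`, and the tool label `tidewave.execute_sql_query` are all assumptions made up for illustration.

```python
from __future__ import annotations

from typing import Optional, TypedDict


class ToolCall(TypedDict):
    tool: str   # hypothetical field: name of the tool the agent invoked
    input: str  # hypothetical field: raw argument string


# Substrings that suggest a shell command is touching the database (illustrative only).
DB_MARKERS = ("psql", "DATABASE_URL")


def first_db_tool(calls: list[ToolCall]) -> Optional[str]:
    """Return the name of the first tool call that touches the database, or None."""
    for call in calls:
        if call["tool"].startswith("tidewave"):
            return call["tool"]
        if call["tool"] == "Bash" and any(m in call["input"] for m in DB_MARKERS):
            return "Bash"
    return None


# Roughly the failing session described above: Bash is reached for before tidewave.
failing: list[ToolCall] = [
    {"tool": "Bash", "input": "DATABASE_URL=$(grep -E '^DATABASE_URL=' .env) echo ok"},
    {"tool": "tidewave.execute_sql_query", "input": "SELECT 1"},
]
compliant: list[ToolCall] = [
    {"tool": "tidewave.execute_sql_query", "input": "SELECT 1"},
]

print(first_db_tool(failing))
print(first_db_tool(compliant))
```

An eval would then assert that the result starts with `tidewave`, which passes or fails without a judge model's opinion.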

by reedlaw

5/7/2026 at 8:39:12 PM

I've had minor success with chiding the clanker, after it ignores something, to "please revise AGENTS.md to never do <whatever stupid thing it did> to prevent future assistants from doing it."

So, at least heuristically, it should know _why_ it ignored whatever and hopefully pulls the correct anti-matter context. It took about two repetitions of this to get it to use pg-promise instead of psql to do queries for me. I assume the longer the context goes on, the less likely any of the priming works.

by cyanydeez

5/7/2026 at 10:18:57 PM

You're using Claude Code. That's your problem.

Use a different harness

by NamlchakKhandro

5/8/2026 at 12:44:18 PM

Codex is only slightly better, and that fluctuates so I switch back and forth.

by reedlaw

5/7/2026 at 4:31:57 PM

Use a hook

by erispoe

5/7/2026 at 5:11:48 PM

I tried to create a hook that would detect when token usage was running out and write HANDOFF.md so I could switch to another agent and finish the current task. It never worked reliably. To make a hook for db queries, it would need to run before each bash call, check if it looks like a query, and then exit with a new prompt, e.g.: "Use tidewave's execute_sql_query for DB access". But then it could just ignore the prompt the same as CLAUDE.md. What if I really wanted to use bash for a specific task? The real issue is that prompts are not tightly coupled with capabilities. If we admit that, then skills are overhyped.
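For reference, the db-query hook described here could look roughly like the sketch below. It assumes the harness passes each pending tool call as a JSON event with `tool_name` and `tool_input.command` fields, and treats exit code 2 as "block and show the message to the model". That is my understanding of how Claude Code documents pre-tool hooks, but verify the field names and exit code against your harness. The `ALLOW_RAW_DB=1` escape hatch is made up for illustration.

```python
import re

# Heuristic patterns for "this shell command is really a DB query" (not exhaustive).
DB_PATTERNS = [r"\bpsql\b", r"\bDATABASE_URL\b"]

# Hypothetical opt-out marker for the cases where bash really is wanted.
OVERRIDE_MARKER = "ALLOW_RAW_DB=1"

REDIRECT_MESSAGE = "Use tidewave's execute_sql_query for DB access"


def looks_like_db_query(command: str) -> bool:
    """True if any DB pattern appears in the shell command."""
    return any(re.search(p, command) for p in DB_PATTERNS)


def decide(event: dict) -> tuple:
    """Return (exit_code, message) for one tool-call event.

    Exit code 0 lets the call through; 2 is assumed to block it.
    """
    if event.get("tool_name") != "Bash":
        return 0, ""
    command = event.get("tool_input", {}).get("command", "")
    if OVERRIDE_MARKER in command:
        return 0, ""
    if looks_like_db_query(command):
        return 2, REDIRECT_MESSAGE
    return 0, ""


# Example events, shaped per the assumption above.
blocked = decide({"tool_name": "Bash", "tool_input": {"command": "psql -c 'select 1'"}})
allowed = decide({"tool_name": "Bash", "tool_input": {"command": "ls -la"}})
print(blocked, allowed)
```

To use it as a real hook, the script would read the event with `json.load(sys.stdin)`, print the message to stderr, and call `sys.exit` with the returned code. Note that the override marker does not fix the coupling problem: the model can add the marker itself just as easily as it can ignore the prompt.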

by reedlaw

5/7/2026 at 4:49:59 PM

It's hard to make hooks work here, since the default approach it's using is to call the URL directly.

I think it's better to have a repo-level skill instead, titled something like "connecting_to_db.md" and demonstrate exactly how to connect. Codex has been pretty good at referring to skills but it depends on context at the end of the day.

by rirze

5/7/2026 at 7:07:38 PM

[flagged]

by darkrishabh

5/7/2026 at 12:16:20 PM

Depending on skill, Claude already does this when creating new skills with their skill-creator skill (what a sentence), it's pretty neat. It creates ~6 subagents with and without the skill and judges if they differ in performance.

by ChairmanLmao

5/7/2026 at 3:02:39 PM

The Claude-provided skill-creator is a decent jumping-off point. It is easy enough to start with, but unless the skill is really simple I found it best to consider it a scaffold for building more tailored evals and reports.

The report leaves out a lot of detail. Several changes I found useful were: Pair with/without on same screen as left/right for easier viewing, token count for skill consumed, token used per run, time, pass rate, estimated cost, detailed aggregate stats, a parsed version of the conversation log (capturing the jsonl with each run, sometimes reading the log is the only way to find out why it's screwing up), work output logging (in my case screenshots and outputted script code), better formatting (syntax highlighting, log formatting).

Finally, I think the most useful thing was adding a self-reflection pass. After an eval is done, another agent looks at everything from that eval and tries to evaluate what went wrong along the way and what should be added to the skill, and conversely, from the without skill run what was in the skill that didn't need to be. It produces a skill change recommendation file for each eval. A further summary agent aggregates up all those recommendations in a way I can feed back to an agent.

by dsmmcken

5/7/2026 at 7:09:03 PM

[flagged]

by darkrishabh

5/7/2026 at 10:03:56 AM

The example model in the documentation is 4o-mini, you might want to update that to a more recent model.

As an aside, 4o-mini came out months before agent skills were released… I’m curious how it performs with choosing to load skills in the first place?

by ssgodderidge

5/7/2026 at 10:31:48 AM

It’s an artifact of the documentation being AI generated, they usually pick gpt4-era models, without giving it further thought.

For Gemini it seems to always pick 2.5 despite 3.1 being the latest, Claude the 3.5-era models.

Not sure what’s preventing AI labs from ensuring this stuff is refreshed during training.

by stingraycharles

5/7/2026 at 12:27:23 PM

I was wondering the same and learned the model doesn't know about itself during training [0]

[0] https://developers.googleblog.com/closing-the-knowledge-gap-...

by simonpure

5/7/2026 at 8:43:17 PM

the model doesn't know itself, but all these larger models are generating a significant amount of synthetic data from the prior models, and the prior models are all context bloated renditions; you fill the KV cache with whatever alignment you want, and then generate synthetic data.

That training on existing models is what brings out various other things about other models; then there are models that are just like snowballs, where you build one iteration, then you give it its identity, then you train on that with the same synthetic generation.

So a model's generations could at some point include its own name.

by cyanydeez

5/8/2026 at 3:25:39 AM

I don’t think what you’re saying makes a lot of sense. You don’t “fill the KV cache with whatever alignment you want.” That doesn’t exist. The KV cache is an inference optimization, and is populated by running tokens through the model.

Synthetic data is generated by other models, and yes this is often where identity propagates.

I think with the snowballing you mean things like iterative self distillation? That’s definitely not done unsupervised, because of the risk of model collapse, and typically heavily curated and/or mixed with real data.

by stingraycharles

5/7/2026 at 10:26:05 AM

The skill is deterministically added to the prompt by the harness before the target model is invoked. There is no “choosing” to load a skill. You might be confusing skills with tools (MCP etc).

by block_dagger

5/7/2026 at 11:48:19 AM

The metadata is loaded by the harness, but the LLM still needs to choose to load the rest of the skill, no?

by ssgodderidge

5/7/2026 at 3:29:26 PM

You are correct. I'm not sure what the parent is trying to say.

by albedoa

5/7/2026 at 12:27:33 PM

Define “load.” It follows the instructions in the prompt - its natural behavior.

by block_dagger

5/7/2026 at 1:59:46 PM

I was using the term as you used in your comment. I believe the official term is "Activation" however:

> Activation: When a task matches a skill’s description, the agent reads the full SKILL.md instructions into context.[1]

> Full instructions load only when a task calls for them, so agents can keep many skills on hand with only a small context footprint.

[1]: https://agentskills.io/home#how-do-agent-skills-work

by ssgodderidge

5/7/2026 at 6:24:19 PM

Ah, I misunderstood this, thanks for the link. You are correct. I was assuming this system worked like CLAUDE.md in that it was deterministically added to the context without the LLM choosing to add it. My mistake.

by block_dagger

5/7/2026 at 12:38:51 PM

Concretely, it has to decide whether it is in a circumstance where that skill is useful, pull the instructions into the context and follow them.
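A toy sketch of that two-stage flow, for anyone following along. This is not any harness's real implementation: the `Skill` class, `metadata_prompt`, and `activate` are hypothetical names, and the model's decision is stubbed out as a plain string.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Skill:
    name: str
    description: str  # always placed in context by the harness
    body: str         # full SKILL.md instructions, loaded only on activation


def metadata_prompt(skills: List[Skill]) -> str:
    """What the harness adds up front: one line per skill, no bodies."""
    return "\n".join(f"- {s.name}: {s.description}" for s in skills)


def activate(skills: List[Skill], chosen: str, context: List[str]) -> bool:
    """Append the chosen skill's body to the context; return False if unknown."""
    for s in skills:
        if s.name == chosen:
            context.append(s.body)
            return True
    return False


skills = [
    Skill(
        name="connecting-to-db",
        description="How to query the app database",
        body="For database queries, use tidewave first.",
    )
]

# Stage 1 (deterministic): only metadata is in context.
context = [metadata_prompt(skills)]

# Stage 2 (model's choice, stubbed here): the model names a skill to read.
model_choice = "connecting-to-db"
activated = activate(skills, model_choice, context)
print(activated, len(context))
```

The point of the split is that stage 1 always happens, while stage 2 only happens if the model decides the task matches, which is exactly where a skill can be silently skipped.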

by hyperpape

5/7/2026 at 2:17:25 PM

Yep, and as with any other instructions, it can sometimes not pull the skill even if the trigger conditions are there.

by cassianoleal

5/7/2026 at 8:43:59 PM

it depends on the harness. opencode appears to prompt the models with tools and skills when answering questions.

by cyanydeez

5/7/2026 at 5:29:55 PM

This is all still really early stuff, but there was a blog yesterday that got me thinking we need a way to send telemetry data for work being done by agents out to a central agent the org controls. It would be responsible for creating skills based on the work people are doing - or in other words the stuff they're correcting the agents on. And then you could develop skills for an entire department (customer service, engineering, marketing, etc).

This tool has me thinking there's some merit to setting that up. My only real qualm is that I'm not super convinced skills are that great yet. I'm trying to get better at developing them in my workflow, but still get a lot of results where they are ignored even after spending time trying to tighten them up.

by TheGRS

5/7/2026 at 9:20:36 AM

Are there any published results gathered using this?

by egeozcan

5/7/2026 at 11:28:54 AM

Not sure but I'm interested in trying it because I've for a while sensed that adding SKILLS.md degraded my overall experience - most probably I wrote them wrong. But this sort of tooling I guess can help me figure it out?

by jarym

5/7/2026 at 7:12:34 PM

Definitely, and this is something that needs more community support

by darkrishabh

5/7/2026 at 11:56:49 PM

With-skill vs without-skill evals are useful, but what about comparing skills against each other? Is there an emerging standard for saying one Skill is better than another, beyond custom pass/fail evals?

by codecheers

5/7/2026 at 10:08:40 AM

How do you iterate on the judge prompt? Is there an auto rater?

by ianhxu

5/7/2026 at 11:44:12 AM

That is the billion dollar question. Who watches the watchmen?

by datadrivenangel

5/7/2026 at 11:45:33 AM

the watchwatchmen

by blitzar

5/7/2026 at 12:09:07 PM

exactly

by ianhxu

5/8/2026 at 8:59:51 PM

One thing I'd want in the report is token cost per run alongside correctness. I've seen skills that technically improve outputs but cost 35-40% more tokens, so they're not really wins in production. Without that number the with/without comparison is only half the story.
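Roughly what that could look like as a report row, as a sketch only. The `Run` record and `summarize` function are hypothetical, and the sample numbers are invented to show the arithmetic, not measured results.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List


@dataclass
class Run:
    passed: bool  # did the run satisfy the eval's assertions
    tokens: int   # total tokens consumed by the run


def pass_rate(runs: List[Run]) -> float:
    return mean(1.0 if r.passed else 0.0 for r in runs)


def avg_tokens(runs: List[Run]) -> float:
    return mean(r.tokens for r in runs)


def summarize(with_skill: List[Run], without_skill: List[Run]) -> Dict[str, float]:
    """Report the correctness gain next to the token overhead of using the skill."""
    return {
        "pass_rate_delta": pass_rate(with_skill) - pass_rate(without_skill),
        "token_overhead_pct": 100.0 * (avg_tokens(with_skill) / avg_tokens(without_skill) - 1.0),
    }


# Invented sample data: the skill fixes one failure but uses more tokens per run.
with_skill = [Run(True, 1400), Run(True, 1400)]
without_skill = [Run(True, 1000), Run(False, 1000)]

report = summarize(with_skill, without_skill)
print({k: round(v, 1) for k, v in report.items()})
```

Seeing both numbers side by side lets you decide whether the pass-rate gain is worth the extra tokens for your workload.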

by VinamraYadav

5/7/2026 at 4:52:50 PM

Why so narrowly eval just with/without skill?

Same approach is useful for everything: model, params, prompt, sub-agents, skills, rag, etc?

by scosman

5/7/2026 at 7:13:40 PM

Then you go in the territory of benchmarking. But I love the idea here. Having standards around those can really help move the needle

by darkrishabh

5/7/2026 at 12:33:24 PM

having token counts surface on each side in the report would be super useful

by hiroto_lemon

5/12/2026 at 10:57:54 AM

[dead]

by galaSerge

5/11/2026 at 3:18:25 PM

[dead]

by obsidian_spider

5/9/2026 at 3:41:47 AM

[dead]

by gen99

5/9/2026 at 3:56:38 PM

[dead]

by Oxlamarr

5/7/2026 at 11:02:24 AM

[dead]

by bixxie09

5/7/2026 at 1:16:38 PM

[dead]

by ajaystream

5/8/2026 at 7:07:38 AM

[dead]

by hidai25

5/7/2026 at 8:10:21 AM

[dead]

by huflungdung