alt.hn

2/25/2026 at 12:09:38 AM

Agents.md file isn't the problem. Your lack of Evals is

https://tessl.io/blog/your-agentsmd-file-isnt-the-problem-your-lack-of-evals-is/

by sjmaplesec

2/25/2026 at 4:18:53 PM

so how would you eval your own claude.md? Each context is unique to the project, team, and personal root claude.md. Do you just take a given task and ask it to redo the same one over and over again against a known solution? Do you just keep using it and "feel" whether or not it's working? How is that different from what everyone is already doing?
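The "redo the same task over and over against a known solution" idea can be sketched as a tiny eval loop. This is only an illustration: `run_agent` is a hypothetical stand-in for however you invoke your agent, and `check` is your known-solution test.

```python
def pass_rate(run_agent, task, check, n=10):
    """Replay the same task n times and return the fraction of runs
    whose output satisfies `check` (a known-solution test).

    run_agent: callable taking a task string, returning the agent's output.
    check:     callable taking that output, returning True on a pass.
    """
    return sum(1 for _ in range(n) if check(run_agent(task))) / n
```

Comparing this number before and after a claude.md change is the difference between an eval and going by feel.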

by hamuraijack

2/25/2026 at 3:42:52 PM

I don't even know what an eval is.

by pavel_lishin

2/25/2026 at 8:18:03 PM

If it was easy to write evals, I would come at it from that direction.

But since it's not, what I do to avoid working on AGENTS.md blind is I test it on whatever causes me to write it.

I have some prompt, and the AI messes it up in some way that I think it shouldn't; maybe it's something I've seen it do before and I'm sick of it. So I update AGENTS.md, revert the changes, /undo in the chat context, and re-submit the same prompt.
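That revert-and-resubmit loop is essentially a regression suite. A minimal sketch of what keeping those captured prompts around could look like (the prompts, checks, and `run_agent` hook are all made up for illustration):

```python
# Each time the agent messes a prompt up, capture it together with a
# check that would have caught the mistake; replay the whole list after
# every AGENTS.md edit instead of re-testing only the latest prompt.
REGRESSIONS = [
    # (prompt, check on the agent's output) -- illustrative entry only
    ("add logging to the fetch helper", lambda out: "print(" not in out),
]

def replay(run_agent):
    """Return the prompts that still fail after an AGENTS.md change."""
    return [prompt for prompt, check in REGRESSIONS
            if not check(run_agent(prompt))]
```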

by furyofantares

2/25/2026 at 3:34:29 AM

Okay, but how would I write evals for my project's agents file? Any good examples out there?

by skybrian

2/25/2026 at 12:15:03 PM

The agents are smart enough to write the evals too.

It's agents all the way down!

Submit a GitHub repo containing skills to Tessl, and it will generate the evals, run them, and present the results. https://tessl.io/registry/skills/submit

The evals and results are all shown, no login necessary, so you can assess them yourself. e.g. https://tessl.io/registry/skills/github/coreyhaines31/market... (click details to see the eval texts).

by popey

2/25/2026 at 3:25:32 PM

At first glance this looks like an entire ecosystem full of slop, and by running that eval you generate more? I'm looking for something a bit more curated.

by skybrian

2/25/2026 at 7:19:41 AM

I wrote https://ai-evals.io (community site) to make the concept approachable no matter what tools you choose to use.

You can learn about them by evaluating that site https://github.com/Alexhans/eval-ception and then the pattern should be easy to test on your own thing.

by alexhans

2/25/2026 at 3:31:23 PM

Doing an eval on itself is clever but confusing for the reader. How about a tutorial explaining how to do evals on something more normal?

by skybrian

2/25/2026 at 5:10:50 PM

I'd be happy to. One thing that is tough is knowing what will resonate with the audience and not being too simple or too complex.

What do you think would resonate with you or with the audience you're thinking about?

That repo also has an illustrative eval for an Agent Skill in Airflow for localization:

https://github.com/Alexhans/eval-ception/tree/main/exams/air...

by alexhans

2/25/2026 at 6:56:41 PM

How about taking a small, real open source project that has an AGENTS.md and showing how to add evals and optimize it?

The question I have is: what are we optimizing for and how do we measure it?

In your own repos, I see you have a fork of safepass, which seems like a nice simple project, but it doesn't have an agents file yet.

by skybrian

2/25/2026 at 2:45:17 PM

I mean... Claude kept putting deprecated APIs into code I was getting it to write, so I adjusted the prompt to say not to, and it seemed to help.
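For what it's worth, that kind of fix is easy to turn into a check rather than a prompt tweak you have to eyeball. A sketch, with made-up deprecated names standing in for whatever Claude kept using:

```python
# Illustrative list only -- substitute the APIs your agent keeps reaching for.
DEPRECATED = ["imp.load_module", "asyncio.get_event_loop"]

def deprecated_apis_used(code: str) -> list[str]:
    """Return any listed deprecated API names appearing in generated code."""
    return [name for name in DEPRECATED if name in code]
```

Running this over the agent's output before and after the prompt change tells you whether "it seemed to help" actually held up.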

by stuaxo

2/25/2026 at 12:47:36 PM

Ai;dr

by theodorewiles