3/19/2026 at 5:02:03 PM
I really want automated QA to work better! It's a great thing to work on. Some feedback:
- I definitely don't want three long new messages on every PR. Max 1, ideally none? Codex does a great job just using emoji.
- The replay is cool. I don't make a website, so maybe I'm not the target market, but I'd like QA for our backend.
- Honestly, I'd rather just run a massive QA run every day, and then have any failures bisected, rather than per-PR.
- I am worried that there's not a lot of value beyond the intelligence of the foundation models here.
by blintz
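The "daily QA run, then bisect any failures" idea above can be sketched as a binary search over the day's commits. This is only an illustration: `first_bad_commit` and `is_bad` are hypothetical names, and it assumes commits are ordered oldest to newest, the newest commit fails, and a failure persists in every later commit once introduced. `git bisect run` automates the same search against a real repository.

```python
from typing import Callable, Sequence


def first_bad_commit(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the earliest commit for which is_bad() is true.

    Assumptions (hypothetical setup, not from the thread):
    - commits are ordered oldest to newest
    - the newest commit is known to fail the QA run
    - once a commit is bad, every later commit is bad too
    """
    if not commits:
        raise ValueError("commits must not be empty")
    lo, hi = 0, len(commits) - 1  # invariant: commits[hi] is bad
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid  # first bad commit is at mid or earlier
        else:
            lo = mid + 1  # first bad commit is after mid
    return commits[lo]


# Usage: pretend the QA suite started failing at commit "d4".
day = ["a1", "b2", "c3", "d4", "e5"]
culprit = first_bad_commit(day, lambda c: day.index(c) >= day.index("d4"))
```

Each call to `is_bad` stands in for one full QA run, so the number of runs grows with the logarithm of the number of commits rather than one run per PR.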
3/20/2026 at 3:25:03 PM
This benchmark measures whether tests are relevant, coherent, and have good coverage. But there's a more subtle type of error: AI creates tests that look specific to the PR but are actually generic patterns mapped from the training data: correct test structure, reasonable assertions, but not actually interacting with what this specific piece of code does. How do you differentiate between "understood the code and generated a targeted test" and "recognized this looks like an auth flow and produced a standard auth test template"? The latter might still pass your coherence/relevance metrics while missing the actual exception.
by thienannguyencv
3/19/2026 at 5:37:47 PM
Agree on your last point, and it's going to be a very bitter lesson. In any case, you probably wanna shift a lot of the code verification as far left as possible, so doing review at PR time isn't the right strat imo. And Claude/Codex are well positioned to do the local review.
by Bnjoroge
3/20/2026 at 6:54:56 PM
Agree on the shift-left concept, but curious on your thoughts about a checker-maker loop. Running a PR review bot is different from running /review on local dev, right? Also, there have already been instances of Claude patching the test scripts to make the tests pass instead of fixing the bugs.
by ashgam
3/20/2026 at 1:27:46 AM
[flagged]
by arkheosrp26
3/20/2026 at 5:14:17 AM
Isn’t the last point the case with every AI startup? Nobody has a moat and it’s tough to build one because the playing field is so level.
by monkpit
3/20/2026 at 11:47:30 AM
I've been confused by this with many LLM products in general. Sometimes infrastructure is part of it so there's that, but often it seems like the product is a magic incantation of markdown files.
by _heimdall
3/20/2026 at 6:57:10 PM
Solving for infrastructure is a huge part of the problem too. Curious to know what you think about it?
by ashgam
3/20/2026 at 7:15:54 PM
Here I'm mostly considering the seemingly countless services that are little more than some markdown files and their own API passing data to/from the LLM provider's API. By no means is that every AI product today, and I wasn't saying the OP's QA service falls into that bucket.
More of a general comment related to the GP, maybe too off topic here though?
by _heimdall
3/19/2026 at 5:13:06 PM
Thanks for the feedback!
- Agreed that the form factor can be condensed, with a link to detailed information.
- With the codebase understanding, backend is where we are looking to expand and provide value.
- The intelligence of the models does lay the foundation, but combining the strengths of these models unlocks a system of specialized agents that each reason about the codebase differently to catch the unknown unknowns.
by Visweshyc