>if tested via the codex cli "harness" it wouldn't be a pure model-to-model comparison any more.

Well, that's already not a very fair comparison. We've known for years (from one of the early-ish LLM papers, maybe someone knows which one) that prompting makes an enormous difference to agent performance and, most strikingly, that the same prompt that massively boosts performance on one model can massively reduce it on another.
So you already need to tune the prompts for each model if you want anything approaching best results.
Now what's really amusing is that if you run models without their official harness, they can actually do way better on some benchmarks! For example, on Terminal Bench 2 [0], Claude Opus 4.6 goes from #33 (Claude Code) to #5 (custom harness). There are similar results for Codex.
Granted, this is for one very specific benchmark, but I still thought it was funny: you'd expect the harness made by the same company to be the best for all tasks, and that's clearly not the case. (For specific tasks, it's actually quite trivial to outperform a general-purpose harness.)
[0] https://www.tbench.ai/leaderboard/terminal-bench/2.0