Open Code Review – An AI-powered code review CLI tool

6/5/2026 at 5:16:57 AM

Ran it on a subset of 10 of the 50 PRs in this benchmark https://codereview.withmartian.com

- very good recall (~74%, e.g. found a lot of the golden issues)

- not so good precision (~12%, e.g. lots of false positives)

- the precision causes the F1 to tank (~20%, if this stays the same on the full 50 sample it would puts it almost last, even less than Kilo+Grok)

by eranation

6/5/2026 at 5:47:16 AM

I would say that recall is the most important metric here though. I'd want it to catch all the issues.

False positives are easy to ignore.

by akie

6/5/2026 at 6:27:48 AM

What, no they're not. You still need to analyze them to understand they are false positives. It's time wasted

by witx

6/5/2026 at 10:00:11 AM

Agree, it's something that will eventually teach your developers to ignore points raised as it's mostly garbage.

by chaoz_

6/5/2026 at 7:39:18 AM

Finding problems is optimizing for the customer. Avoiding false positives is optimizing for the developer. Which is right depends on your org's culture.

by onion2k

6/5/2026 at 7:54:27 AM

If I flag every line in your PR as a potential security bug then I have 100% recall.

Obviously you need a mixture of high recall and low false positive rate. If 7/8 flagged items are fine its much more likely people will ignore the warnings, much like they would any security tool with a 90% false positive rate. That is not optimized for the customer.

by evolve-maz

6/5/2026 at 8:20:57 AM

The ideal is finding all the problems without getting any false positives, but the reality is that you can't often have that. An org's engineering culture should be designed to fix problems with systems. If you're seeing an 87.5% false positive rate that should be seen as another engineering problem to fix. However, that's a separate issue to whether or not you accept false positives in a system designed to find problems.

Presenting it as either a system that misses real problems or a system that has a huge number of false positives is a false dilemma. You can have a system that's designed to find all the problems and then optimize it to reduce the false positives. If you can't reduce the number then you optimize to identify false positives as fast as possible. Just ignoring the identified problems on the assumption that they're false is giant red flag and a signal that the org has a very a broken engineering culture (but, as you say, that's quite common.)

by onion2k

6/5/2026 at 8:05:08 AM

Yep. Similarly - you can predict with 99.9% accuracy if a Volcano will erupt today by using a rock that has "No" written on it.

by eranation

6/5/2026 at 9:30:17 AM

> If I flag every line in your PR as a potential security bug then I have 100% recall.

No. A code review isn't about "flagging a line of code", it's about identifying an issue or a risk. If a 10-line PR has one issue and you leave a comment on every single character, if you still miss the issue you have 0% recall.

by williamdclt

6/5/2026 at 11:21:35 AM

Which LLM did you use? I assume that will make a pretty big difference.

by tirpen

6/5/2026 at 12:16:09 PM

gpt-5-mini and gpt-5.5 (had to tweak the code a bit to make it work)

Surprisingly not as big of a difference as one would hope. It turns out that smarter models are more conservative. Smarter model / More thinking = slightly worse recall sometimes.

I think it says more about the benchmark itself perhaps. Reviews are highly opinionated. And it could be that the smarter models are actually better, just the “golden” state is very opinionated.

by eranation

6/6/2026 at 4:09:10 PM

That must have been expensive! Thanks for running the benchmark and sharing.

I tested Coducky (my AI review macos app) on the full 50-PR Martian benchmark using qwen3.7-plus via OpenRouter as the reviewer with a lightweight pre-save precision gate with deepseek-v4-flash. The score (gpt 5.2 judge) was 43.0% precision / 35.8% recall / 39.0 F1. That puts it about inline with CodeRabbit. This cost around $7 to run the full 50 PRs.

Your post inspired me to set up a test harness for my app to continue to test model combinations. Coducky allows you to select whichever models/subscriptions you like to run reviews, but it could make sense to build a collection of model combinations that work well for this.

by jayphen

6/5/2026 at 11:02:51 AM

False positives from the deterministic audits a very difficult problem to address. Comparing and deduplicating across different methods or LLM audits seems to the only way.

by bobkb

6/5/2026 at 8:07:03 AM

[flagged]

by isabellehue

6/5/2026 at 1:53:27 AM

If you've codex what does it add over codex's default app? I am confused. Can't you simply ask codex in another tab to just do a code review?

by faangguyindia

6/5/2026 at 3:49:39 AM

Developers should definitely use whatever tool they use to review the code they (or the tool) just wrote. We have a skill that does this in a loop - spin subagents, review (based on our coding standards), triage the review in another subagent, fix what's applicable, push back on what's not, and we run this in a loop. This is before you even open a PR.

The idea of a PR is for others to find things that you have a blind spot to, and also leave some paper trail on the thought process. E.g. if something was not fixed, there is a history of a comment and a reason on WHY it wasn't fixed. If you do all that only locally, that context is lost.

We noticed that even after doing this self review loop multiple times, we still find issues (either via other models / tools or via humans that have the "tribal knowledge")

Maybe one day AI will write perfect code and can review itself, but even if it's 0.1% chance it has a bug, or 1 in a million it will do something a bit sinister (like open a backdoor just in case you try to shut it down) - then I really think there is always going to be a need for humans to review something.

by eranation

6/5/2026 at 2:43:25 AM

> Can't you simply ask codex in another tab to just do a code review?

You are likely to get better results if you do not use the same model for review that wrote the code. I typically use Opus for code editing and GPT 5.5 for peer review using an automation with skills.

Training set is different between models. If there are gaps in coverage in one model, you want a different model reviewing the work. The second model will its own gaps, but the gap list is not identical.

by cheema33

6/5/2026 at 7:42:51 AM

> You are likely to get better results if you do not use the same model for review that wrote the code

There’s no evidence of this. I guess you are anthropomorphising models (i.e., it’s good that - different human reviews your code)

by sdevonoes

6/5/2026 at 9:22:56 AM

Yeah, one model over another seems to matter less, they respond differently to the same prompts, so if anything, I'd use multiple prompts over choosing one model over another.

However, using two models to generate two reviews easily beats doing one model and one review, as some models seem to "care" more about certain things, but you'll just miss different things if you change the model rather than add more.

by embedding-shape

6/5/2026 at 4:42:38 PM

There is some evidence.[1] The best reviewer is a different model with fresh context, worst is same model with same context.

1. https://arxiv.org/pdf/2603.04582

by tylermarques

6/5/2026 at 2:20:50 PM

well they are different. human or not. so it makes sense to get it reviewing by "something" different that one that wrote code.

by dominotw

6/5/2026 at 5:56:58 AM

Results also depend on the prompt. You get different results if you ask to review the PR and focus on particular file than if you don't make it focus.

Or if you make it "be a security engineer" with particular focus points.

Or make it a grammar nazi, it will find way more typos than without such focus.

Of course all of those "focuses" needs to be in a separate context (agent/subagent) to make it work.

by krzyk

6/5/2026 at 2:55:33 AM

I would suggest that you reverse those roles. gpt-5.5 as the implementer and Opus as the reviewer.

by Art9681

6/5/2026 at 4:05:07 AM

They find different things, and there's no reason to use one model for review. You want to review it until there's nothing left to be unearth.

And if you put the review effort into polishing an impl plan, then it doesn't matter which model implements it either.

by hombre_fatal

6/5/2026 at 3:26:26 AM

How come? I find Opus to have better taste and GPT to have more rigor.

by pluralmonad

6/5/2026 at 9:17:40 AM

Mechanics of running their command aside, I think the main value add is all the rules: https://github.com/alibaba/open-code-review/tree/main/intern...

Like with "SKILL" files in general, it's got to do with Prompt Engineering: https://en.wikipedia.org/wiki/Prompt_engineering#Rationale

by pramodbiligiri

6/5/2026 at 2:22:06 AM

Presumably nothing. Do note the publisher—Alibaba presumably would rather their own tools and models instead of licensing.

They do open source a fair bit of internal tooling, so it’s always interesting to see their approach

by eyeris

6/5/2026 at 5:51:47 AM

It can be used outside of local machine.

We built something similar, it looks for new PRs where the bot is added and does reviews. Makes the code more tuned toward similar rules. I can't assume that a developer run a code review tool himself (just as I don't assume he/she run a build - so we run builds also).

It is just another perspective for code review, besides human. Unfortunately it uses a lot of tokens, and considering that Anthropic, OpenAI and Github Copilot all moved to token based pricing, it is quite a money burner.

by krzyk

6/5/2026 at 2:09:36 AM

We'd need a benchmark to tell.

by esafak

6/5/2026 at 1:35:58 AM

I'm interested in trying this.

We have our own internal automated review which has shown positive results, but I would love to drop it if I find something better.

Code review is currently our bottleneck, so any possibility of better automating it is welcome.

by singingtoday

6/5/2026 at 9:41:10 AM

Thermonuclear suggested by someone below is good. Matt Poccock did a demo/breakdown of that: https://www.youtube.com/watch?v=mh5XZ-L5SFQ. He has his own "improve-codebase-architecture" skill: https://github.com/mattpocock/skills/blob/main/skills/engine...

Some of them are about general coding guidelines and code quality, not necessarily vetting your current PR against specs! There's AbsolutelySkilled with clean-code and clean-architecture. Linking to older version of repo because they seem to be no longer on trunk: https://github.com/AbsolutelySkilled/AbsolutelySkilled/tree/...

I've been creating some rules to help with my Java coding: https://github.com/bitkentech/shipsmooth/tree/main/skills/ex.... These are assembled into a SKILL file when this skill file template is built: https://github.com/bitkentech/shipsmooth/blob/main/skills/ex...

by pramodbiligiri

6/5/2026 at 2:35:37 AM

I've been liking this code review skill lately, it has pointed out some good improvements. https://github.com/cursor/plugins/blob/main/cursor-team-kit/...

by sergeym

6/5/2026 at 2:14:47 AM

[flagged]

by Supermancho

6/5/2026 at 1:58:11 AM

At a kill s@@s hackathon at work, I was able to build something that

uses a node image installs claude code runs a /review-like command puts inline comments to PR deletes old comments when rerunning

OCR seems cool, but overkill, and I'm definitely not using Code Rabbit after their CEO was on here acting snobbish a while back.

Point being AI code review in Git** itself isn't hard to do and can add a lot of value quickly.

by elpakal

6/5/2026 at 3:27:00 AM

Nothing against coderabbit or SaaS specifically, but this was one of the reasons I stopped using it https://kudelskisecurity.com/research/how-we-exploited-coder...

It's very easy to build a basic code review tool. It's hard to build one that developers won't ask you to turn off because of false positives (or one that will miss your next escaped bug)

I think if all the tool does is run a claude code level /review skill (which all developers should definitely run before they even open a PR) then isn't this a bit of a review theater? Just a guardrail to those developers who don't run a /review-triage-fix skill in /loop before they take the PR out of draft?

I wonder how many PRs in the world got to production where several developers commented on each other's code, and none of them read anything, just used their gh cli / MCP to post / answer comments / fix issues on their behalf.

There is going to be an exponential growth of code generated, and you can't escape AI code review, but also there is no real difference between having Claude Code write the code and review itself locally, vs communicating with itself via a slow and downtime prone medium of "PR comments"

tl;dr - without any human in the loop reviewing the AI code review, or skimming to see what the AI code review missed, there is no real reason to use a "code review" you can just run it as part of the CI/CD and hope AI won't miss anything (according to my linkedin feed, there are people out there who really thing this way...)

by eranation

6/5/2026 at 5:41:58 AM

I think that in most cases you either agree on a PR comment or you don't. But it has to leave a mark in PR. This is how we do reviews, ignoring PR comment is one of the worst offenses one can make. I don't let it go.

by krzyk

6/5/2026 at 4:17:40 AM

Yes! Where it gets really interesting is the scenario in which every developer has their own unique review skill/workflow, so the reviews end up being different than you running it yourself, but nobody is reading them still.

by s900mhz

6/5/2026 at 3:08:54 AM

How snobbish was the CEO acting?

by gardnr

6/5/2026 at 7:36:51 AM

Rule files are in https://github.com/alibaba/open-code-review/tree/main/intern... (in Chinese)

by hrpnk

6/5/2026 at 8:14:23 AM

An English rendering of the Java.md (Google Translate): https://github-com.translate.goog/alibaba/open-code-review/b...

by pramodbiligiri

6/5/2026 at 9:17:29 AM

And for comparison, here's a GitHub gist with three versions, first the original Chinese one, then the Google Translate version you put and finally a translated done with ChatGPT Pro: https://gist.github.com/embedding-shapes/7a51d565214bd676890...

Done that way mainly to see how the Google Translate version compared with a ChatGPT translation (revision: https://gist.github.com/embedding-shapes/7a51d565214bd676890...)

by embedding-shape

6/5/2026 at 8:01:39 AM

I like the pattern of making a dedicated cli/harness and just build a skill to teach coding agents to use it.

At $work we built a thorough workflow to do security reviews, which is a pure skill to simplify adoption https://www.synthesia.io/post/automating-code-security-revie...

But the user experience is tricky because if we aim for very low false positives the run time for this kind of workflows is too long, it's then hard to justify blocking PRs.

by gbrindisi

6/5/2026 at 4:49:42 AM

> After installation, the ocr command is available globally.

Wish they chose a different acronym...

by weird-eye-issue

6/5/2026 at 4:54:58 AM

[dead]

by altmanaltman

6/5/2026 at 10:01:38 AM

A repo with the English translation of each of the rules files, using Google Translate: https://github.com/pramodbiligiri/open-code-review-rules.

The original rules files (in Chinese): https://github.com/alibaba/open-code-review/tree/main/intern...

by pramodbiligiri

6/5/2026 at 10:26:01 AM

I guess since you copy-pasted your comment here, yet didn't include a more proper and correct translation (again), here is my other comment again:

Done that way mainly to see how the Google Translate version compared with a ChatGPT translation (revision: https://gist.github.com/embedding-shapes/7a51d565214bd676890...)

by embedding-shape

6/5/2026 at 6:59:56 AM

i did something like this, but somewhat in reverse. you are the one that reviews the code and you instruct AI what to do through code review comments: https://parley.cloudflavor.io.

thinking about it, it would be funny to first run alibaba's tool and then run parley after.

posted it here a few days ago: https://news.ycombinator.com/item?id=48369782 i guess with AI there are too many Show HN now, and i never got any type of feedback.

by pi-victor

6/5/2026 at 7:17:57 AM

Just a small note, the font on your site is very annoying to read, the characters are not aligned horizontally (Windows w Chrome). Looks to be a scaling issue, if I zoom to 200% it shows fine.

by viblo

6/5/2026 at 8:51:41 AM

ah, sorry about that - will try to see what is going on. thanks for letting me know!

by pi-victor

6/5/2026 at 1:47:13 AM

We've been using Coderabbit, great deal ($30/mo/dev flat) and finds a lot.

I also built a skill I call `/meta-review` that asks Codex, Cursor, and Gemini to review the code (I use Claude Code). It always finds little things claude & I missed.

Coderabbit just came out with their own PR review UI that's great for big PRs, it groups files together etc. https://www.coderabbit.ai/blog/introducing-atlas-the-first-a...

by atestu

6/5/2026 at 2:02:00 AM

Is it actually flat fee? I loved Cursor bugbot which was flat fee but they moved to per-run and that killed it for me, but a lot of others are doing the same.

by causal

6/5/2026 at 2:51:00 AM

Yes! They just have a rate limit but we never run into it (we’re just 3 people though).

Yea I liked bugbot too but it became pretty pricey.

by atestu

6/5/2026 at 3:08:27 AM

Not sure why you got downvoted, and I have nothing against CodeRabbit, but this comment feels a bit like a paid ad :)

How do you see CodeRabbit against other AI code review solutions? E.g. cubic.dev, Qodo, Graphite, Greptile, Baz, Augment Code...

An alternative UI to GitHub is well overdue. But once someone will get it right, everyone will copy them...

by eranation

6/5/2026 at 10:36:55 AM

That’s exactly why I’m getting downvoted, and I get it tbh. I knew people would ask for recommendations. It’s ok.

I haven’t used any of the tools you mentioned. We started using coderabbit just this year. The new PR review UI just came out. It’s made for big AI reviews which internally we’re trying to rein in. I like the direction they’re moving in with that, it uses AI to help you rather than bypass you. So you have the automated review that catches a bunch, and then they have a tool for you to step in and do your own review faster.

It’s funny there’s someone replying to me saying coderabbit is the best they’ve ever seen and in another thread someone else says it’s the worst. If that’s not AI for you… you just gotta try it

by atestu

6/5/2026 at 1:01:22 PM

Tried it a while back and my team asked me to remove it. But maybe they improved since then.

by eranation

6/5/2026 at 4:34:34 AM

I've tried many AI code review tools. Nothing comes close to the depth of CodeRabbit reviews. It's the only such tool that can find real logical bugs. I'd love to be able to get Claude Code to do similar quality of review, but I can't get it right, no matter how I try.

by lukaslalinsky

6/5/2026 at 12:21:17 PM

Even if this was true, it’s hard to believe, and written a bit like an ad. Eg no vendor would get me to write a comment like this. All the more so, I tried CR and my team asked me to remove it. Maybe they got better but it’s a bit weird to me considering they only charge $20 and Claude Code say they estimate the same cost for a single PR for their competing product.

by eranation

6/5/2026 at 1:52:49 PM

Well, I'm just an extremely happy user. I've honestly tried to find an alternative, and couldn't. I'm using it in the context of solo developer and it provides a huge value to me.

by lukaslalinsky

6/5/2026 at 1:21:19 PM

how does it compare to the red hat ai code review?

https://gitlab.com/redhat/edge/ci-cd/ai-code-review

Has anyone experience with that one?

by Luker88

6/5/2026 at 7:39:58 AM

Is not working with gpt5.x models (Unsupported parameter: 'max_tokens' is not supported with this model. Use 'max_completion_tokens' instead.) which is hardcoded. I dont know why this is on the front page. My review-with-codex skill is working just fine, consuming my usage and not API tokens.

by sfortis

6/5/2026 at 11:34:37 AM

I'm sorry — we didn't expect to receive so much attention from the developer community so soon after open-sourcing the project. Some parts of the codebase are not yet fully polished. We are currently refactoring the LLM module and will address this as soon as possible. Once again, I sincerely apologize for the inconvenience.

by lizhengfeng101

6/5/2026 at 12:22:21 PM

Managed to get it to work with those actually. (Had to change the code of course)

by eranation

6/5/2026 at 10:41:23 AM

Did it review the landing page for it? Because it looks broken on iOS.

by hanspagel

6/5/2026 at 2:02:59 AM

I recently moved off Cursor's BugBot because it's no longer a flat $40, and I feel a little lost trying to find a viable alternative because there are so many and the pricing kind of sucks for all of them. Curious if anyone has a recommendation.

by causal

6/5/2026 at 2:26:54 AM

My team tried coderabbit and qodo and they are both trash compared to a tool we quickly built in-house that is more or less a thin wrapper around claude/codex, along with per-repo skills. PR review is triggered by webhooks from github to the review tool's web app. The tool shared by OP from alibaba certainly does some things ours does not and appears more sophisticated, but we have never had the problems they mention.

"The agent can read full file contents, search the codebase, inspect other changed files for context, and produce deep reviews — not just surface-level diff feedback." our tool does all this too. It catches dumb typos as well as more complicated bugs. Not to mention it is great as a ratchet (https://qntm.org/ratchet). It is not a substitute for reviews from other engineers though, since obviously it does nothing to achieve one of the main goals of code review, which is to socialize knowledge of the codebase.

Alibaba's work here is almost certainly more advanced than what we've done, but ours has been perfectly satisfactory and better than the paid offerings we've tried. I think most teams should not be paying SaaS fees for AI code review, that is the kind of business that mostly should not exist any more.

by lukeasrodgers

6/5/2026 at 8:07:52 AM

In which areas do you feel like the mentioned are bad? Do they find less and your own solution has more success?

If the latter, do you know why?

by mrklol

6/5/2026 at 8:18:01 AM

gitar.ai is flat with no limits

by kageiit

6/5/2026 at 8:48:43 AM

this is a great tool, until you try reading the rule files, I had find a translator to make heads of it. given that it is CLI tool is great dev the tinker with it at no additional cost.

by nutifafa

6/5/2026 at 3:01:10 AM

I wonder how they do against this benchmark (not that I vetted this benchmark... but still interesting to know...)

https://codereview.withmartian.com

by eranation

6/5/2026 at 10:41:26 AM

Not to be confused with Opencode the harness

by singiamtel

6/5/2026 at 2:04:54 PM

[flagged]

by panavm

6/5/2026 at 9:01:15 AM

[flagged]

by songting591

6/5/2026 at 5:34:45 AM

[flagged]

by Aegis_01

6/5/2026 at 6:07:05 AM

[flagged]

by AashmanShukla

6/5/2026 at 12:31:14 PM

[flagged]

by eddysir

6/5/2026 at 2:13:03 AM

[flagged]

by xuanlin314

6/5/2026 at 4:32:15 AM

Thank you all for the interest in Open Code Review!

This project was incubated from an AI code review tool that has been widely used by developers inside Alibaba at scale. The reason we decided to open-source it is simple — we noticed that many developers in the community are either paying for similar tools or using skills to perform AI code reviews.

As someone who has done deep research in this space, I think skills are actually a great approach, and running them as sub-agents is an elegant way to reduce context pollution. That said, skills do come with inherent limitations from general-purpose agents — they can be hard to debug, hard to evaluate, and difficult to tune. That's why we rewrote our internal tool in Go as a CLI and open-sourced it. Our goal is simple: free, token-efficient, and better results — while being easy to integrate into agent frameworks like Claude Code and Codex.

Our Design Philosophy: Deterministic Engineering × Agent Hybrid We believe the best code review system combines the reliability of engineering with the flexibility of AI.

Deterministic Engineering — for hard constraints

We use engineering logic (not LLMs) to handle the parts of code review that simply cannot go wrong:

Precise file filtering — Clearly defines which files need review and which should be excluded, ensuring no critical change is ever missed. Intelligent file bundling — Groups related files into the same review unit (e.g., message_en.properties and message_zh.properties are packed together). Each bundle is handled as an independent sub-agent with isolated context — this divide-and-conquer strategy performs exceptionally well on large changesets and naturally supports concurrent review. Fine-grained rule matching — Matches review rules based on file characteristics, keeping the model's attention focused and eliminating information noise from the start. Compared to pure LLM-driven rule guidance, template-engine-based rule matching produces more stable and predictable behavior. Standalone location & reflection components — Independent comment localization and comment reflection modules systematically improve both the positional accuracy and content quality of AI feedback. Agent — for dynamic decision making

We let the Agent shine where it truly excels — dynamic reasoning and context retrieval:

Scenario-optimized prompts — Deeply tuned prompt templates for code review scenarios, improving output quality while significantly reducing token consumption. Curated scenario-specific toolset — Based on in-depth analysis of tool call traces from large-scale production data — including call frequency distribution, repeated invocation rates per tool, and the impact of adding new tools on overall call chains — we carefully selected and restructured the general-purpose agent toolset into a specialized toolkit that is more stable and predictable in code review scenarios. Due to some internal dependencies and compliance requirements, a few features haven't been released publicly yet. But I believe as more external developers show interest in this tool, we'll accelerate the alignment between our internal and external versions.

Finally, a huge thank you to everyone following this project. We want it to keep getting better, and we hope to see more free, high-quality tools like this emerge from the community.

by lizhengfeng101

6/5/2026 at 1:36:53 PM

[dead]

by shine320

6/5/2026 at 12:22:08 PM

[flagged]

by aos_architect

6/5/2026 at 2:40:29 PM

[dead]

by jimmysongio