alt.hn

7/2/2026 at 2:19:24 PM

Show HN: CLI tool for detecting non-exact code duplication with embedding models

https://github.com/rafal-qa/slopo

by rkochanowski

7/2/2026 at 2:19:57 PM

I built Slopo to solve one specific problem: finding the similar code that is hardest for other tools, AI coding agents, and humans to detect.

It finds similar-looking code with embeddings. This detects more than just copy-paste clones or even clones with minor changes. Similar code is often not a clone to refactor, and this is a trade-off. Initial results need to be verified, but coding agents can do this quickly. Example prompts are available on https://slopo.dev

Additionally, similar code distant in the codebase is ranked higher to focus on less obvious duplication.
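A minimal sketch of what "rank distant pairs higher" could look like. This is not taken from the Slopo source: the `rank_pairs` function, the `threshold` and `distance_weight` values, and the use of path components as the distance measure are all assumptions made for illustration.

```python
import math
from pathlib import PurePosixPath


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def path_distance(p, q):
    """Number of path components that are not part of the common prefix."""
    a, b = PurePosixPath(p).parts, PurePosixPath(q).parts
    shared = 0
    for x, y in zip(a, b):
        if x != y:
            break
        shared += 1
    return (len(a) - shared) + (len(b) - shared)


def rank_pairs(chunks, threshold=0.9, distance_weight=0.02):
    """chunks: list of (path, vector). Hypothetical scoring, not Slopo's.

    Pairs above the similarity threshold get a small bonus per path
    component separating them, so far-apart duplicates sort first.
    """
    results = []
    for i in range(len(chunks)):
        for j in range(i + 1, len(chunks)):
            (p, u), (q, v) = chunks[i], chunks[j]
            sim = cosine(u, v)
            if sim >= threshold:
                score = sim + distance_weight * path_distance(p, q)
                results.append((score, sim, p, q))
    return sorted(results, reverse=True)
```

With equal similarity, a pair split across `src/` and `lib/` would sort ahead of a pair in the same directory. The raw similarity is kept in each result so the bonus never hides it.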

The results differ a lot depending on the codebase. I noticed that sometimes most of the detected duplicates are false positives, but the remaining ones are strong candidates to refactor or even bugs. Sometimes it reveals much more real duplication.

by rkochanowski

7/2/2026 at 6:11:49 PM

Correct me if I'm wrong, but looking at [1] it seems to be specifically using function definitions (I'm guessing this works with functions, methods, and lambdas (the "<unknown>" part)?) as units of repetition. If yes, that's fine, but I would seriously consider adding some settings to allow the user to control that granularity. Sometimes, the repeated code is a conditional branch within larger functions (i.e., "every else:" or "every except Ex:" looks the same). If the functions are large enough, the dissimilarity of the rest of the body would (probably?) cause such things to be missed.
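To make the granularity idea concrete, here is a rough sketch using Python's standard `ast` module that pulls out `except` handler bodies as separate units. The `except_bodies` helper is hypothetical and says nothing about how Slopo parses code.

```python
import ast


def except_bodies(source):
    """Return the source text of every `except` handler body in `source`."""
    tree = ast.parse(source)
    bodies = []
    for node in ast.walk(tree):
        if isinstance(node, ast.ExceptHandler):
            # Join the statements of the handler body into one text chunk.
            text = "\n".join(
                ast.get_source_segment(source, stmt) or "" for stmt in node.body
            )
            bodies.append(text)
    return bodies
```

Each returned string could then be embedded on its own, so identical handlers inside otherwise different functions would still be compared.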

I would also consider - perhaps as a separate pass, with scoring set differently - to analyze comments (especially docstrings in Python). If I read the code correctly, you're currently just stripping them, which is the right thing to do when looking for code duplication, but duplicated docstrings are also often a signal that something is wrong in the codebase. The "different scoring" is because we expect docstring to be structured similarly (at least more than normal code), so some tweaking would be needed.

Finally: very nice project, congrats! :)

[1] https://github.com/rafal-qa/slopo/blob/main/src/slopo/indexi...

by klibertp

7/2/2026 at 2:51:40 PM

If it did PHP I would love to run it over WordPress. What would it take to add that?

by realxrobau

7/2/2026 at 3:16:22 PM

PHP support can be easily added; I will release a new version soon.

by rkochanowski

7/2/2026 at 3:44:17 PM

Thank you

by raro11

7/2/2026 at 7:36:53 PM

The false positive rate you're describing matches what we see running similarity detection on generated text instead of code: cosine similarity alone flags a lot of same-topic pairs that aren't actually duplicates. What helped was combining the embedding score with a structural signal (AST edit distance for code, overlapping headings and citations for text) so no single metric makes the call. Also worth surfacing the raw similarity score in the CLI output instead of just a binary duplicate flag, since people will want to tune the threshold per codebase.
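A sketch of the "two signals must agree" idea for Python code. Note the substitution: instead of a true AST edit distance, it compares flattened sequences of AST node types with `difflib.SequenceMatcher`, which is only a cheap stand-in. The function names and both thresholds are made up for illustration.

```python
import ast
import difflib


def node_types(source):
    """Sequence of AST node type names; identifiers and literal values are ignored."""
    return [type(n).__name__ for n in ast.walk(ast.parse(source))]


def structural_similarity(src_a, src_b):
    """Ratio in [0, 1]; a stand-in for a real tree edit distance."""
    matcher = difflib.SequenceMatcher(None, node_types(src_a), node_types(src_b))
    return matcher.ratio()


def is_candidate(embedding_sim, src_a, src_b, min_embedding=0.85, min_structure=0.6):
    """Flag a pair only when both signals pass; always return the raw scores."""
    structure = structural_similarity(src_a, src_b)
    flagged = embedding_sim >= min_embedding and structure >= min_structure
    return flagged, {"embedding": embedding_sim, "structure": structure}
```

Returning the raw scores next to the flag is what lets a user tune the thresholds per codebase.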

by nttylock

7/2/2026 at 6:37:21 PM

Cool project, I've been meaning to do this myself at work for a codebase, and it's nice to see that this exists now.

Does the project simply compute embeddings for every function unit and cluster them, or does it also mean-pool significant dependencies of a function? In other words, given the function

    def a():
      b()
      c()
      d()
Do we also embed b, c, and d as well and combine them somehow in the embedding of a?
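For clarity, the kind of pooling being asked about could be sketched like this. It is purely illustrative: `embed_with_callees`, the `call_graph` mapping, and the 50/50 `own_weight` are assumptions, and the vectors are plain lists standing in for real model output.

```python
def mean_pool(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def embed_with_callees(name, own_vectors, call_graph, own_weight=0.5):
    """Blend a function's own embedding with the mean of its direct callees.

    own_vectors: {function name: embedding of that function's body}
    call_graph:  {function name: names of functions it calls}
    """
    own = own_vectors[name]
    callees = [own_vectors[c] for c in call_graph.get(name, []) if c in own_vectors]
    if not callees:
        return list(own)  # nothing to pool, keep the plain embedding
    pooled = mean_pool(callees)
    return [own_weight * o + (1 - own_weight) * p for o, p in zip(own, pooled)]
```

For the example above, `call_graph` would be `{"a": ["b", "c", "d"]}`, so the vector for `a` becomes a mix of its own body and the average of `b`, `c`, and `d`.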

by supriyo-biswas

7/2/2026 at 8:11:40 PM

In your example, only the single function a() is embedded. The rest is just code, and dependencies are not resolved. Have you thought about adding this feature to your tool?

by rkochanowski

7/2/2026 at 7:06:01 PM

It looks like it works only on function bodies[1]. I'm not sure I understand why you would want to look at the code of invoked callables, though. Calling the same set of helper functions is already flagged; repeated code in helpers is flagged as well when those helpers are analyzed. Do you have a specific example where you'd like a function flagged as a duplicate based on the code it calls out to?

[1] https://github.com/rafal-qa/slopo/blob/main/src/slopo/indexi...

by klibertp

7/2/2026 at 6:09:03 PM

I implemented this for a large monorepo last year. It runs as an analysis during code review and shows snippets that may be similar to the code under review. It was a very nice project. It also lets you see the most common constructs across the repo for the different languages. This could also help to check whether some code has been copied, e.g. from open source projects.

by vander_elst

7/2/2026 at 3:28:29 PM

Nice idea. I can see this being useful before refactors, especially when the duplication is semantic rather than copy paste.

by murats

7/2/2026 at 4:40:48 PM

This is neat. Have you noticed any difference in duplicate detection between strongly typed and loosely typed languages / code bases?

by philajan

7/2/2026 at 5:42:06 PM

No. It depends mostly on general code quality and architecture. Some implementations require more code similarity by design. Some languages, like Java, may tend to have more duplication, but that's only a theoretical guess. It also depends on what kind of software is developed with what language.

If you are interested in data, you can check my article. The analysis was done with a previous version of this tool, in which exact-copy duplicates were excluded from the analysis. https://rkochanowski.com/article/analysis-code-duplication/

by rkochanowski

7/2/2026 at 4:59:43 PM

What a simple and smart idea. Wonderful

by BrandiATMuhkuh

7/2/2026 at 4:36:00 PM

Very nice. I can imagine putting this into a pre push hook to keep things clean after an initial sweep.

by hdz

7/2/2026 at 6:06:23 PM

Have you considered a deterministic tier before the embedding pass? I feel that approach can be more efficient.

by rohanat

7/2/2026 at 6:26:09 PM

There are good mature tools for deterministic duplication detection and I intentionally focused on embedding-based to fill this gap (I didn't find other tools using this approach).

If by "more efficient" you mean to avoid embedding of the same code multiple times, this optimization is already implemented internally.
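As a generic illustration of that kind of optimization (not Slopo's actual internals, which this thread does not show), identical chunks can be keyed by a content hash so the embedding function runs once per distinct chunk. `EmbeddingCache`, `normalize`, and `embed_fn` are hypothetical names.

```python
import hashlib


def normalize(code):
    """Drop trailing whitespace and blank lines so trivial edits hash the same."""
    lines = [line.rstrip() for line in code.splitlines()]
    return "\n".join(line for line in lines if line)


class EmbeddingCache:
    """Call embed_fn at most once per distinct normalized chunk."""

    def __init__(self, embed_fn):
        self._embed_fn = embed_fn  # any callable: str -> vector
        self._store = {}
        self.misses = 0            # number of real embedding calls made

    def get(self, code):
        key = hashlib.sha256(normalize(code).encode("utf-8")).hexdigest()
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._embed_fn(code)
        return self._store[key]
```

Chunks that share a key are exact duplicates after normalization, so the same hash also gives a deterministic first tier for free.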

by rkochanowski

7/2/2026 at 6:42:32 PM

We did this using ASTs. You can go quite far without embeddings, and the result is easier to debug and to follow.

by vander_elst

7/2/2026 at 7:29:11 PM

[flagged]

by danielsmori

7/2/2026 at 3:26:44 PM

This is a great use case for embeddings. Code deduplication across distant modules is notoriously hard for traditional AST-based tools.

How do you handle chunking and parsing for different languages to make sure the embeddings capture semantic meaning effectively? For instance, do you chunk by functions/classes, or use a fixed token window? If a function is too long or too short, it can drastically skew the embedding similarity.

by NYCHMPAI

7/2/2026 at 6:51:04 PM

Generally, I chunk by function/method (not by whole class), but different languages have specific concepts and features. Nested code units, anonymous functions, lambdas, closures are extracted as separate chunks.

The chunk size has an allowed range, and chunks outside it are simply ignored.

- Upper limit is hardcoded with a body size of 10k chars

- Lower limit is configurable with a default of 10 AST nodes inside the body

The chunking strategy is something that can be improved in future versions.
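The two limits above could be sketched for Python input with the standard `ast` module. This is an approximation under stated assumptions: how Slopo actually parses code and counts nodes is not shown in this thread, and `extract_chunks` and `body_node_count` are hypothetical names.

```python
import ast

MAX_BODY_CHARS = 10_000  # upper limit mentioned above
MIN_BODY_NODES = 10      # default lower limit mentioned above


def body_node_count(func):
    """Count AST nodes inside the function body, excluding the def itself."""
    return sum(1 for stmt in func.body for _ in ast.walk(stmt))


def extract_chunks(source):
    """Return (name, body_source) for every function within the size limits."""
    tree = ast.parse(source)
    chunks = []
    # ast.walk also visits nested functions, so they become separate chunks.
    for node in ast.walk(tree):
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        body_src = "\n".join(
            ast.get_source_segment(source, stmt) or "" for stmt in node.body
        )
        if len(body_src) > MAX_BODY_CHARS:
            continue  # too long: ignored
        if body_node_count(node) < MIN_BODY_NODES:
            continue  # too short: ignored
        chunks.append((node.name, body_src))
    return chunks
```

Note that Python's `ast.walk` also counts context nodes such as `Load` and `Store`, so the same threshold would not mean the same thing under a different parser.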

by rkochanowski

7/2/2026 at 4:46:11 PM

I think that this is pretty cool, but is there any reason why we would want to remove similar/possible duplicate code?

by SpyCoder77

7/2/2026 at 6:03:37 PM

Recently there was a popular article on HN saying that sometimes code duplication is better than abstraction, so I assume that this question is not a joke.

While testing this tool, I found one detected duplication that makes an interesting use case. Permission check logic was duplicated in distant places in the codebase. The code was similar but not identical, and the logic was not the same: one version had stricter checks. I analyzed this with the coding agent, and we found out that both versions are used for the same thing, which means that in some cases validation is insufficient. With a single validation point, this bug could have been prevented or easily detected.

by rkochanowski

7/2/2026 at 5:39:50 PM

(without sarcasm) Is this a serious question?

If so - maintainability, testability. This is old software engineering best practice at this point.

You shouldn’t hyper optimize for deduplication, but it’s usually worth considering. Fewer places to fix issues or improve as well.

by rufius

7/2/2026 at 6:02:10 PM

I tend to follow the "rule of 3": a second similar implementation is OK, introducing the third triggers a refactor. As with everything, this isn't dogma, and sometimes the second implementation is already too much, while at other times you get tens of similar code sections (in codegen, repeating patterns with almost no changes is a virtue). But it's a good rule of thumb.

On testability: two implementations can be tested against each other, leading to greater coverage with less test code. It doesn't work that way for 3+ implementations, which is another reason not to have that many.
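A small sketch of testing two implementations against each other (differential testing). The `clamp_*` functions are toy examples invented for illustration, and `find_disagreement` is a hypothetical helper.

```python
import random


def clamp_a(x):
    """First implementation: clamp x into [0, 100]."""
    return max(0, min(100, x))


def clamp_b(x):
    """Second implementation, written independently."""
    if x < 0:
        return 0
    if x > 100:
        return 100
    return x


def find_disagreement(impl_a, impl_b, make_input, trials=1000, seed=0):
    """Return the first random input where the two disagree, else None."""
    rng = random.Random(seed)  # seeded so a failure is reproducible
    for _ in range(trials):
        value = make_input(rng)
        if impl_a(value) != impl_b(value):
            return value
    return None
```

Calling `find_disagreement(clamp_a, clamp_b, lambda rng: rng.randint(-500, 500))` needs no hand-written expected values, because each implementation acts as the oracle for the other.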

by klibertp

7/2/2026 at 5:36:06 PM

Have you written software before?

by Zopieux