2/11/2026 at 11:47:09 PM
This looks like a nice rundown of how to do this with Python's zstd module. But I'm skeptical of using compressors directly for ML/AI/etc. (yes, compression and intelligence are very closely related, but practical compressors and practical classifiers have different goals and different practical constraints).
Back in 2023, I wrote two blog posts [0,1] that refuted the results in the 2023 paper referenced here (bad implementation and bad data).
by ks2048
2/12/2026 at 2:03:24 AM
Concur. Zstandard is a good compressor, but it's not magical; comparing the compressed size of Zstd(A+B) to the combined size of Zstd(A) + Zstd(B) is effectively just a complicated way of measuring how many words and phrases the two documents have in common. That isn't entirely ineffective at judging whether they're about the same topic, but it's an unnecessarily complex and easily confused way of doing so.
by duskwuff
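For concreteness, here is a minimal sketch of the comparison described above. It assumes the third-party zstandard package; the article under discussion may use a different module, so treat the import and API as assumptions.

    import zstandard as zstd

    def csize(data: bytes, level: int = 3) -> int:
        # Compressed size of `data` in bytes.
        return len(zstd.ZstdCompressor(level=level).compress(data))

    def compression_overlap(a: str, b: str) -> float:
        # How much smaller A+B compresses than A and B compressed separately.
        # Near 0: little shared structure; larger values: the compressor
        # reused substrings of A while encoding B.
        a_bytes, b_bytes = a.encode(), b.encode()
        separate = csize(a_bytes) + csize(b_bytes)
        together = csize(a_bytes + b_bytes)
        return (separate - together) / separate

This is essentially the quantity being compared above; normalized compression distance variants differ mainly in how the sizes are normalized.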
2/12/2026 at 11:50:23 AM
If I'm reading this right, you're saying it's functionally equivalent to measuring the intersection of n-grams? That sounds very testable.
by andai
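One rough way to run that test, as a sketch: compute a character n-gram Jaccard overlap alongside the compression-based score (the compression_overlap helper from the earlier snippet) on the same document pairs. The choice of n and of document pairs is arbitrary here.

    def ngram_set(text: str, n: int = 4) -> set:
        # Character n-grams present in the text.
        return {text[i:i + n] for i in range(len(text) - n + 1)}

    def jaccard(a: str, b: str, n: int = 4) -> float:
        # Jaccard similarity of the two documents' n-gram sets.
        sa, sb = ngram_set(a, n), ngram_set(b, n)
        return len(sa & sb) / len(sa | sb) if (sa | sb) else 0.0

Computing both jaccard and compression_overlap over many pairs and checking the rank correlation (e.g. Spearman) would show how close the two measures really are.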
2/12/2026 at 7:32:46 PM
Mostly. There are also confounding effects from factors like the length of the texts - e.g., when compressing Zstd(A+B), it's more expensive to encode a backreference from B to some content in A when the distance to that content is longer, so longer texts will appear less similar to each other than short texts.
by duskwuff
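One way to probe that length effect (a sketch; the filler sizes and compression level are arbitrary choices): insert incompressible filler between A and B and watch the joint size grow even though neither text changes.

    import os
    import zstandard as zstd

    def joint_size(a: bytes, b: bytes, gap: int, level: int = 3) -> int:
        # Compressed size of A + <gap random bytes> + B, minus the roughly
        # `gap` bytes the incompressible filler itself costs, leaving the
        # cost of encoding A and B at that separation.
        filler = os.urandom(gap)
        blob = zstd.ZstdCompressor(level=level).compress(a + filler + b)
        return len(blob) - gap

As gap grows, backreferences from B into A need longer offsets and eventually fall out of the match window entirely, so the apparent similarity decays with distance.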
2/12/2026 at 7:30:22 AM
I don't know the inner details of Zstandard, but I would expect it to at least do suffix/prefix statistics or word-fragment statistics, not just whole words and phrases.
by srean
2/12/2026 at 10:59:01 AM
The thing is that two English texts on completely different topics will compress better together than, say, an English and a Spanish text on exactly the same topic. So compression really only looks at the form/shape of text, not its meaning.
by Jaxan
2/12/2026 at 11:53:06 AM
Yes, of course; I don't think anyone will disagree with that. My comment had nothing to do with meaning but was about the mechanics of compression. That said, lexical and syntactic patterns are often enough for classification and clustering in a scenario where the meaning-to-lexicon mapping is fixed.
The reason compression-based classifiers trail a little behind classifiers built from first principles, even in this fixed-mapping case, is a little subtle.
Optimal compression requires correct probability estimation, and correct probability estimation yields the optimal classifier. In other words, optimal compressors - equivalently, correct probability estimators - are sufficient.
They are, however, not necessary. One can obtain the theoretically best classifier without estimating the probabilities correctly, because only the ranking of the class scores matters, not their exact values.
So in the context of classification, compressors are solving a task that is much harder than necessary (see the toy example below).
by srean
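A toy numerical illustration of that sufficiency-vs-necessity point (a sketch; the Dirichlet-sampled probabilities and the particular monotone distortion are arbitrary): any order-preserving distortion of the class probabilities leaves the argmax, and hence every decision, unchanged.

    import numpy as np

    rng = np.random.default_rng(0)
    # True class probabilities P(y|x) for 1000 points and 3 classes.
    true_probs = rng.dirichlet(np.ones(3), size=1000)
    # A badly miscalibrated estimate: a monotone distortion of each row.
    wrong_probs = true_probs ** 4
    wrong_probs /= wrong_probs.sum(axis=1, keepdims=True)

    # The probability values are far off, yet every decision agrees with
    # the Bayes-optimal one, because only the per-row ordering matters.
    assert (true_probs.argmax(axis=1) == wrong_probs.argmax(axis=1)).all()

An optimal compressor has to get the values right; a classifier only has to get the ordering right.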
2/12/2026 at 9:01:25 AM
It's not specifically aware of the syntax - it'll match any repeated substring. That just happens to usually end up meaning words and phrases in English text.
by duskwuff
2/12/2026 at 6:01:54 AM
Yup. Data compression ≠ semantic compression.
by D-Machine
2/12/2026 at 12:05:02 AM
Good on you for attempting to reproduce the results & writing it up, and for reporting the issue to the authors.
> It turns out that the classification method used in their code looked at the test label as part of the decision method and thus led to an unfair comparison to the baseline results
by shoo
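For readers who want a concrete picture of that kind of leak, here is a hypothetical sketch (not the paper's actual code) contrasting a strict kNN evaluation with one that consults the test label when scoring.

    from collections import Counter

    def strict_accuracy(neighbor_labels, true_labels):
        # Predict the majority label among the k nearest neighbors,
        # without ever looking at the true label.
        correct = 0
        for neighbors, truth in zip(neighbor_labels, true_labels):
            prediction = Counter(neighbors).most_common(1)[0][0]
            correct += (prediction == truth)
        return correct / len(true_labels)

    def leaky_accuracy(neighbor_labels, true_labels):
        # Counts a sample as correct if *any* of the k neighbors carries the
        # true label - the test label has leaked into the decision rule.
        correct = sum(truth in neighbors
                      for neighbors, truth in zip(neighbor_labels, true_labels))
        return correct / len(true_labels)

With small k and frequent ties, the leaky variant can look substantially better than any fairly evaluated baseline, which is the kind of unfair comparison the quoted passage describes.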
2/12/2026 at 1:40:24 PM
Author here. Thank you very much for the comment. I will take a look. This is a great case of Cunningham's law!
by Lemaxoxo