2/1/2026 at 11:12:41 PM
I'm genuinely surprised that there isn't column-level shared-dictionary string compression built into SQLite, MySQL/MariaDB or Postgres, like this post is describing.
SQLite has no compression support, MySQL/MariaDB have page-level compression which doesn't work great and which I've never seen anyone enable in production, and Postgres has per-value compression which is good for extremely long strings, but useless for short ones.
There are just so many string columns where values and substrings get repeated so much, whether you're storing names, URLs, or just regular text. And I have databases I know would be reduced in size by at least half.
Is it just really really hard to maintain a shared dictionary when constantly adding and deleting values? Is there just no established reference algorithm for it?
It still seems like it would be worth it even if it were something you had to manually set. E.g. wait until your table has 100,000 values, build a dictionary from those, and the dictionary is set in stone and used for the next 10,000,000 rows too unless you rebuild it in the future (which would be an expensive operation).
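A minimal sketch of that "build once, then freeze" idea, using the preset-dictionary support in Python's standard `zlib` module. The helper names (`build_dictionary`, `compress_value`, `decompress_value`) and the way the dictionary is assembled are illustrative assumptions, not how any of the databases above work, and zlib's dictionary is capped at its 32 KiB window.

```python
import zlib
from collections import Counter


def build_dictionary(samples: list[bytes], max_size: int = 32 * 1024) -> bytes:
    """Naive preset dictionary built from a sample of existing values.

    zlib favours material near the END of the dictionary, so the most
    frequent sample values are placed last. (Hypothetical helper.)
    """
    counts = Counter(samples)
    ordered = sorted(counts, key=lambda s: counts[s])  # rarest first
    return b"".join(ordered)[-max_size:]


def compress_value(value: bytes, zdict: bytes) -> bytes:
    # Each value is compressed on its own, but against the shared dictionary.
    c = zlib.compressobj(level=9, zdict=zdict)
    return c.compress(value) + c.flush()


def decompress_value(blob: bytes, zdict: bytes) -> bytes:
    # The exact same dictionary bytes must be supplied to decompress.
    d = zlib.decompressobj(zdict=zdict)
    return d.decompress(blob) + d.flush()


# Usage: train on the first rows, then reuse the frozen dictionary.
sample_rows = [b"https://example.com/users/%d/profile" % i for i in range(1000)]
frozen = build_dictionary(sample_rows)
new_row = b"https://example.com/users/98765/profile"
packed = compress_value(new_row, frozen)
print(len(new_row), len(packed), len(zlib.compress(new_row, 9)))
```

The dictionary has to be stored alongside the table and never changed, since every previously written value depends on it; rebuilding it means rewriting every row.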
by crazygringo
2/2/2026 at 11:09:26 AM
> Is it just really really hard to maintain a shared dictionary when constantly adding and deleting values? Is there just no established reference algorithm for it?
Enums? Foreign key to a table with (id bigint generated always as identity, text text)?
> I have databases I know would be reduced in size by at least half.
Most people don't employ these strategies because storage is cheap and compute time is expensive.
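For reference, the lookup-table approach could look roughly like this. It is a sketch using Python's built-in `sqlite3` module; the `country` and `customer` tables and the helper functions are made-up examples, and SQLite's `INTEGER PRIMARY KEY` stands in for the Postgres identity column.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create an in-memory database with a lookup table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE country (
            id   INTEGER PRIMARY KEY,
            name TEXT NOT NULL UNIQUE
        );
        CREATE TABLE customer (
            id         INTEGER PRIMARY KEY,
            full_name  TEXT NOT NULL,
            country_id INTEGER NOT NULL REFERENCES country(id)
        );
        """
    )
    return conn


def intern_country(conn: sqlite3.Connection, name: str) -> int:
    """Return the id for a string, inserting it on first sight."""
    conn.execute("INSERT OR IGNORE INTO country (name) VALUES (?)", (name,))
    row = conn.execute("SELECT id FROM country WHERE name = ?", (name,)).fetchone()
    return row[0]


def add_customer(conn: sqlite3.Connection, full_name: str, country: str) -> None:
    # The repeated string is stored once; each row only carries a small id.
    conn.execute(
        "INSERT INTO customer (full_name, country_id) VALUES (?, ?)",
        (full_name, intern_country(conn, country)),
    )


conn = make_db()
for person, country in [("Ada", "United Kingdom"), ("Grace", "United States"),
                        ("Alan", "United Kingdom")]:
    add_customer(conn, person, country)
```

This deduplicates whole values only; it does nothing for repeated substrings, which is the part a shared compression dictionary would add.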
by solatic
2/2/2026 at 4:03:39 PM
The current prices for SSDs and DRAM may change the strategies.
In a mini-PC that I have assembled recently, with a medium-priced CPU (an Arrow Lake H Core Ultra 5) the cost of storage has been 60% of the total cost of the computer, and it was only 60% because I have decided to buy much less memory than I would have bought last summer (i.e. I bought 32 GB DRAM + 3 TB SSDs, while I wanted double amounts, but then the price would have become unacceptable).
Moreover, I bought the mini-PC and memory in Europe, but the exact same computer model from ASUS, with the same memory, is priced about 35% higher in the USA, on Newegg, than in Europe. A similar ratio holds for other products, so US customers are likely to be even more affected by this great increase in the cost of storage.
by adrian_b
2/3/2026 at 2:23:02 AM
> BIGINT
I’d use a SMALLINT (or in MySQL, a TINYINT UNSIGNED) for a lookup table. The bytes add up in referencing tables.
> Most people don't employ these strategies because storage is cheap and compute time is expensive.
Memory isn’t cheap. If half of your table is low-cardinality strings, you’re severely reducing the rows per page, causing more misses, slowing all queries.
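A back-of-the-envelope version of the rows-per-page point, as a small Python calculation. The row sizes are invented for illustration, the 8 KiB page matches Postgres' default block size, and page and tuple headers are ignored.

```python
PAGE_SIZE = 8192  # Postgres' default block size; other engines differ


def rows_per_page(row_bytes: int, page_size: int = PAGE_SIZE) -> int:
    """Whole rows that fit in one page, ignoring page and tuple headers."""
    return page_size // row_bytes


# Assumed row layout: 40 bytes of other columns, plus either
# 40 bytes of inline low-cardinality strings or a 2-byte SMALLINT id.
OTHER_COLUMNS = 40
INLINE_STRINGS = 40
SMALLINT_ID = 2

inline = rows_per_page(OTHER_COLUMNS + INLINE_STRINGS)
normalized = rows_per_page(OTHER_COLUMNS + SMALLINT_ID)
print(f"inline strings: {inline} rows/page, lookup id: {normalized} rows/page")
```

Under these assumptions the same buffer pool holds nearly twice as many rows once the strings are replaced by ids.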
by sgarland
2/2/2026 at 12:28:59 AM
compression is not free, dictionary compression:
1, complicates and slows down update, which is typically more important in OLTP than OLAP
2, is generally bad for high cardinality columns, which requires tracking cardinality to make decisions, which further complicates things.
lastly, additional operational complexity (like the table maintenance system you described in last paragraph) could reduce system reliability, and they might decide it's not worth the price or against their philosophy.
by analyst74
2/2/2026 at 2:06:04 AM
Strings in textual index are already compressed, with common prefix compression or other schemes. They are perfectly queryable. Not sure if their compression scheme is for index or data columns.
Global column dictionary has more complexity than normal. Now you are touching more pages than just the index pages and data page. The dictionary entries are sorted, so you need to worry about page expansion and contraction. They sidestep the problems by making it immutable, presumably building it up front by scanning all the data.
Not sure why using FSST is better than using a standard compression algorithm to compress the dictionary entries.
Storing the strings themselves as dictionary IDs is a good idea, as they can be processed quickly with SIMD.
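To make the dictionary-ID idea concrete, here is a small sketch of an immutable, sorted column dictionary in Python. The function names are hypothetical, and a plain loop stands in for the SIMD comparison a real engine would run over the packed integer array.

```python
from array import array
from bisect import bisect_left


def encode_column(values: list[str]) -> tuple[list[str], array]:
    """Build a sorted dictionary and replace each string with its position."""
    dictionary = sorted(set(values))
    ids = array("I", (bisect_left(dictionary, v) for v in values))
    return dictionary, ids


def filter_equals(dictionary: list[str], ids: array, needle: str) -> list[int]:
    """Row numbers where the column equals `needle`.

    One binary search on the dictionary, then only integer comparisons
    over the id array; no string is touched during the scan.
    """
    pos = bisect_left(dictionary, needle)
    if pos == len(dictionary) or dictionary[pos] != needle:
        return []  # value is not in the dictionary, so no row can match
    return [row for row, code in enumerate(ids) if code == pos]


def decode_row(dictionary: list[str], ids: array, row: int) -> str:
    return dictionary[ids[row]]


column = ["paris", "berlin", "paris", "rome", "berlin", "paris"]
dictionary, ids = encode_column(column)
print(dictionary, list(ids), filter_equals(dictionary, ids, "paris"))
```

Because the dictionary is sorted, id order matches string order, so range predicates can also be evaluated on the ids alone. Inserting a new string in the middle would renumber later ids, which is one reason to keep the dictionary immutable.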
by ww520
2/2/2026 at 6:51:00 AM
> Not sure why using FSST is better than using a standard compression algorithm to compress the dictionary entries.
I believe the reason is that FSST allows access to individual strings in the compressed corpus, which is required for fast random access. This is more important for OLTP than OLAP, I assume. More standard compression algorithms, such as zstd, might decompress very fast, but I don't think they allow that.
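A toy illustration of why a static symbol table permits random access. This is not the real FSST algorithm, which learns its table from the data; here the table is hand-picked, and `encode`/`decode` are made-up helpers. Each string is encoded independently into one-byte codes, with an escape byte for literals, so any single string can be decoded without touching its neighbours.

```python
ESCAPE = 255  # marks "the next byte is a literal"; codes 0..254 name symbols


def encode(s: bytes, symbols: list[bytes]) -> bytes:
    """Greedy longest-match encoding against a fixed symbol table.

    Assumes at most 255 non-empty symbols.
    """
    by_length = sorted(range(len(symbols)), key=lambda i: -len(symbols[i]))
    out = bytearray()
    pos = 0
    while pos < len(s):
        for code in by_length:
            if s.startswith(symbols[code], pos):
                out.append(code)
                pos += len(symbols[code])
                break
        else:
            # No symbol matches here: emit the byte as an escaped literal.
            out += bytes((ESCAPE, s[pos]))
            pos += 1
    return bytes(out)


def decode(blob: bytes, symbols: list[bytes]) -> bytes:
    out = bytearray()
    i = 0
    while i < len(blob):
        if blob[i] == ESCAPE:
            out.append(blob[i + 1])
            i += 2
        else:
            out += symbols[blob[i]]
            i += 1
    return bytes(out)


# Hand-picked table for illustration only.
symbols = [b"https://", b"example", b".com/", b"user"]
corpus = [b"https://example.com/user/42", b"https://example.com/about"]
packed = [encode(s, symbols) for s in corpus]
# Random access: decode only the second string, ignoring the first.
print(decode(packed[1], symbols))
```

A block compressor, by contrast, generally has to decompress from the start of a block to reach a string in the middle of it.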
by randomuser47
2/2/2026 at 7:05:19 PM
I've worked on a hobby database that did something like this, but instead of "flat" dictionary compression over columns, it used a tree of compression contexts - trillions of them.
Data was compressed in interior and exterior trees, where interior trees were the data structure inside blocks (similar to B-tree block contents), and exterior trees were the structure between blocks (similar to B-tree block pointers, but it didn't use a B-tree, it was something more exotic for performance).
Each node provided compression context for its children nodes, while also being compressed itself using its parent's context.
As you can imagine, the compression contexts had to be tiny because they were everywhere. But you can compress compression contexts very well :-)
Using compression in this way removed a few special cases that are typically required. For example there was no need for separate large BLOB storage, handling of long keys, or even fixed-size integers, because they fell out naturally from the compression schema instead.
The compression algorithms I explored had some interesting emergent properties, that weren't explicitly coded, although they were expected by design. Some values encoded into close to zero bits, so for example a million rows would take less than a kilobyte, if the pattern in the data allowed. Sequence processing behaved similar to loops over fast, run-length encodings, without that being actually coded, and without any lengths in the stored representation. Fields and metadata could be added to records and ranges everywhere without taking space when not used, not even a single bit for a flag, which meant adding any number of rarely-used fields and metadata was effectively free, and it could be done near instantly to a whole database.
Snapshots were also nearly free, with deltas emerging naturally from the compression context relationships, allowing fast history and branches. Finally, it was able to bulk-delete large time ranges and branches of historical data near instantly.
The design had a lot of things going for it, including IOPS performance for random and sequential access, fast writes as well as reads, and excellent compression.
I'd like to revive the idea when I have time. I'm thinking of adding a neural network this time to see how much it might improve compression efficiency, or perhaps implementing a filesystem with it to see how it behaves under those conditions.
by jlokier
2/2/2026 at 12:25:34 AM
There are some databases that can move an entire column into the index. But that's mostly going to work for schemas where the number of distinct values is <<< rowcount, so that you're effectively interning the rows.
by hinkley
2/2/2026 at 3:51:54 AM
DuckDB can also handle SQLite files: https://duckdb.org/docs/stable/core_extensions/sqlite
by pstuart
2/2/2026 at 8:40:41 AM
In the case of SQLite you can just use ZFS and get page-level compression.
by andersmurphy
2/2/2026 at 5:28:47 AM
How do you lay out all that variable-length data in memory?
by groundzeros2015