alt.hn

4/22/2026 at 12:30:14 PM

Columnar Storage Is Normalization

https://buttondown.com/jaffray/archive/columnar-storage-is-normalization/

by ibobev

4/22/2026 at 1:11:12 PM

The normalization analogy is genuinely clever as a teaching tool, but it quietly papers over the fact that normalization is a logical design concept while columnar storage is a physical one - treating them as the same thing can mislead more than it clarifies, I think

by immanuwell

4/22/2026 at 1:40:06 PM

I've always preferred to think of normalization as more about "removing redundancy" than in the frame it is normally presented. Or, to put it another way, rather than "normalizing" which has as a benefit "removing redundancy", raise the removing of redundancy up to the primary goal which has as a side benefit "normalization".

A nice thing about that point of view is that it fits with your point; redundancy is redundancy whether you look at it with a column-based view or a row-based view.

by jerf

4/22/2026 at 5:09:42 PM

Oh man, thank you for saying this. The difference between logical and physical goes over so many people's heads. It's a little unnerving at times how much people resist it.

Definitely agree with what you said - if we treat them as the same thing that's going to mislead some folks.

by _doctor_love

4/24/2026 at 2:38:30 AM

And it's really not the same thing, because IDs are not positions. With this type of structure, a field without a value becomes a missing row in a table. Moreover, from the moment each column is isolated in its own table with a second key column, it becomes essential not to store NULL values, because the amount of data doubles for each value: a primary key is required alongside every column value. It's a 6NF model. See <https://en.wikipedia.org/wiki/Sixth_normal_form>
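A minimal sketch of that 6NF-style decomposition, using plain Python dicts as stand-in tables (the table and column names are made up for illustration): each attribute gets its own key-value table, and an unknown value is simply an absent row rather than a stored NULL.

```python
# Hypothetical row-store table with a NULL in it.
rows = [
    {"id": 1, "name": "Ada", "age": 36},
    {"id": 2, "name": "Grace", "age": None},  # age unknown
]

# 6NF-style decomposition: one (key, value) table per attribute,
# storing a row only when the value is actually present.
name_table = {r["id"]: r["name"] for r in rows if r["name"] is not None}
age_table = {r["id"]: r["age"] for r in rows if r["age"] is not None}

# The NULL age for id=2 is a missing row, not a stored NULL,
# but every stored value now carries its own copy of the key.
print(2 in age_table)   # False
print(name_table[2])    # Grace
```

This also makes the storage-doubling point concrete: each single-column table repeats the primary key next to every value it keeps.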

by LkpPo

4/22/2026 at 8:03:28 PM

Yeah I feel like papering over the physical aspect actually misses the main motivation for columnar storage in the first place, which is to more efficiently store some types of data and perform OLAP queries on it.

by zaptheimpaler

4/22/2026 at 1:25:17 PM

Fair, but one of the big benefits of normalization was the savings in storage and memory back in the day, when both were comparatively tiny.

There's always a reason for a dev to ship something shitty, but when you show you can use 80% less storage for the same operation, you can make the accountants your lever.

by hilariously

4/22/2026 at 2:56:25 PM

The purpose of normalization is not to save storage. In fact it might often require more storage, since it involves introducing a foreign-key column. Whether it saves storage or requires more really depends on the data in question.

by bazoom42

4/22/2026 at 3:23:08 PM

Fair, I said it was one of the big benefits, not the purpose. In some cases it can require more storage (though that storage is often more amenable to compression), but generally deduplicating your data doesn't increase your storage needs.

by hilariously

4/22/2026 at 4:39:06 PM

That foreign-key column saves duplicating multiple columns.

But I don't think that's among the top 5 reasons for normalization.

by zaphirplane

4/22/2026 at 2:00:09 PM

Nonsense. See Codd’s first paper.

1NF removes repeating groups, putting for example data for each month in its own row, not an array of 12 months in 1 row.

Storage efficiency was never the point. IMS had that locked down. Succinctness of expression and accuracy of results was the point. And is: normalization prevents anomalous results.
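The repeating-groups point above can be sketched in a few lines of Python (an assumed illustration, with made-up field names, not anything from Codd's paper): 1NF replaces an array of 12 monthly values embedded in one row with one row per month.

```python
# Unnormalized: a repeating group (12 monthly values) inside one row.
unnormalized = {
    "account": "A1",
    "monthly_sales": [10, 12, 9, 11, 13, 8, 7, 14, 10, 9, 12, 15],
}

# 1NF form: each month becomes its own row, keyed by (account, month).
normalized = [
    {"account": unnormalized["account"], "month": m + 1, "sales": s}
    for m, s in enumerate(unnormalized["monthly_sales"])
]

print(len(normalized))   # 12 rows instead of 1 row holding an array
print(normalized[0])
```

Note that the 1NF form stores *more* bytes (the account key repeats twelve times), which illustrates why storage efficiency was never the point.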

by jklowden

4/22/2026 at 2:31:00 PM

I think parent was saying it’s a benefit, not the original purpose. If I store a FK to a table containing my company’s corporate address, that is a tremendous savings in storage (and memory pressure), and it also eliminates update anomalies.

by sgarland

4/22/2026 at 3:38:06 PM

And other people are saying it may not always have that effect.

Normalization ultimately boils down to breaking your data down into the most elemental and simplest "facts" about them, and its greatest value is how it allows and encourages more flexible and disparate ways of looking at the same information.

by cmrdporcupine

4/22/2026 at 2:42:20 PM

Normalizing repeating groups doesn't offer significant savings when they are completely populated (e.g. each entity has the full 12 monthly values per year), but other types of normalization do. Data that depends on other non-key columns, for example, is genuinely redundant.

by HelloNurse

4/22/2026 at 6:19:47 PM

To expand on this point: unexplained "distinct" is often a code smell

by anonymars

4/22/2026 at 5:08:30 PM

Theoretically I would agree, but practically I still wonder why we need different database engines for row and columnar storage if supporting different types of indices is trivial(TM) for Postgres?

by goerch

4/22/2026 at 10:54:30 PM

There are definitely hybrid OLTP-OLAP databases, or HTAP (Hybrid Transaction and Analytical Processing). Microsoft and, I believe, Oracle both have HTAP tech.

The most novel design, I think, is CedarDB (developed by Thomas Neumann's database group at TUM), which adaptively stores both row and column versions of the data [1], where some data is permanently compressed to columnar format and hot rows are converted "just in time" to columnar data as needed.

[1] https://cedardb.com/blog/colibri/

by atombender

4/22/2026 at 5:53:47 PM

In theory, you don't. In practice, it's because the major SQL DBMS were architected around row-oriented storage and the technical effort to implement hybrid storage is large.

There are columnar storage engine extensions for many of the popular databases, though.

by gavinray

4/22/2026 at 6:09:27 PM

Interesting: I carried over the idea of a matrix being stored row- or column-wise to the database world, and assumed this was a physical rather than a theoretical feature (not a native speaker here)?

Looking forward to checking out `pg_duckdb`, yes.

by goerch

4/22/2026 at 7:15:42 PM

there are hybrid engines too, so-called HTAP (as opposed to OLAP or OLTP). notable efforts are SingleStore (commercial) and OceanBase (foss)

by gfody

4/22/2026 at 9:27:27 PM

Microsoft SQL Server can use both row and column store for tables, in various combinations such as row store for the table with a columnar index, or vice versa.

by jiggawatts

4/22/2026 at 1:37:06 PM

I always thought that the biggest benefit of normalization was deduplicating mutable values so you only need to update values in one place and everything stays nicely in sync.

Classic example being something like a “users” table that tracks account id, display name (mutable), and profile picture (mutable). And then a “posts” table that has post id, account id, and message text. This allows you to change the display name/picture in one place and it can be used across all posts
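A minimal sketch of that users/posts example (the names `users`, `posts`, and `render` are made up for illustration): the mutable display name lives in exactly one place, and every post picks up a rename through the account-id key at read time.

```python
# users: account_id -> mutable profile data, stored once.
users = {101: {"display_name": "old_name", "avatar": "a.png"}}

# posts: only reference the account by id, never copy the name.
posts = [
    {"post_id": 1, "account_id": 101, "text": "first!"},
    {"post_id": 2, "account_id": 101, "text": "second"},
]

def render(post):
    # Join at read time: look up the *current* display name.
    return f'{users[post["account_id"]]["display_name"]}: {post["text"]}'

users[101]["display_name"] = "new_name"  # one update...
print(render(posts[0]))                  # ...visible on every post
print(render(posts[1]))
```

Denormalized, a rename would mean rewriting the name on every post, and missing one is exactly the update anomaly normalization prevents.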

by parpfish

4/22/2026 at 2:28:05 PM

That is a practical benefit, absolutely. A different way of looking at it is that you’re eliminating data anomalies (generally, update anomalies).

by sgarland

4/22/2026 at 2:43:01 PM

This is usually the case when talking about normalization in the context of relational databases (2nd normal form, 3rd normal form, etc.). But normalization really just means bringing data into some standardized form.

by bazoom42

4/22/2026 at 2:11:57 PM

This is exactly domain key normal form!

https://en.wikipedia.org/wiki/Domain-key_normal_form

by remywang

4/22/2026 at 3:26:41 PM

Not exactly. It is 6th normal form.

by bazoom42

4/22/2026 at 7:55:50 PM

it's actually neither. normal forms provide semantic guarantees through structure. the structure proposed here does not constrain the data in any way and therefore cannot be said to be a normalization - tfa is dead wrong, basically.

by gfody

4/22/2026 at 2:06:12 PM

It is possible to treat it as purely relational, but it can be suboptimal for data access if you follow through with it.

The main cost is the join when you need to access several columns; it's flexible but expensive.

To take full advantage of columnar storage, that join usually has to be made implicitly through data alignment, so no actual join is performed.

For example, segment the tables in chunks of up to N records, and keep all related contiguous columns of that chunk so they can be independently accessed:

    r0, r1 ... rm; f0, f0 ... f0; f1, f1 ... f1; fn, fn ... fn

That balances pointer chasing and joining: you can avoid the I/O by loading only the needed columns from a segment, and skip the join because the data is trivially aligned.
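A sketch of that segmented layout in plain Python (an assumed illustration, roughly the PAX/Parquet row-group idea; `N`, `segments`, and the column names are made up): rows are split into chunks of up to N records, and within each chunk every column is stored as its own contiguous array, so values in different columns align by position.

```python
N = 3  # records per segment
rows = [(i, f"user{i}", i * 10) for i in range(7)]  # (id, name, score)

segments = []
for start in range(0, len(rows), N):
    chunk = rows[start:start + N]
    # Columns stored contiguously per segment (structure-of-arrays).
    segments.append({
        "id":    [r[0] for r in chunk],
        "name":  [r[1] for r in chunk],
        "score": [r[2] for r in chunk],
    })

# Read only the columns we need from one segment; no join required,
# since the k-th entry of each array belongs to the same logical row.
seg = segments[1]
print(list(zip(seg["id"], seg["score"])))  # [(3, 30), (4, 40), (5, 50)]
```

The `name` column in that segment is never touched, which is the I/O saving, and the positional alignment is what replaces the explicit join.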

by juancn

4/22/2026 at 4:27:22 PM

UPDATEs are also a challenge. Columnar storage is very efficient for mass insert/append workloads, but updating columnar data efficiently is hard.

by brightball

4/23/2026 at 3:47:20 AM

yeah updates are where it falls over for us. inserts were fine, reads were great, but any workflow that needed to correct a small slice of rows after the fact got painful fast. we ended up keeping the row store for the hot path and rebuilding the columnar copy overnight. probably not elegant but it stopped the bleeding.

by kippinsula

4/22/2026 at 1:20:21 PM

This is an interesting thought, even if it doesn't come with practical consequences. A person could argue that if you happen to encode your table in a columnar format, you very likely won't use indexes for every "value" but rather the order of that specific block. But that would mean that if you're using the data order meaningfully, you'd probably be going against the principles of table normalization. Then again, this too may be the result of excessive overthinking rather than something practical that can be used.

by Lucasoato

4/22/2026 at 4:05:47 PM

I don't fully agree with this, for large nested datasets and arrays.

Especially with arrays: what could be one line of JSON becomes, in a CSV, either a non-normalized array crammed into a single cell as a string, or an expanded array with one value per cell, multiplying the row count by $array_size.

You can normalize data in just about any structured format, but columns aren't the end-all-be-all normalization format. I think pandas uses "frames".

by notepad0x90

4/22/2026 at 5:39:04 PM

>but columns aren't the end-all-be-all normalization format. I think pandas uses "frames".

Pandas is column oriented, as are basically all high-performance data libraries. Each column is a separate array of data. To get a "row" you take the nth item from each of the arrays.

And FWIW, column-oriented isn't considered normalization. It's a physical optimization that can yield enormous performance advantages for some classes of problems, but can cause a performance nightmare for other problems.

Data analytics loves column-oriented. CRUD type stuff does not. And in the programming realm there are several options to have Structures of Arrays (SoA) instead of the classic Arrays of Structures (AoS).
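The AoS-versus-SoA distinction above can be shown in a few lines of plain Python (an assumed illustration of the general pattern, not pandas internals; the names `aos`, `soa`, and the columns are made up):

```python
# AoS: an array of structures, whole records stored together.
aos = [
    {"id": 1, "price": 9.5},
    {"id": 2, "price": 3.0},
    {"id": 3, "price": 7.25},
]

# SoA: a structure of arrays, one array per column, as
# column-oriented libraries store their data.
soa = {
    "id":    [r["id"] for r in aos],
    "price": [r["price"] for r in aos],
}

# An analytics-style scan touches one contiguous array...
total = sum(soa["price"])

# ...while reconstructing "row" n means taking the nth item
# from each of the column arrays.
row1 = {col: vals[1] for col, vals in soa.items()}
print(total, row1)  # 19.75 {'id': 2, 'price': 3.0}
```

The scan-one-column case is where analytics wins; the rebuild-a-whole-row case is where CRUD workloads pay the price.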

by llm_nerd

4/22/2026 at 9:59:41 PM

makes sense, I guess I just meant that it isn't proper normalization without typing. you can have types in something like a sql db (or frames, as you pointed out). but a simple CSV, not so much: you'll have to come up with a custom type scheme using headers or something. so long as arrays are strongly typed, I suppose a simple cell in a column is enough.

by notepad0x90

4/22/2026 at 8:06:51 PM

My mental model of columnar storage is as the old notion of parallel arrays, which I used in the 1970s with FORTRAN. Whatever you learned first sticks with you and you end up translating everything to that, or at least I do. I believe this is known as the baby duck syndrome.

by wpollock

4/22/2026 at 2:55:54 PM

The Apache Arrow array format docs are a great read if you're interested by this blog post.

by data-ottawa

4/22/2026 at 1:59:35 PM

None-or-many?

by pwndByDeath

4/22/2026 at 1:12:59 PM

Is this meant to be a poor explanation of sixth normal form?

by orangepanda

4/22/2026 at 2:23:56 PM

THANK YOU. I was confused at the normalization example given, and had to think through it. (id, name, age) is already at 5NF, and the only one it doesn’t satisfy is 6NF.

by sgarland