2/24/2026 at 11:17:47 AM
Several things going on here:
- concurrency is very hard
- .. but object storage "solves" most of that for you, handing you a set of semantics which work reliably
- single file throughput sucks hilariously badly
- .. because 1 GB is ridiculously large for an atomic unit
- (this whole thing resembles a project I did a decade ago for transactional consistency on TFAT on Flash, except that it somehow managed faster commit times despite running on a 400 MHz MIPS CPU. Edit: maybe I should try to remember how that worked and write it up for HN)
- therefore, all of the actual work is shifted to the broker. The broker is just periodically committing its state in case it crashes
- it's not clear whether the broker ACKs requests before they're in durable storage? Is it possible to lose requests in flight anyway?
- there's a great design for a message queue system between multiple nodes that aims for at-least-once delivery, has existed for decades, and maintains high throughput: SMTP. Actually, there are a whole bunch of message queue systems?
by pjc50
2/24/2026 at 1:48:04 PM
> The broker runs a single group commit loop on behalf of all clients, so no one contends for the object. Critically, it doesn't acknowledge a write until the group commit has landed in object storage. No client moves on until its data is durably committed.
by jitl
2/24/2026 at 9:16:35 PM
Yea, the group commit is the real insight here. I read this blog post and, to help wrap my head around it, I put together a simple TCP-based KV store with group commit; that helped make it click for me.
by aduffy
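The group-commit pattern described above can be sketched in a few lines. This is a hypothetical in-memory toy, not aduffy's actual KV store: a plain list stands in for object storage, many writers enqueue records, and a single committer thread batches them into one "durable" write before acknowledging anyone:

```python
import threading
import queue

class GroupCommitLog:
    """Toy group-commit loop: writers enqueue records; one committer
    thread flushes them as a batch and only then acknowledges each
    writer. self.durable stands in for object storage."""

    def __init__(self):
        self.pending = queue.Queue()
        self.durable = []  # pretend each append is one expensive PUT
        threading.Thread(target=self._commit_loop, daemon=True).start()

    def _commit_loop(self):
        while True:
            # Block for the first record, then drain whatever else arrived.
            batch = [self.pending.get()]
            while not self.pending.empty():
                batch.append(self.pending.get_nowait())
            # One write for the whole group.
            self.durable.append([rec for rec, _ in batch])
            # Ack every writer only after the group has "landed".
            for _, done in batch:
                done.set()

    def write(self, record):
        done = threading.Event()
        self.pending.put((record, done))
        done.wait()  # caller blocks until durably committed
        return True

log = GroupCommitLog()
writers = [threading.Thread(target=log.write, args=(f"rec-{i}",))
           for i in range(8)]
for t in writers: t.start()
for t in writers: t.join()

total = sum(len(batch) for batch in log.durable)
print(total)  # 8 records committed
```

The point to notice: eight concurrent writers produce at most eight "durable" writes (and possibly far fewer, since concurrent arrivals share a batch), yet no writer is acknowledged before its batch is committed.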
2/24/2026 at 12:44:23 PM
AFAIK you can kinda "seek" reads in S3 using a range header, WCGW? =D
by candiddevmike
2/24/2026 at 2:11:59 PM
You can, and it's actually great if you store little "headers" etc to tell you those offsets. Their design doesn't seem super amenable to it because it appears to be one file, but this is why a system that actually intends to scale would break things up. You then cache these headers and, on cache hit, you know "the thing I want is in that chunk of the file, grab it". Throw in bloom filters and now you have a query engine. Works great for Parquet.
by staticassertion
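A toy version of the "cached header tells you which byte range to grab" idea. Everything here is hypothetical: a local `bytes` object stands in for the S3 object, and `range_get` mimics a GET with a `Range` header:

```python
# The cached "header" maps each chunk name to its (offset, length)
# inside one big object, so a cache hit turns into one small ranged read.

def range_get(blob: bytes, start: int, length: int) -> bytes:
    # Stand-in for a real ranged request, e.g.
    #   GET /key  with header  Range: bytes=start-(start+length-1)
    return blob[start:start + length]

# Pack three chunks into one "object" and record their offsets.
chunks = {"a": b"alpha-data", "b": b"bravo-data", "c": b"charlie-data"}
header, blob, off = {}, b"", 0
for name, data in chunks.items():
    header[name] = (off, len(data))  # this is what you'd cache locally
    blob += data
    off += len(data)

# Cache hit: we know exactly which byte range holds "b", so we fetch
# those ~10 bytes instead of downloading the whole object.
start, length = header["b"]
print(range_get(blob, start, length))  # b'bravo-data'
```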
2/24/2026 at 3:27:21 PM
Yep! Besides random reads (~p99 = 200ms on larger ranges), it's also essential for good download performance on a single file. A single (range) request can "only" drive ~500 MB/s, so you need multiple offsets.
by Sirupsen
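One way "you need multiple offsets" plays out in code: issue several ranged reads in parallel and stitch the parts back together. A sketch with an in-memory buffer standing in for S3; in real code each `fetch_range` would be a separate ranged GET on its own connection:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_range(blob: bytes, start: int, end: int) -> bytes:
    # Stand-in for one ranged GET: Range: bytes=start-(end-1)
    return blob[start:end]

def parallel_get(blob: bytes, nparts: int = 4) -> bytes:
    """Split [0, len(blob)) into nparts contiguous ranges, fetch them
    concurrently, and reassemble in order."""
    size = len(blob)
    bounds = [(i * size // nparts, (i + 1) * size // nparts)
              for i in range(nparts)]
    with ThreadPoolExecutor(max_workers=nparts) as ex:
        parts = list(ex.map(lambda b: fetch_range(blob, *b), bounds))
    return b"".join(parts)

blob = bytes(range(256)) * 4096  # 1 MiB stand-in object
print(parallel_get(blob) == blob)  # True: reassembly is lossless
```

Since each range request tops out around ~500 MB/s, N parallel ranges can in principle drive roughly N times that, network permitting.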
2/24/2026 at 2:17:52 PM
Amazon S3 Select enables SQL queries directly on CSV, JSON, or Apache Parquet objects, allowing retrieval of filtered data subsets to reduce latency and costs.
by UltraSane
2/24/2026 at 2:25:53 PM
S3 Select is, very sadly, deprecated. It also supported HTTP RANGE headers! But they've killed it and I'll never forgive them :) Still, it's nbd. You can cache a billion Parquet headers/footers on disk/memory and get 90% of the performance (or better tbh).
by staticassertion
2/25/2026 at 4:48:05 PM
Caching Parquet headers/footers sounds super interesting. Can you say more about how you implemented it?
by dotgov
2/25/2026 at 6:49:24 PM
Currently there's nothing in my headers, but the footer is straightforward. There's the schema, row group metadata, some statistics, byte offsets for each column in a group, page index, etc. It's everything you'd want if you wanted to reject a query outright or, if necessary, query extremely efficiently.
min/max stats for a column are huge because I pre-encode any low-cardinality strings into integers. This means I can skip entire row groups without ever touching S3, just with that footer information, and if I don't have it cached I can read it and skip decoding anything that doesn't have my data.
Footers can get quite large in one sense - 10s-100s of KB for a very large file. But that's obviously tiny compared to a multi-GB Parquet file, and the data can compress extremely well for a second/third-tier cache. You can store 1000s of these pre-parsed in memory no problem, and store 10s of thousands more on disk.
I've spent 0 time optimizing my footers currently. They could be smaller than they are, I assume, but I haven't put much thought into it. In fact, I don't have to assume: I know that my own custom metadata overlaps with the existing Parquet stats and I just haven't bothered to deal with it. TBH there are a bunch of layout optimizations I've yet to explore; using headers would obviously have some benefits (streaming), whereas right now I do a sort of "attempt to grab the footer from the end in chunks until we find it lol". But it doesn't come up because... caching. And there are worse things than a few spurious RANGE requests.
by staticassertion
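The "grab the footer from the end" trick works because of Parquet's fixed trailer: the file ends with the footer, then a 4-byte little-endian footer length, then the magic bytes `PAR1`. A hypothetical sketch of the simple two-read variant plus a cache, with byte slicing standing in for ranged GETs (real code would parse the footer's Thrift metadata rather than treat it as opaque bytes):

```python
import struct

def read_footer(blob: bytes) -> bytes:
    """Parquet layout: ... footer | footer_len (4 bytes LE) | b'PAR1'.
    Two ranged reads suffice: the trailing 8 bytes, then the footer.
    Here blob[a:b] stands in for a ranged GET."""
    tail = blob[-8:]                              # ranged read #1
    footer_len = struct.unpack("<I", tail[:4])[0]
    assert tail[4:] == b"PAR1", "not a parquet file"
    return blob[-8 - footer_len:-8]               # ranged read #2

footer_cache = {}  # object key -> footer bytes; avoids repeat S3 reads

def cached_footer(key: str, blob: bytes) -> bytes:
    if key not in footer_cache:
        footer_cache[key] = read_footer(blob)
    return footer_cache[key]

# Build a fake "parquet" object: data + footer + length + magic.
footer = b'{"row_groups": "..."}'  # placeholder for Thrift FileMetaData
fake = (b"columnar-data" + footer
        + struct.pack("<I", len(footer)) + b"PAR1")

print(cached_footer("s3://bucket/part-0.parquet", fake) == footer)  # True
```

On a cache hit, every subsequent query against that object can consult the min/max stats and column offsets without touching S3 at all, which is the pruning behavior described above.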
2/25/2026 at 7:18:01 PM
Have you tried AWS S3 Tables, which is a managed Iceberg service?
by UltraSane
2/25/2026 at 10:20:29 PM
I haven't. I'm sort of aware of it but I guess I prefer to just have tight control over the protocol/data layout. It's not that hard and it gives me a ton of room to make niche optimizations. I doubt I'd get the same performance if I used it, but I could be wrong. Usually the more you can push your use case into the protocol the better.
by staticassertion
2/24/2026 at 5:53:38 PM
Wow, I didn't know that. To be fair, now that S3 Tables exists it is rather redundant.
by UltraSane