4/7/2026 at 9:12:16 PM
This is essentially S3FS using EFS (AWS's managed NFS service) as a cache layer for active data and small random accesses. Unfortunately, this also means it comes with some of EFS's eye-watering pricing:
— All writes cost $0.06/GB, since everything is first written to the EFS cache. For write-heavy applications, this could be a dealbreaker.
— Reads hitting the cache get billed at $0.03/GB. Large reads (>128kB) get directly streamed from the underlying S3 bucket, which is free.
— Cache is charged at $0.30/GB/month. Even though everything is written to the cache (for consistency purposes), it seems like it's only used for persistent storage of small files (<128kB), so this shouldn't cost too much.
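To make the pricing concrete, here is a rough cost-model sketch using the per-GB rates quoted above; the workload figures in the example are made up for illustration.

```python
# Per-GB rates quoted above for this EFS-cached S3 setup.
WRITE_PER_GB = 0.06            # all writes land in the EFS cache first
CACHED_READ_PER_GB = 0.03      # reads served from the cache
CACHE_STORAGE_PER_GB_MONTH = 0.30  # cache capacity, per GB-month

def monthly_cost(write_gb, cached_read_gb, cache_size_gb):
    """Rough monthly bill; large (>128kB) reads stream from S3 for free."""
    return (write_gb * WRITE_PER_GB
            + cached_read_gb * CACHED_READ_PER_GB
            + cache_size_gb * CACHE_STORAGE_PER_GB_MONTH)

# Hypothetical workload: 1 TB written, 500 GB of small cached reads,
# 50 GB of small files resident in the cache.
print(monthly_cost(1000, 500, 50))  # 60 + 15 + 15 = 90.0
```

Even a modest write volume dominates here, which is why the $0.06/GB write path is the number to watch.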
by MontyCarloHall
4/8/2026 at 4:12:47 AM
Thanks for the analysis. Interestingly, when we first released our low-latency S3-compatible storage (1M IOPS, p99 ~5ms) [1], a lot of people asked the same question: why try to bring file system semantics (atomic object/folder rename) to S3? We also got feedback from people who really need FS semantics, and added POSIX FS support then.

AWS S3FS uses the normal FUSE interface, which would be super heavy due to the inherent overhead of copying data back and forth between user space and kernel space; that was our initial concern when we tried to add POSIX support to the original object storage design. Fortunately, we found and open-sourced a solution [2]: using FUSE_OVER_IO_URING + FUSE_PASSTHROUGH, we can maintain the same high-performance architecture of our original object storage design. We'd like to put out a new blog post explaining more details and revealing our performance numbers if anyone is interested.
[1] https://fractalbits.com/blog/why-we-built-another-object-sto...
by thomas_fa
4/8/2026 at 3:33:11 PM
One advantage over S3FS would be that multiple filesystem mounts see a consistent view of the filesystem, but it looks like this advantage disappears when mixing direct bucket access with filesystem mounts. Given EFS's famously slow small-file performance, it might have been better (and cheaper) to send all files to S3 and use EFS only for the metadata layer. Not having atomic rename is also going to be a problem for any use that expects a regular filesystem.
by objectivefs
4/8/2026 at 2:27:33 AM
This was my concern too. The whole point of using S3 as a file system instead of EBS/EFS (for me at least) is to minimize cost, and I don't really see why I would use this instead of s3fs.
by ktimespi
4/8/2026 at 6:33:56 AM
Probably some tradeoff at high client count, or if you seek into files to read partial data.
by avereveard
4/7/2026 at 10:39:33 PM
> Large reads (>128kB) get directly streamed from the underlying S3 bucket, which is free.

Always uncached? S3 has pretty bad latency.
by the8472
4/7/2026 at 11:27:00 PM
The threshold at which the cache gets used is configurable, with 128kB the default. The assumption is that any read larger than the threshold will be a long sustained read, for which latency doesn't matter too much. My question is: do reads <128kB (or whatever the threshold is) from files >128kB get saved to the cache, or is the cache only used for files whose overall size is under the threshold? Frequent random access to large files is a textbook use case for a caching layer like this, but its cost would be substantial in this system.
by MontyCarloHall
4/7/2026 at 11:55:09 PM
NVMe read latency is in the 10-100µs range for 128kB blocks. S3 is about 100ms. That's 3-4 OOMs. The threshold where the total read duration starts to dominate latency would be somewhere in the dozens to hundreds of megabytes, not kilobytes.
by the8472
4/8/2026 at 12:08:04 AM
I agree, it's an oddly low threshold. The latency differential of NFS vs. S3 is a couple of OOMs, so a threshold of ~10MB seems more appropriate to me. Perhaps it's set intentionally low to avoid racking up immense EFS bills? Setting it higher would effectively mean getting billed $0.03/GB for a huge fraction of reads, which is untenable for most people's applications.
by MontyCarloHall
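The break-even argument above can be sketched numerically. Assuming roughly a 100ms first-byte penalty for S3 versus a nearby cache, and an assumed sustained streaming throughput of 1 GB/s, the size at which transfer time matches the latency penalty is:

```python
# Back-of-envelope check of the threshold argument. Both numbers are
# assumptions for illustration, not measured values for this service.
S3_EXTRA_LATENCY_S = 0.100  # ~100ms extra first-byte latency for S3
THROUGHPUT_BPS = 1e9        # assumed sustained read throughput, bytes/s

# Size at which transfer time equals the latency penalty; reads much
# larger than this are throughput-bound, so the extra latency fades.
break_even_bytes = S3_EXTRA_LATENCY_S * THROUGHPUT_BPS
print(break_even_bytes / 1e6)  # 100.0 MB, vs. the 128kB (0.128 MB) default
```

Under these assumptions the crossover lands around 100MB, roughly three OOMs above the 128kB default, which is what makes the default look cost-motivated rather than latency-motivated.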
4/8/2026 at 2:01:15 PM
Once upon a time, S3 used to cache small objects in their keymap layer, which IIRC had a similar threshold. I assume whatever new caching layer they added is piggybacking on that.

This keeps the new caching layer simple and takes advantage of the existing caching. If they went any bigger, they'd likely need to rearchitect parts of the keymap or underlying storage layer to accommodate, or else face unpredictable TCO.
by mgdev
4/8/2026 at 12:18:01 AM
> NVMe read latency is in the 10-100µs range for 128kB blocks. S3 is about 100ms. That's 3-4 OOMs.

Aren't you comparing local in-process latency to network latency? That's multiple OOMs right there.
by antonvs
4/8/2026 at 12:41:42 AM
No, within the same DC, network latency doesn't add that much. After all, EFS also manages 600µs average latency. It's really just S3 that's slow. I assume some large fraction of S3 is spread over HDDs, not SSDs.
by the8472
4/8/2026 at 10:24:23 AM
I imagine (hope) that they are doing some kind of intelligent read-ahead in the frontend servers to optimize for sequential reads, which would keep this from looking terrible for applications.
4/8/2026 at 4:31:14 AM
> directly streamed from the underlying S3 bucket, which is free.

No reads from S3 are free. All outgoing traffic from AWS is charged, no matter what.
by deepsun
4/8/2026 at 6:07:13 AM
Reads from S3 via an S3 endpoint inside a VPC, to an interface inside that VPC, are not billed.