12/8/2025 at 8:57:28 PM
Every time someone builds one of these things and skips over "overcomplicated theory", aphyr destroys them. At this point, I wonder if we could train an AI to look over a project's documentation and predict whether it's likely to lose committed writes just based on the marketing / technical claims. We probably can.
by stmw
12/8/2025 at 10:17:59 PM
/me strokes my long grey beard and nods
People always think "theory is overrated" or "hacking is better than having a school education", and then proceed to shoot themselves in the foot with "workarounds" that break well-known, well-documented, well-traversed problem spaces.
by awesome_dude
12/8/2025 at 10:52:29 PM
Certainly a narrative that is popular among the greybeard crowd, yes. In pretty much every field I've worked in, the opposite problem has been much, much more common.
by whimsicalism
12/9/2025 at 1:29:33 AM
What fields? Cargo-culting is annoying and definitely leads to suboptimal solutions and sometimes total misses, but I've rarely found that simply reading the literature on a thorny topic prevents you from thinking outside the box. Most people I've seen who were actually innovating (as in novel solutions and/or execution) understood the current SOTA of what they were working on inside and out.
by johncolanduoni
12/9/2025 at 9:31:07 AM
I suspect they were more referring to curmudgeons not patching.
I was engaged after one of the world's biggest data leaks. The security org was hyper-worried about the cloud environment, which was in its infancy, despite the fact that their data leak was from an on-prem mainframe-style system and they hadn't really improved their posture in any significant way despite spending £40m.
As an aside, I use NATS for some workloads where I've obviously spent low effort validating whether it's a great idea, and I'm pretty horrified by the report. (=
by ownagefool
12/8/2025 at 11:04:22 PM
What's the opposite problem statement?
by _zoltan_
12/8/2025 at 11:33:28 PM
People overly beholden to the tried-and-true 'known' way of addressing a problem space and not considering, or belittling, alternatives. Many of the things that have been most aggressively 'bitter lesson'-ed in the last decade fall into this category.
by whimsicalism
12/8/2025 at 11:55:40 PM
Like this bug report?
The things that have been "disrupted" haven't delivered: blockchains are still a scam; food delivery services are worse than before (restaurants are worse off, the people making the deliveries are worse off); taxis still needed to go back and vet drivers to ensure that they weren't fiends.
by awesome_dude
12/9/2025 at 12:21:18 AM
> Blockchains are still a scam
Did you actually look at the blockchain node implementations as of 2025 and what's on the roadmap? Ethereum nodes/L2s with optimistic or zk-proofs are probably the most advanced distributed databases that actually work.
(not talking about "coins" and stuff obviously, another debate)
by hbbio
12/9/2025 at 12:59:58 AM
> Ethereum nodes/L2s with optimistic or zk-proofs are probably the most advanced distributed databases that actually work.
What are you comparing against? Aren't they slower, less convenient, and less available than, say, DynamoDB or Spanner, both of which have been in full-service, reliable operation since 2012?
by otterley
12/9/2025 at 5:03:21 AM
I think they mean big-D "Distributed", i.e. in the sense that a DHT is Distributed: decentralized in both a logical and a political sense.
A big DynamoDB/Spanner deployment is great while you can guarantee some benevolent (or just not-malevolent) org is around to host the deployment for everyone else. But technologies of this type have no answer to the key problem of "ensure the infra survives its own founding/maintaining org being co-opted and enshittified by parties hostile to the central purpose of the network."
Blockchains — and all the overhead and pain that comes with them — are basically what you get when you take the classical small-D distributed database design, and add the components necessary to get that extra property.
by derefr
12/9/2025 at 9:53:14 AM
Ethereum is so good at being distributed that it's decentralized.
DynamoDB and Spanner are both great, but they're meant to be run by a single admin. That's a considerably simpler problem to solve.
by hbbio
12/9/2025 at 8:03:53 AM
Which are both systems with a fair amount of theory behind them!
by Agingcoder
12/9/2025 at 1:12:28 AM
The big difference is the trust assumption: anyone can join or leave the network of nodes at any time.
by drdrey
12/9/2025 at 2:48:25 AM
I think you are being downvoted because Ethereum requires you to stake 32 ETH (about $100k), and the entry queue right now is about 9 days and the exit queue about 20 days. So only people with enough capital can join the network, and it takes quite some time to join or leave, as opposed to being able to do it at any time you want.
by charcircuit
12/9/2025 at 6:32:54 AM
OK, but these are details; the point is that the operators of the database are external, selfish, and fluctuating.
by drdrey
12/9/2025 at 1:20:55 AM
The traditional way is paper trails and/or WORM (write-once-read-many) devices, with local checksums. You can have multiple replicas without extra computation for hashes and the like.
by j16sdiz
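The local-checksum idea above can be sketched in a few lines (illustrative Python; the helper names are made up, not any particular WORM tooling): each record stores a digest at write time, and a replica that fails verification is discarded rather than trusted — which is exactly the check that catches the single-bit corruption the report describes.

```python
import hashlib

def record_digest(record: bytes) -> str:
    # Per-record checksum, computed and stored alongside the record at write time.
    return hashlib.sha256(record).hexdigest()

def find_corrupt_replicas(replicas, expected: str):
    # Return indices of replicas whose bytes no longer match the stored checksum.
    return [i for i, r in enumerate(replicas) if record_digest(r) != expected]

good = b"acked message"
expected = record_digest(good)
corrupt = bytes([good[0] ^ 0x01]) + good[1:]   # a single-bit flip in one replica
bad = find_corrupt_replicas([good, corrupt, good], expected)
```

The point of the comment is that this verification is cheap: the hash is computed once per write, and reads only re-hash what they fetch.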
12/9/2025 at 5:54:44 PM
idk, sounds like you're ignoring tried-and-true microeconomic theoretical principles about consumer surplus. Better get back to the books before commenting.
by whimsicalism
12/8/2025 at 11:18:23 PM
The ivory tower standing in the way of delivering value, I think.
by MrDarcy
12/8/2025 at 11:32:37 PM
To be more specific, goals of perfection where perfection does not at all matter.
by colechristensen
12/9/2025 at 3:54:32 AM
What does bothering to read some distributed-systems literature have to do with demanding unnecessary perfection? Did NATS say in its docs that JetStream accepted split-brain conditions as a reality, or that metadata corruption could silently delete a topic? You could maybe argue the fsync default was a tradeoff, though I think it's a bad one (not the existence of the flag, just the default being "false"). The rest are not the kind of bugs you expect to see in a five-year-old persistence layer.
by johncolanduoni
12/9/2025 at 4:09:40 AM
Exactly. "Losing data from acknowledged writes" is not failing to be perfect; it's failing to deliver on the (advertised) basics of storing your data.
by stmw
12/9/2025 at 5:52:42 AM
Last time I was at school, requirements analysis was a thing, but do go off.
by LaGrange
12/9/2025 at 2:16:55 PM
I don't have a "school education" and I know plenty of theory; I've certainly read the papers cited in this test.
by staticassertion
12/9/2025 at 2:39:39 PM
You might not have a school education, but you have educated yourself. It is unfortunately common to hear people complain that the theory one learns in school (or by determined self-study) is useless, which I think is what the greybeard comment you replied to intends to say.
by mzl
12/9/2025 at 6:42:43 PM
OK, the real differences between self-directed study and school-based study:
1. School-based study is supposed to cover all the basics; with self-directed study you have to know what the basics are, or find out, and then cover them.
2. In school-based study, the teachers/lecturers are supposed to have checked all the available texts on the subject and then share the best with the students (the teachers are the ones who ensure nobody goes down unproductive rabbit holes).
3. People can see from the qualifications that a person has met a certain standard, understands the subject, has the knowledge, and can communicate it to a prescribed level.
Personal note: I have done both in different careers, and being "self-taught" I realised that whilst I definitely knew more about one topic in the field than qualified individuals, I never knew what the complete set of study for the field was (I never knew how much they really knew, so I could never fill the gaps I had).
In CS I gained my qualification in 2010. When I went to find work, a lot of places were placing emphasis on self-taught people, who were deemed to be more creative, more motivated, etc. When I did work with those individuals, without fail they were missing a basic understanding of fundamentals, like data structures, well-known algorithms, and so on.
by awesome_dude
12/9/2025 at 11:51:50 AM
The only post in this thread that actually summarized the core findings of the study, namely:
- ACKed messages can be silently lost due to minority-node corruption.
- A single-bit corruption can cause some replicas to lose up to 78% of stored messages
- Snapshot corruption can propagate and lead to entire stream deletion across the cluster.
- The default lazy-fsync mode can drop minutes of acknowledged writes on a crash.
- A crash combined with network delay can cause persistent split-brain and divergent logs.
- Data loss even with “sync_interval = always” in presence of membership changes or partitions.
- Self-healing and replica convergence did not always work reliably after corruption.
…was not downvoted, but flagged... That is telling. Documented failure modes are apparently controversial. Also raises the question: What level of technical due diligence was performed by organizations like Mastercard, Volvo, PayPal, Baidu, Alibaba, or AT&T before adopting this system?
So what is next? Nominate NATS for the Silent Failure Peace Prize?
by belter
12/9/2025 at 12:43:28 PM
> Nominate NATS for the Silent Failure Peace Prize?
One or two of the comments on GitHub by the NATS team in response to issues opened by Kyle are also more than a bit cringeworthy.
Such as this one:
"Most of our production setups, and in fact Synadia Cloud as well is that each replica is in a separate AZ. These have separate power, networking etc. So the possibility of a loss here is extremely low in terms of due to power outages."
Which Kyle had to call them out on:
"Ah, I have some bad news here--placing nodes in separate AZs does not mean that NATS' strategy of not syncing things to disk is safe. See #7567 for an example of a single node failure causing data loss (and split-brain!)."
https://github.com/nats-io/nats-server/issues/7564#issuecomm...
by traceroute66
12/9/2025 at 9:56:58 PM
> What level of technical due diligence was performed by organizations like Mastercard, Volvo, PayPal, Baidu, Alibaba, or AT&T before adopting this system?
I have to note the following as a NATS fan:
- I am horrified at Jepsen's reliability findings, though they do vindicate certain design decisions I made in the past.
- 'Core NATS' is really mostly 'Redis pub/sub, but better', and Core NATS is honestly awesome, low-friction middleware. I've used it as part of eventing systems in the past and it works great.
- FWIW, there's an MQTT bridge that requires JetStream, but if you're just doing QoS 0 you can work around the other warts.
- If you use JetStream KV as a cache layer without real persistence (i.e. closer to how one uses Redis KV, where it's just memory-backed), you don't care about any of this. And again, JetStream KV IMO is better than Redis KV since they added TTLs.
All of that is a way to say, I'd bet a lot of them are using Core NATS or other specific features versus something like JetStream.
tl;dr - JetStream's reliability is horrifying, apparently, but I stand by the statement that Core NATS and ephemeral KV are amazing.
by to11mtm
12/9/2025 at 7:41:04 AM
You can have DeepWiki literally scan the source code and tell you:
> 2. Delayed Sync Mode (Default)
> In the default mode, writes are batched and marked with needSync = true for later synchronization filestore.go:7093-7097 . The actual sync happens during the next syncBlocks() execution.
However, if you read DeepWiki's conclusion, it is far more optimistic than what Aphyr uncovered in real-world testing.
> Durability Guarantees
> Even with delayed fsyncs, NATS provides protection against data loss through:
> 1. Write-Ahead Logging: Messages are written to log files before being acknowledged
> 2. Periodic Sync: The sync timer ensures data is eventually flushed to disk
> 3. State Snapshots: Full state is periodically written to index.db files filestore.go:9834-9850
> 4. Error Handling: If sync operations fail, NATS attempts to rebuild state from existing data filestore.go:7066-7072
https://deepwiki.com/search/will-nats-lose-uncommitted-wri_b...
by PeterCorless
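The window DeepWiki describes can be modeled in a toy sketch (illustrative Python, not the actual filestore.go logic; only the `needSync` name is borrowed from the quote): the append is acknowledged as soon as the bytes reach the OS page cache, and durability arrives later, when the periodic sync runs.

```python
import os
import tempfile

class LazySyncLog:
    # Toy model of ack-before-fsync: appends are acknowledged immediately,
    # and data only becomes durable when the periodic sync runs.
    def __init__(self, path):
        self.f = open(path, "ab")
        self.need_sync = False          # analogous to the needSync flag quoted above

    def append(self, payload: bytes) -> str:
        self.f.write(payload)
        self.f.flush()                  # reaches the OS page cache, not the disk
        self.need_sync = True
        return "ACK"                    # acked while a power loss could still drop it

    def sync_blocks(self):
        # The periodic timer: only here does the write survive a crash.
        if self.need_sync:
            os.fsync(self.f.fileno())
            self.need_sync = False

fd, path = tempfile.mkstemp()
os.close(fd)
log = LazySyncLog(path)
ack = log.append(b"payment-1\n")
window_open = log.need_sync             # True: acked but not yet durable
log.sync_blocks()                       # only now is the write on stable storage
```

Everything between the `return "ACK"` and the next `sync_blocks()` call is the loss window Aphyr measured in minutes.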
12/9/2025 at 10:31:56 AM
> if you read DeepWiki's conclusion, it is far more optimistic
Well, it's an LLM ... of course it's going to be optimistic. ;-)
by traceroute66
12/9/2025 at 5:00:33 PM
"You are entirely correct!"
by PeterCorless
12/9/2025 at 12:01:43 PM
And your point is ...?
by 63stack
12/9/2025 at 4:06:56 PM
I don't think they were making a point. Someone suggested using an LLM for this; someone then responded by using an LLM for it.
What you draw from that seems entirely up to you. They don't seem to be making any claims or implying anything by doing so, just showing the result.
by staticassertion
12/9/2025 at 5:00:02 PM
Exactly.
by PeterCorless
12/9/2025 at 2:17:40 PM
You can DIY without aphyr.
by esafak
12/9/2025 at 2:57:38 PM
But this example of DIY led to incorrect conclusions about data integrity.
by otterley
12/9/2025 at 5:59:00 PM
It's not even "overcomplicated theory"; it's just "commit your writes before you say you committed your writes". It's actually way, way more complicated to try to build a system that tries to be correct without doing that.
by staticassertion
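The "commit before you say you committed" ordering is a one-line discipline; a minimal sketch (illustrative Python, not NATS code):

```python
import os
import tempfile

def durable_append(path: str, payload: bytes) -> str:
    # Write, fsync, and only then acknowledge: the ack never outruns the disk.
    with open(path, "ab") as f:
        f.write(payload)
        f.flush()                # drain the userspace buffer to the OS
        os.fsync(f.fileno())     # force the OS to commit to stable storage
    return "ACK"                 # safe: from here on, the bytes survive a crash

fd, path = tempfile.mkstemp()
os.close(fd)
ack = durable_append(path, b"msg-1\n")
```

The cost is fsync latency on every acknowledged write, which is precisely the tradeoff the lazy-sync default avoids paying — at the price discussed above.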
12/10/2025 at 2:51:29 AM
You don't even have to train an AI. At this point, in the absence of evidence to the contrary, we should default to "it loses committed writes".
by asa400
12/8/2025 at 10:46:41 PM
I've asked LLMs to do similar tasks and the results were very useful.
by dboreham
12/9/2025 at 1:30:33 AM
I can't wait until it's good enough to vibe-code the next MongoDB.
by johncolanduoni
12/9/2025 at 10:19:17 AM
Aim for all three of CAP to really hit the right vibes.
by lnenad