12/8/2025 at 8:57:28 PM
Every time someone builds one of these things and skips over "overcomplicated theory", aphyr destroys them. At this point, I wonder if we could train an AI to look over a project's documentation and predict whether it's likely to lose committed writes just based on the marketing / technical claims. We probably can.
by stmw
12/8/2025 at 10:17:59 PM
/me strokes my long grey beard and nods
People always think "theory is overrated" or "hacking is better than having a school education", and then proceed to shoot themselves in the foot with "workarounds" that break well-known, well-documented, well-traversed problem spaces.
by awesome_dude
12/8/2025 at 10:52:29 PM
Certainly a narrative that is popular among the greybeard crowd, yes. In pretty much every field I've worked in, the opposite problem has been much, much more common.
by whimsicalism
12/9/2025 at 1:29:33 AM
What fields? Cargo-culting is annoying and definitely leads to suboptimal solutions and sometimes total misses, but I've rarely found that simply reading the literature on a thorny topic prevents you from thinking outside the box. Most people I've seen who were actually innovating (as in novel solutions and/or execution) understood the current SOTA of what they were working on inside and out.
by johncolanduoni
12/9/2025 at 9:31:07 AM
I suspect they were more referring to curmudgeons not patching.
I was engaged after one of the world's biggest data leaks. The security org was hyper-worried about the cloud environment, which was in its infancy, despite the fact that their data leak was from an on-prem mainframe-style system and they hadn't really improved their posture in any significant way despite spending £40m.
As an aside, I use NATS for some workloads where I've obviously spent low effort validating whether it's a great idea, and I'm pretty horrified by the report. (=
by ownagefool
12/8/2025 at 11:04:22 PM
What's the opposite problem statement?
by _zoltan_
12/8/2025 at 11:33:28 PM
People overly beholden to the tried-and-true 'known' way of addressing a problem space and not considering, or belittling, alternatives. Many of the things that have been most aggressively 'bitter lesson'-ed in the last decade fall into this category.
by whimsicalism
12/8/2025 at 11:55:40 PM
Like this bug report?
The things that have been "disrupted" haven't delivered: blockchains are still a scam; food delivery services are worse than before (restaurants are worse off, the people making the deliveries are worse off); taxis still needed to go back and vet drivers to ensure that they weren't fiends.
by awesome_dude
12/9/2025 at 12:21:18 AM
> Blockchains are still a scam
Did you actually look at the blockchain node implementations as of 2025 and what's on the roadmap? Ethereum nodes/L2s with optimistic or zk-proofs are probably the most advanced distributed databases that actually work.
(not talking about "coins" and stuff obviously, another debate)
by hbbio
12/9/2025 at 12:59:58 AM
> Ethereum nodes/L2s with optimistic or zk-proofs are probably the most advanced distributed databases that actually work.
What are you comparing against? Aren't they slower, less convenient, and less available than, say, DynamoDB or Spanner, both of which have been in full-service, reliable operation since 2012?
by otterley
12/9/2025 at 5:03:21 AM
I think they mean big-D "Distributed", i.e. in the sense that a DHT is Distributed: decentralized in both a logical and a political sense.
A big DynamoDB/Spanner deployment is great while you can guarantee some benevolent (or just not-malevolent) org is around to host the deployment for everyone else. But technologies of this type have no answer to the key problem of "ensure the infra survives its own founding/maintaining org being co-opted and enshittified by parties hostile to the central purpose of the network."
Blockchains — and all the overhead and pain that comes with them — are basically what you get when you take the classical small-D distributed database design, and add the components necessary to get that extra property.
by derefr
12/9/2025 at 9:53:14 AM
Ethereum is so good at being distributed that it's decentralized.
DynamoDB and Spanner are both great, but they're meant to be run by a single admin. That's a considerably simpler problem to solve.
by hbbio
12/9/2025 at 8:03:53 AM
Which are both systems with a fair amount of theory behind them!
by Agingcoder
12/9/2025 at 1:12:28 AM
The big difference is the trust assumption: anyone can join or leave the network of nodes at any time.
by drdrey
12/9/2025 at 2:48:25 AM
I think you are being downvoted because Ethereum requires you to stake 32 ETH (about $100k), and the entry queue right now is about 9 days and the exit queue about 20 days. So only people with enough capital can join the network, and it takes quite some time to join or leave, as opposed to being able to do it at any time you want.
by charcircuit
12/9/2025 at 6:32:54 AM
OK, but these are details; the point is that the operators of the database are external, selfish, and fluctuating.
by drdrey
12/9/2025 at 1:20:55 AM
The traditional way is paper trails and/or WORM (write-once-read-many) devices, with local checksums. You can have multiple replicas without extra computation for hashes and the like.
by j16sdiz
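The local-checksum idea above can be sketched in a few lines (illustrative Python; the helper names are made up, not any particular WORM tooling): each record stores a digest at write time, and a replica that fails verification is discarded rather than trusted — which is exactly the check that catches the single-bit corruption the report describes.

```python
import hashlib

def record_digest(record: bytes) -> str:
    # Per-record checksum, computed and stored alongside the record at write time.
    return hashlib.sha256(record).hexdigest()

def find_corrupt_replicas(replicas, expected: str):
    # Return indices of replicas whose bytes no longer match the stored checksum.
    return [i for i, r in enumerate(replicas) if record_digest(r) != expected]

good = b"acked message"
expected = record_digest(good)
corrupt = bytes([good[0] ^ 0x01]) + good[1:]   # a single-bit flip in one replica
bad = find_corrupt_replicas([good, corrupt, good], expected)
```

The point of the comment is that this verification is cheap: the hash is computed once per write, and reads only re-hash what they fetch.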
12/9/2025 at 5:54:44 PM
idk, sounds like you're ignoring tried-and-true microeconomic theoretical principles about consumer surplus. Better get back to the books before commenting.
by whimsicalism
12/8/2025 at 11:18:23 PM
The ivory tower standing in the way of delivering value, I think.
by MrDarcy
12/8/2025 at 11:32:37 PM
To be more specific, goals of perfection where perfection does not at all matter.
by colechristensen
12/9/2025 at 3:54:32 AM
What does bothering to read some distributed-systems literature have to do with demanding unnecessary perfection? Did NATS say in its docs that JetStream accepted split-brain conditions as a reality, or that metadata corruption could silently delete a topic? You could maybe argue the fsync default was a tradeoff, though I think it's a bad one (not the existence of the flag, just the default being "false"). The rest are not the kind of bugs you expect to see in a five-year-old persistence layer.
by johncolanduoni
12/9/2025 at 4:09:40 AM
Exactly. "Losing data from acknowledged writes" is not failing to be perfect; it's failing to deliver on the (advertised) basics of storing your data.
by stmw
12/9/2025 at 5:52:42 AM
Last time I was at school, requirements analysis was a thing, but do go off.
by LaGrange
12/9/2025 at 2:16:55 PM
I don't have a "school education" and I know plenty of theory; I've certainly read the papers cited in this test.
by staticassertion
12/9/2025 at 2:39:39 PM
You might not have a school education, but you have educated yourself. It is unfortunately common to hear people complain that the theory one learns in school (or by determined self-study) is useless, which I think is what the greybeard comment you replied to intends to say.
by mzl
12/9/2025 at 6:42:43 PM
OK, the real differences between self-directed study and school-based study:
1. School-based study is supposed to cover all the basics; with self-directed study you have to know what the basics are, or find out, and then cover them.
2. In school-based study, the teachers/lecturers are supposed to have checked all the available texts on the subject and then share the best with the students (the teachers are the ones who ensure nobody goes down unproductive rabbit holes).
3. People can see from the qualifications that a person has met a certain standard, understands the subject, has the knowledge, and can communicate it to a prescribed level.
Personal note: I have done both in different careers, and being "self-taught" I realised that whilst I definitely knew more about one topic in the field than qualified individuals, I never knew what the complete set of study for the field was (I never knew how much they really knew, so I could never fill the gaps I had).
In CS I gained my qualification in 2010. When I went to find work, a lot of places were placing emphasis on self-taught people, who were deemed to be more creative, more motivated, etc. When I did work with those individuals, without fail they were missing a basic understanding of fundamentals, like data structures, well-known algorithms, and so on.
by awesome_dude
12/9/2025 at 11:51:50 AM
The only post in this thread that actually summarized the core findings of the study, namely:
- ACKed messages can be silently lost due to minority-node corruption.
- A single-bit corruption can cause some replicas to lose up to 78% of stored messages
- Snapshot corruption can propagate and lead to entire stream deletion across the cluster.
- The default lazy-fsync mode can drop minutes of acknowledged writes on a crash.
- A crash combined with network delay can cause persistent split-brain and divergent logs.
- Data loss even with “sync_interval = always” in presence of membership changes or partitions.
- Self-healing and replica convergence did not always work reliably after corruption.
…was not downvoted, but flagged... That is telling. Documented failure modes are apparently controversial. Also raises the question: What level of technical due diligence was performed by organizations like Mastercard, Volvo, PayPal, Baidu, Alibaba, or AT&T before adopting this system?
So what is next? Nominate NATS for the Silent Failure Peace Prize?
by belter
12/9/2025 at 12:43:28 PM
> Nominate NATS for the Silent Failure Peace Prize?
One or two of the comments on GitHub by the NATS team in response to issues opened by Kyle are also more than a bit cringeworthy.
Such as this one:
"Most of our production setups, and in fact Synadia Cloud as well is that each replica is in a separate AZ. These have separate power, networking etc. So the possibility of a loss here is extremely low in terms of due to power outages."
Which Kyle had to call them out on:
"Ah, I have some bad news here--placing nodes in separate AZs does not mean that NATS' strategy of not syncing things to disk is safe. See #7567 for an example of a single node failure causing data loss (and split-brain!)."
https://github.com/nats-io/nats-server/issues/7564#issuecomm...
by traceroute66
12/9/2025 at 9:56:58 PM
> What level of technical due diligence was performed by organizations like Mastercard, Volvo, PayPal, Baidu, Alibaba, or AT&T before adopting this system?
I have to note the following as a NATS fan:
- I am horrified at Jepsen's reliability findings, though they do vindicate certain design decisions I made in the past.
- 'Core NATS' is really mostly 'Redis pub/sub, but better', and Core NATS is honestly awesome, low-friction middleware. I've used it as part of eventing systems in the past and it works great.
- FWIW, there's an MQTT bridge that requires JetStream, but if you're just doing QoS 0 you can work around the other warts.
- If you use JetStream KV as a cache layer without real persistence (i.e. closer to how one uses Redis KV, where it's just memory-backed), you don't care about any of this. And again, JetStream KV IMO is better than Redis KV since they added TTLs.
All of that is a way to say, I'd bet a lot of them are using Core NATS or other specific features versus something like JetStream.
tl;dr - JetStream's reliability is horrifying, apparently, but I stand by the statement that Core NATS and ephemeral KV are amazing.
by to11mtm
12/9/2025 at 7:41:04 AM
You can have DeepWiki literally scan the source code and tell you:
> 2. Delayed Sync Mode (Default)
> In the default mode, writes are batched and marked with needSync = true for later synchronization filestore.go:7093-7097 . The actual sync happens during the next syncBlocks() execution.
However, if you read DeepWiki's conclusion, it is far more optimistic than what Aphyr uncovered in real-world testing.
> Durability Guarantees
> Even with delayed fsyncs, NATS provides protection against data loss through:
> 1. Write-Ahead Logging: Messages are written to log files before being acknowledged
> 2. Periodic Sync: The sync timer ensures data is eventually flushed to disk
> 3. State Snapshots: Full state is periodically written to index.db files filestore.go:9834-9850
> 4. Error Handling: If sync operations fail, NATS attempts to rebuild state from existing data filestore.go:7066-7072
https://deepwiki.com/search/will-nats-lose-uncommitted-wri_b...
by PeterCorless
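The window DeepWiki describes can be modeled in a toy sketch (illustrative Python, not the actual filestore.go logic; only the `needSync` name is borrowed from the quote): the append is acknowledged as soon as the bytes reach the OS page cache, and durability arrives later, when the periodic sync runs.

```python
import os
import tempfile

class LazySyncLog:
    # Toy model of ack-before-fsync: appends are acknowledged immediately,
    # and data only becomes durable when the periodic sync runs.
    def __init__(self, path):
        self.f = open(path, "ab")
        self.need_sync = False          # analogous to the needSync flag quoted above

    def append(self, payload: bytes) -> str:
        self.f.write(payload)
        self.f.flush()                  # reaches the OS page cache, not the disk
        self.need_sync = True
        return "ACK"                    # acked while a power loss could still drop it

    def sync_blocks(self):
        # The periodic timer: only here does the write survive a crash.
        if self.need_sync:
            os.fsync(self.f.fileno())
            self.need_sync = False

fd, path = tempfile.mkstemp()
os.close(fd)
log = LazySyncLog(path)
ack = log.append(b"payment-1\n")
window_open = log.need_sync             # True: acked but not yet durable
log.sync_blocks()                       # only now is the write on stable storage
```

Everything between the `return "ACK"` and the next `sync_blocks()` call is the loss window Aphyr measured in minutes.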
12/9/2025 at 10:31:56 AM
> if you read DeepWiki's conclusion, it is far more optimistic
Well, it's an LLM ... of course it's going to be optimistic. ;-)
by traceroute66
12/9/2025 at 5:00:33 PM
"You are entirely correct!"
by PeterCorless
12/9/2025 at 12:01:43 PM
And your point is ...?
by 63stack
12/9/2025 at 4:06:56 PM
I don't think they were making a point. Someone suggested using an LLM for this; someone then responded by using an LLM for it.
What you draw from that seems entirely up to you. They don't seem to be making any claims or implying anything by doing so, just showing the result.
by staticassertion
12/9/2025 at 5:00:02 PM
Exactly.
by PeterCorless
12/9/2025 at 2:17:40 PM
You can DIY without aphyr.
by esafak
12/9/2025 at 2:57:38 PM
But this example of DIY led to incorrect conclusions about data integrity.
by otterley
12/9/2025 at 5:59:00 PM
It's not even "overcomplicated theory"; it's just "commit your writes before you say you committed your writes". It's actually way, way more complicated to try to build a system that tries to be correct without doing that.
by staticassertion
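The "commit before you say you committed" ordering is a one-line discipline; a minimal sketch (illustrative Python, not NATS code):

```python
import os
import tempfile

def durable_append(path: str, payload: bytes) -> str:
    # Write, fsync, and only then acknowledge: the ack never outruns the disk.
    with open(path, "ab") as f:
        f.write(payload)
        f.flush()                # drain the userspace buffer to the OS
        os.fsync(f.fileno())     # force the OS to commit to stable storage
    return "ACK"                 # safe: from here on, the bytes survive a crash

fd, path = tempfile.mkstemp()
os.close(fd)
ack = durable_append(path, b"msg-1\n")
```

The cost is fsync latency on every acknowledged write, which is precisely the tradeoff the lazy-sync default avoids paying — at the price discussed above.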
12/10/2025 at 2:51:29 AM
You don't even have to train an AI. At this point, in the absence of evidence to the contrary, we should default to "it loses committed writes".
by asa400
12/8/2025 at 10:46:41 PM
I've asked LLMs to do similar tasks and the results were very useful.
by dboreham
12/9/2025 at 1:30:33 AM
I can't wait until it's good enough to vibe-code the next MongoDB.
by johncolanduoni
12/9/2025 at 10:19:17 AM
Aim for all three of CAP to really hit the right vibes.
by lnenad