10% of Firefox crashes are caused by bitflips

3/5/2026 at 7:06:43 AM

I've told this story before on HN, but my biz partner at ArenaNet, Mike O'Brien (creator of battle.net) wrote a system in Guild Wars circa 2004 that detected bitflips as part of our bug triage process, because we'd regularly get bug reports from game clients that made no sense.

Every frame (i.e. ~60FPS) Guild Wars would allocate random memory, run math-heavy computations, and compare the results with a table of known values. Around 1 out of 1000 computers would fail this test!

We'd save the test result to the registry and include the result in automated bug reports.
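A minimal sketch of that kind of self-test, not ArenaNet's actual code: fill a freshly allocated buffer from a deterministic generator, run integer math over it, and compare the result with a value computed once on trusted hardware. The function names, buffer size, and mixing function here are all assumptions for illustration.

```python
import array

MASK64 = (1 << 64) - 1

def fill_buffer(seed: int, n_words: int) -> array.array:
    """Fill a fresh buffer with a deterministic xorshift64 sequence."""
    buf = array.array("Q", bytes(8 * n_words))  # newly allocated, zeroed
    x = (seed & MASK64) or 1                    # xorshift state must be non-zero
    for i in range(n_words):
        x ^= (x << 13) & MASK64
        x ^= x >> 7
        x ^= (x << 17) & MASK64
        buf[i] = x
    return buf

def checksum(buf) -> int:
    """Multiply-and-rotate mix over every word; each step is reversible,
    so changing any single word always changes the final value."""
    h = 0xCBF29CE484222325
    for w in buf:
        h = ((h ^ w) * 0x100000001B3) & MASK64
        h = ((h << 5) | (h >> 59)) & MASK64
    return h

def self_test(seed: int, n_words: int, expected: int) -> bool:
    """True if the machine reproduced the known-good value."""
    return checksum(fill_buffer(seed, n_words)) == expected

# Stand-in for the "table of known values": computed here, but in a real
# deployment it would be computed once and shipped with the client.
KNOWN_GOOD = checksum(fill_buffer(1234, 4096))
passed = self_test(1234, 4096, KNOWN_GOOD)
```

On healthy hardware `passed` is always true; a false result can only come from the memory or the CPU misbehaving, which is what makes it useful as a flag on a bug report.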

The common causes we discovered for the problem were:

- overclocked CPU

- bad memory wait-state configuration

- underpowered power supply

- overheating due to under-specced cooling fans or dusty intakes

These problems occurred because Guild Wars was rendering outdoor terrain, and so pushed a lot of polygons compared to many other 3D games of that era (which could clip extensively using binary-space partitioning, portals, etc. that don't work so well for outdoor stuff). So the game caused computers to run hot.

Several years later I learned that Dell computers had larger-than-reasonable analog component problems because Dell sourced the absolute cheapest stuff for their computers; I expect that was also a cause.

And then a few more years on I learned about RowHammer attacks on memory, which was likely another cause -- the math computations we used were designed to hit a memory row quite frequently.

Sometimes I'm amazed that computers even work at all!

Incidentally, my contribution to all this was to write code to launch the browser upon test-failure, and load up a web page telling players to clean out their dusty computer fan-intakes.

by netcoyote

3/5/2026 at 8:36:04 AM

I didn't expect to read bits of GW story here from one of the founders - thanks!

by pndy

3/5/2026 at 1:53:19 AM

Very interesting. The Go toolchain has an (off by default) telemetry system. For Go 1.23, I added the runtime/debug.SetCrashOutput function and used it to gather field reports containing stack traces for crashes in any running goroutine. Since we enabled it over a year ago in gopls, our LSP server, we have discovered hundreds of bugs.

Even with only about 1 in 1000 users enabling telemetry, it has been an invaluable source of information about crashes. In most cases it is easy to reconstruct a test case that reproduces the problem, and the bug is fixed within an hour. We have fixed dozens of bugs this way. When the cause is not obvious, we "refine" the crash by adding if-statements and assertions so that after the next release we gain one additional bit of information from the stack trace about the state of execution.

However there was always a stubborn tail of field reports that couldn't be explained: corrupt stack pointers, corrupt g registers (the thread-local pointer to the current goroutine), or panics dereferencing a pointer that had just passed a nil check. All of these point to memory corruption.

In theory anything is possible if you abuse unsafe or have a data race, but I audited every use of unsafe in the executable and am convinced they are safe. Proving the absence of data races is harder, but nonetheless races usually exhibit some kind of locality in what variable gets clobbered, and that wasn't the case here.

In some cases we have even seen crashes in non-memory instructions (e.g. MOV ZR, R1), which implicates misexecution: a fault in the CPU (or a bug in the telemetry bookkeeping, I suppose).

As a programmer I've been burned too many times by prematurely blaming the compiler or runtime for mistakes in one's own code, so it took a long time to gain the confidence to suspect the foundations in this case. But I recently did some napkin math (see https://github.com/golang/go/issues/71425#issuecomment-39685...) and came to the conclusion that the surprising number of inexplicable field reports--about 10/week among our users--is well within the realm of faulty hardware, especially since our users are overwhelmingly using laptops, which don't have parity memory.

I would love to get definitive confirmation though. I wonder what test the Firefox team runs on memory in their crash reporting software.

by adonovan

3/4/2026 at 9:09:06 PM

A 5 part thread where they say they're "now 100% positive" the crashes are from bitflips, yet not a single word is spent on how they're supposedly detecting bitflips other than just "we analyze memory"?

by thegrim33

3/4/2026 at 9:53:49 PM

> last year we deployed an actual memory tester that runs on user machines after the browser crashes.

He doesn't explain anything indeed but presumably that code is available somewhere.

by tredre3

3/5/2026 at 10:13:45 AM

What a pointless comment.

by thatguy27

3/5/2026 at 2:56:33 PM

The simplest way to do this, what I believe memtest86 and friends do, is to write a fixed pattern over a region of memory and then read it back later and see if it changed; then you write patterns that require flipping the bits that you wrote before, and so on.
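A rough sketch of that write-then-read-back loop, assuming a plain `bytearray` as the region under test. A real tester runs against physical memory and has to defeat caches, which Python cannot do, so this only illustrates the logic. The `disturb` hook is a hypothetical parameter added purely to simulate a fault.

```python
def pattern_test(mem: bytearray, patterns=(0x00, 0xFF, 0xAA, 0x55), disturb=None):
    """Write each pattern over mem, read it back, and return a list of
    (offset, expected, got) tuples for every byte that changed."""
    errors = []
    for p in patterns:
        mem[:] = bytes([p]) * len(mem)   # write the pattern everywhere
        if disturb is not None:
            disturb(mem)                 # hypothetical fault-injection hook
        for off, got in enumerate(mem):  # read back and compare
            if got != p:
                errors.append((off, p, got))
    return errors

def stuck_bit(mem):
    """Simulate bit 2 of byte 5 being stuck at 1."""
    mem[5] |= 0x04

errors = pattern_test(bytearray(16), disturb=stuck_bit)
# Only patterns that want that bit to be 0 (0x00 and 0xAA) expose the fault.
```

This is why the patterns come in complementary pairs: a bit stuck at 1 is invisible to any pattern that already writes a 1 there.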

Things like [1] will also tell you that something corrupted your memory, and if you see a nontrivial (e.g. lots of bits high and low) magic number that has only a single bit wrong, it's probably not a random overwrite - see the examples in [2].

There's also a fun prior example of experiments in this at [3], when someone camped on single-bit differences of a bunch of popular domains and examined how often people hit them.
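For a sense of what "single-bit differences" of a domain look like, here is a small sketch that enumerates them for one DNS label. It is not the tooling from [3], and the function name is made up for illustration.

```python
import string

# Characters allowed in a hostname label. DNS is case-insensitive, so a flip
# that only changes case yields the same name and is skipped.
ALLOWED = set(string.ascii_lowercase + string.digits + "-")

def bitflip_variants(label: str) -> list:
    """Return every valid label that differs from `label` by exactly one bit."""
    variants = set()
    for i, ch in enumerate(label):
        for bit in range(8):
            flipped = chr(ord(ch) ^ (1 << bit))
            if flipped not in ALLOWED:
                continue
            candidate = label[:i] + flipped + label[i + 1:]
            # A label may not start or end with a hyphen.
            if candidate.startswith("-") or candidate.endswith("-"):
                continue
            variants.add(candidate)
    return sorted(variants)

variants = bitflip_variants("example")
```

Each entry in `variants` is a name a machine could end up resolving if one bit of "example" flipped somewhere between the user typing it and the DNS query going out.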

edit: Finally, digging through the Mozilla source, I would imagine [4] is what they're using as a tester when it crashes.

[1] - https://github.com/mozilla-firefox/firefox/commit/917c4a6bfa...

[2] - https://bugzilla.mozilla.org/show_bug.cgi?id=1762568

[3] - https://media.defcon.org/DEF%20CON%2019/DEF%20CON%2019%20pre...

[4] - https://github.com/mozilla-firefox/firefox/blob/main/toolkit...

by rincebrain

3/5/2026 at 3:10:19 PM

That would tell you if there's a bitflip in your test, but not if there's a bitflip in normal program code causing a crash, no? IIUC GP's question was how do they actually tell after a crash that that crash was caused by a bitflip.

by rendaw

3/5/2026 at 3:20:09 PM

The example I gave in there is of adding sentinel values in your data, so you can check the constants in your data structures later and go "oh, this is overwritten with garbage" versus "oh, this is one or two bits off". I would imagine plumbing things like that through most common structures is what was done there, though I haven't done the archaeology to find out, because Firefox is an enormous codebase to try and find one person's commits from several years ago in.
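As a sketch of that "garbage versus one or two bits off" check (not Firefox's actual code; the magic value, threshold, and names are assumptions), you can XOR the observed sentinel with the expected one and count the differing bits:

```python
MAGIC = 0xD3A51C7E  # hypothetical sentinel with a good mix of 0 and 1 bits

def classify_sentinel(observed: int, expected: int = MAGIC, max_flips: int = 2) -> str:
    """Classify a sentinel by its Hamming distance from the expected value."""
    flipped = bin(observed ^ expected).count("1")
    if flipped == 0:
        return "intact"
    if flipped <= max_flips:
        return "likely bitflip"    # a stray write rarely lands this close
    return "likely overwrite"      # many bits differ: probably a software bug

results = [
    classify_sentinel(MAGIC),             # untouched
    classify_sentinel(MAGIC ^ (1 << 9)),  # one bit off
    classify_sentinel(0),                 # zeroed out by a stray write
]
```

The sentinel needs plenty of both 0 and 1 bits for this to work; a magic number of 0 or 1 would be a single bit away from very common garbage values.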

by rincebrain

3/5/2026 at 7:15:16 PM

It sounds like they don't know that the crashes are from bitflips but those crashes are from people with flaky memory which probably caused the crash?

by hexyl_C_gut

3/4/2026 at 9:54:15 PM

I'm glad to see somebody is getting some data on this, I feel bad memory is one of the most underrated issues in computing generally. I'd like to see a more detailed writeup on this, like a short whitepaper.

by kdklol

3/5/2026 at 2:30:06 AM

It is rumored heavily on HN that when the first employee of Google, Craig Silverstein was asked about his biggest regret, he said: "Not pushing for ECC memory."

by camkego

3/5/2026 at 2:45:14 AM

It's true that in the very early days Google used cheap computers without ECC memory, and this explains the desire for checksums in older storage formats such as RecordIO and SSTable, but our production machines have used ECC RAM for a long time now.

by adonovan

3/5/2026 at 3:12:06 AM

One of the nicest guys I have met. Was an intern at Google at that time, firing off mapreduces then (2003-2004) was quite a blast. The Peter Weinberger theme T-shirt too.

by srean

3/4/2026 at 9:56:44 PM

> In other words up to 10% of all the crashes Firefox users see are not software bugs, they're caused by hardware defects! If I subtract crashes that are caused by resource exhaustion (such as out-of-memory crashes) this number goes up to around 15%.

Crashes caused by resource exhaustion are still software bugs in Firefox. At least on sane operating systems where memory isn't overcommitted.

by tredre3

3/4/2026 at 10:37:17 PM

Memory isn't the only resource.

by LorenPechtel

3/5/2026 at 12:46:43 AM

Also a polite reminder that most of those crashes will be concentrated on machines with faulty memory so the naive way of stating the statistic may overestimate its impact to the average user. For the average user this is the difference between 4/5 crashes are from software bugs and 5/5 crashes are from software bugs, and for a lot of people it will still be 5/5

by conartist6

3/4/2026 at 9:59:04 PM

The next logical step would be to somehow inform users so they could take action to replace the bad memory. I realize this is a challenge given the anonymized nature of the crash data, but I might be willing to trade some anonymity in exchange for stability.

by kmoser

3/4/2026 at 10:52:41 PM

The easy solution for that is to just do that analysis locally... Firefox doesn't submit the full core dumps anyhow for this exact reason and therefore needs to do some preprocessing in any case.

by titaniumtravel

3/5/2026 at 2:32:42 AM

>The next logical step would be to somehow inform users so they could take action to replace the bad memory.

This isn't really feasible: have you looked at memory prices lately? The users can't afford to replace bad memory now.

by shiroiuma

3/5/2026 at 7:56:35 AM

The memory issue may not necessarily be from bad ram, it can also be due to configuration issues. Or rather it may be fixable with configuration changes.

I had memory issues with my PC build which I fixed by reducing the speed to 2800 MHz, which is much lower than its advertised speed of 5600 MHz. Actually, looking back at this, it might've configured its speed incorrectly in the first place; reducing it to 2800 just happened to hit a multiple of 2 of its base clock speed.

by hiddendoom45

3/5/2026 at 4:23:40 AM

I have two identical computers; if the RAM on one is bad, I can swap out the RAM from another. But thank you for your concern.

by kmoser

3/4/2026 at 11:03:49 PM

Try running two instances of Firefox in parallel with different profiles, then do a normal quit / close operation on one after any use. Demons exist here.

by stnvh

3/4/2026 at 10:49:00 PM

>> In other words up to 10% of all the crashes Firefox users see are not software bugs, they're caused by hardware defects!

I find this impossible to believe.

If this were so all devs for apps, games, etc... would be talking about this but since this is the first time I'm hearing about this I'm seriously doubting this.

>> This is a bit skewed because users with flaky hardware will crash more often than users with functioning machines, but even then this dwarfs all the previous estimates I saw regarding this problem.

Might be the case, but 10% is still huge.

There imo has to be something else going on. Either their userbase/tracking is biased or something else...

by NotGMan

3/5/2026 at 7:13:15 AM

It is huge, but real (see https://news.ycombinator.com/item?id=47258500)

Browsers, videogames, and Microsoft Excel push computers really hard compared to regular applications, so I expect they're more likely to cause these types of errors.

The original Diablo 2 game servers for battle.net, which were Compaq 1U servers, failed at astonishing rates due to their extremely high utilization and consequent heat-generation. Compaq had never seen anything like it; most of their customers were, I guess, banking apps doing 3 TPS.

by netcoyote

3/4/2026 at 10:26:22 PM

Is there a way to get the memory tester he mentioned? Is it open source? Once RAM goes bad, is there a way of recovering it or is it toasted forever?

by vsgherzi

3/4/2026 at 11:08:36 PM

https://www.memtest86.com/

Errors may be caused by bad seating/contact in the slots or failing memory controllers (generally on the CPU nowadays) but if you have bad sticks they're generally done for.

by vizzier

3/5/2026 at 3:09:34 AM

You can map known-bad memory regions to avoid using them.

https://www.memtest86.com/blacklist-ram-badram-badmemorylist...

by foresto

3/5/2026 at 10:38:32 AM

How many are caused by cosmic radiation bitflips?

by brador

3/4/2026 at 10:28:21 PM

People I think are overindexing on this being about "Bad hardware".

We have long known that single bit errors in RAM are basically "normal" in terms of modern computers. Google did this research in 2009 to quantify the number of error events in commodity DRAM https://static.googleusercontent.com/media/research.google.c...

They found 25,000 to 70,000 errors per billion device hours per Mbit and more than 8% of DIMMs affected by errors per year.
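To put those units into per-machine terms, here is a back-of-the-envelope conversion. The 8 GiB capacity and always-on usage are assumptions for illustration, not figures from the paper.

```python
HOURS_PER_YEAR = 24 * 365

def expected_errors_per_year(errors_per_1e9_hours_per_mbit: float,
                             capacity_gib: float) -> float:
    """Mean error count per year for a machine that is powered on all year."""
    mbit = capacity_gib * 1024 * 8  # GiB -> MiB -> Mbit
    return errors_per_1e9_hours_per_mbit * 1e-9 * mbit * HOURS_PER_YEAR

low = expected_errors_per_year(25_000, 8)   # hypothetical 8 GiB machine
high = expected_errors_per_year(70_000, 8)
```

Both results come out in the tens of thousands per year, which looks absurd until you set it beside the other figure: if only about 8% of DIMMs see any error in a year, the mean has to be dominated by a minority of modules that error a lot while most see none.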

At the time, they did not see an increase in this rate in "new" RAM technologies, which I think was DDR3 at that time. I wonder if there has been any change since then.

A few years ago, I changed from putting my computer to sleep every night, to shutting it down every night. I boot it fresh every day, and the improvements are dramatic. RAM errors will accumulate if you simply put your computer to sleep regularly.

by mrguyorama

3/5/2026 at 12:46:17 AM

There is DRAM which is mildly defective but got past QC.

There are power supplies that are mildly defective but got past QC.

There are server designs where the memory is exposed to EMI and voltage differences that push it ever so slightly out of spec, but that got past QC.

Hardware isn't "good" or "bad", almost all chips produced probably have undetected mild defects.

There are a ton of causes for bitflips other than cosmic rays.

For instance, that specific google paper you cited found a 3x increase in bitflips as datacenter temperature increased! How confident are you the average Firefox user's computer is as temperature-controlled as a google DC?

It also found significantly higher rates as RAM ages! There are a ton of physical properties that can cause this, especially when running 24/7 at high temperatures.

by jmalicki

3/5/2026 at 4:21:04 AM

It'd be interesting to see how your experience would differ if you put it to sleep at night after switching to ECC RAM.

Unfortunately, not that many consumer platforms make this possible or affordable.

by shiroiuma

3/4/2026 at 10:53:17 PM

470k crashes in a week? Considering how low their market share is, that would suggest every install crashes several times a day... I gotta call bs.

by nubinetwork

3/4/2026 at 11:05:49 PM

Based on what data? According to their reporting they have around 200 Million monthly users, which seems compatible with 470k crashes a week? See <https://data.firefox.com/dashboard/user-activity>
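Putting the two numbers side by side, as a rough sanity check only: monthly users and weekly crash reports are not the same population, so treating them as comparable is an assumption.

```python
crashes_per_week = 470_000
monthly_users = 200_000_000

# Average crash reports per user per week under that assumption.
rate = crashes_per_week / monthly_users

# Equivalently, how many users there are for each weekly crash report.
users_per_crash = monthly_users / crashes_per_week

# Conversely: how many installs would there be if every install really
# crashed several times a day (taking "several" as 5, a made-up value)?
implied_installs = crashes_per_week / 7 / 5
```

That works out to roughly one crash report per four hundred-odd users per week, and "every install crashes several times a day" would imply an install base in the low tens of thousands.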

by titaniumtravel

3/4/2026 at 11:16:30 PM

2% worldwide? https://gs.statcounter.com/browser-market-share

Granted, they're probably just as accurate as netcraft. /shrug

by nubinetwork

3/5/2026 at 12:34:09 AM

The nuance here is of course that there are a bunch of people using multiple browsers. Also, I mean, there are a lot of people using browsers in the world.

by titaniumtravel

3/4/2026 at 11:06:34 PM

For my part I'm not sure I recall a crash having daily driven firefox in quite some time. I'd suspect that the large number of bit errors might be driven by a small number of poor hardware clients.

by vizzier

3/5/2026 at 12:55:21 AM

Wouldn't it be more likely that the faulty machines are crashing pretty often?

by pixl97

3/5/2026 at 1:19:12 AM

470k crashes / week

67k crashes / day

claim: "Given # of installs is X; every install must be crashing several times a day"

We'll translate that to: "every install crashes 5 times a day"

67k crashes / day ÷ 5 crashes / install / day

~13k installs

Your claim is there's ~13k firefox users? Lol

by refulgentis