3/27/2025 at 2:24:49 PM
(disclaimer: I know OP IRL.)
I'm seeing a lot of comments saying "only 2 days? must not have been that bad of a bug". Some thoughts here:
At my current day job, our postmortem template asks "Where did we get lucky?" In this instance, the author definitely got lucky that they were working at Google where 1) there were enough users to generate this Heisenbug consistently and 2) that they had direct access to Chrome devs.
Additionally - the author (and his team) triaged, root caused and remediated a JS compiler bug in 2 days. The sheer amount of complexity involved in trying to narrow down where in the browser code this could all be going wrong is staggering. Consider that the reason it took him "only" two days is because he is very, _very_ good at what he does.
by aetimmes
3/27/2025 at 6:02:06 PM
Days-taken-to-fix is kind of a weird measure for how difficult a bug is. It's clearly a factor of a large number of things that aren't the bug itself, including experience and whether you have to go it alone or if you can talk to the right people.
The bug ticks most of the boxes for a tricky bug:
* Non-deterministic
* Enormous haystack
* Unexpected "1+1=3"-type error with a cause outside of the code itself
Like sure, it would have been slower to debug if it took 30 hours to reproduce, and harder if he had to be going down Niagara Falls in a barrel while debugging it, but I'm not sure those things quite count.
I had a similar category of bug I was struggling with the other year[1] that was related to a faulty optimization in the GraalVM JVM leading to bizarre behavior in very rare circumstances. If I'd been sitting next to the right JVM engineers over at Oracle I'm sure we'd have figured it out in days and not the weeks it took me.
by marginalia_nu
3/27/2025 at 4:07:00 PM
I'd love to see the rest of your postmortem template! I never thought about adding a "Where did we get lucky?" question.
I recently realized that one question for me should be, "Did you panic? What was the result of that panic? What caused the panic?"
I had taken down a network, and the device led me down a pathway that required multiple apps and multiple logins I didn't have to regain access. I panicked and, because the network was small, roamed and moved all devices to my backup network.
The following day, under no stress, I realized that my mistake was that I was scanning a QR code 90 degrees off from its proper orientation. I didn't realize that QR codes had a proper orientation and figured that their corner identifiers handled any orientation. Then it was simple to gain access to that device. I couldn't even replicate the other odd path.
by seeingnature
3/27/2025 at 5:27:37 PM
One of my favorite man pages is scan_ffs: https://man.openbsd.org/scan_ffs
The basic operation of this program is as follows:
1. Panic. You usually do so anyways, so you might as well get it over with. Just don't do anything stupid. Panic away from your machine. Then relax, and see if the steps below won't help you out.
2. ...
by somat
3/27/2025 at 4:43:53 PM
The standard SRE one recommended by Google has a lucky section. We tend to use it to talk about getting unlucky too.
by srejk
3/27/2025 at 6:08:49 PM
A good section to have is one on concept/process issues you encountered, which I think is a generalization of your question about panic.
For instance, you might be mistaken about the operation of a system in some way that prolongs an outage or complicates recovery. Or perhaps there are complicated commands that someone pasted in a comment in a Slack channel once upon a time and you have to engage in gymnastics with Sloogle™ to find them, while the PM and PO are requesting updates. Or you end up saving the day because of a random confluence of rabbit holes you'd traversed that week, but you couldn't expect anyone else on the team to have had the same flash of insight that you did.
That might be information that is valuable to document or add to training materials before it is forgotten. A lot of postmortems focus on the root cause, which is great and necessary, but don't look closely at the process of trying to stop the bleeding.
by nathan_douglas
3/27/2025 at 5:48:51 PM
No, QR codes are auto-orienting[1]. If you're getting a different reading at different orientations, there is a bug in your scanner.
by parliament32
3/27/2025 at 7:16:49 PM
It does seem to be possible to design QR codes that scan differently depending on the orientation, though they look a little visibly malformed.
https://hackaday.com/2025/01/23/this-qr-code-leads-to-two-we...
by egypturnash
3/27/2025 at 5:52:34 PM
> I didn't realize that QR codes had a proper orientation and figured that their corner identifiers handled any orientation.
Same, I assumed they were designed to always work. I suspect it was whatever app or library you were using that wasn't designed to handle them correctly.
by Suppafly
3/27/2025 at 3:52:14 PM
Imagine if you weren't working at Google and were trying to convince the Chromium team you found a bug in V8. That'd probably be nigh-impossible.
One thing I notice is that Google has no way whatsoever to actually just ask users "hey, are you having problems?", a definite downside of their approach to software development where there is absolutely no communication between users and developers.
by ivraatiems
3/30/2025 at 9:42:57 AM
I think you could, but you'd need a very convincing bug report.
by saagarjha
3/27/2025 at 7:15:03 PM
I suspect that minimising someone else’s work allows the commenters to feel better about themselves. As a general rule/perspective.
by jbs789
3/27/2025 at 5:30:59 PM
> In this instance, the author definitely got lucky that they were working at Google where 1) there were enough users to generate this Heisenbug consistently and 2) that they had direct access to Chrome devs.
I'm not sure this is really luck.
The fix is to just not use Math.abs. If they didn't work at Google they still would've done the same debugging and used the same fix. Working at Google probably harmed them, as once they discovered Math.abs didn't work correctly they could've just immediately used `> 0` instead of asking the Chrome team about it.
There's nothing lucky about slowly adding printf statements until you understand what the computer is actually doing; that's just good work.
by lesuorac