4/3/2026 at 5:29:17 AM
I think this is especially problematic (from Part 4 at https://isolveproblems.substack.com/p/how-microsoft-vaporize...): "The team had reached a point where it was too risky to make any code refactoring or engineering improvements. I submitted several bug fixes and refactoring, notably using smart pointers, but they were rejected for fear of breaking something."
Once you reach this stage, the only escape is to first cover everything with tests and then meticulously fix bugs, without shipping any new features. This can take a long time, and cannot happen without full support from management, who do not fully understand the problem nor are incentivized to understand it.
by branko_d
4/3/2026 at 6:05:54 AM
This isn't incentivized in a corporate environment. Noticed how "the talent left after the launch" is mentioned in the article? Same problem. You don't get rewarded for cleaning up a mess (despite lip service from management) nor for maintaining the product after the launch. Only big launches matter.
The other corporate problem is that it takes time before the cleanup produces measurable benefits and you may as well get reorged before this happens.
by praptak
4/3/2026 at 6:26:12 AM
This is the root of the issue. For something like Azure, people are not fungible. You need to retain them for decades, and carefully grow the team, training new members over a long period until they can take on serious responsibilities.

But employees are rewarded for showing quick wins and changing jobs rapidly, and employers are rewarded for getting rid of high earners (i.e. senior, long-term employees).
by InsideOutSanta
4/3/2026 at 6:58:39 AM
> For something like Azure, people are not fungible

What I've learned from a decade in the industry is that talent is never fungible in low-demand areas. It's surprisingly hard to find people that "get it" and produce something worthwhile together.
by delusional
4/3/2026 at 8:47:23 AM
I would say "systems design" rather than low-demand. People who can "reduce" a big system to build on a few simple concepts are few and far between. Most people just add more stuff instead.
by silvestrov
4/3/2026 at 10:27:33 AM
I think those people are around, they are just not rewarded by this kind of system. They can propose plans and fixes, they just don't get implemented.
by aeonik
4/3/2026 at 10:59:19 AM
“Simplicity is a great virtue but it requires hard work to achieve it and education to appreciate it. And to make matters worse: complexity sells better.” - Edsger Wybe Dijkstra
by srirangr
4/3/2026 at 3:31:00 PM
When things become too complicated, no one dares to make new systems. And if you don't make new systems, of course you have to learn system design the other way around: by fixing every bug of existing systems.
by markus_zhang
4/3/2026 at 6:08:28 PM
Simple ain’t Easy- Rich Hickey
by jimbokun
4/3/2026 at 3:04:16 PM
[dead]
by Whyachi
4/3/2026 at 11:46:50 AM
There are often retention problems with lean budgets, and after training staff they often do just leave for a more lucrative position. Loyalty will often not be rewarded, as most have seen companies purge decade-long senior staff a year before going public.
It is very easy to become cynical about the mythology of silicon valley. =3
by Joel_Mckay
4/3/2026 at 8:29:06 AM
What is a low-demand area?
by auggierose
4/3/2026 at 9:00:00 AM
A geographic area where there's not abundant opportunity for software developers. Usually everywhere outside the major metro areas. It was primarily meant to discount experiences from SF or Seattle, where I'm sure finding talent is easy enough, assuming you are willing to pay.
by delusional
4/3/2026 at 12:14:55 PM
I thought of this not as geographic but in terms of what’s sexy vs not. Low Demand = not
by grvdrm
4/3/2026 at 5:31:09 PM
Right, like running a sanitation department for a city. Who wants to do that? No one, but it's pretty important and everyone will raise hell and almost riot when it's not working.
by chasd00
4/3/2026 at 9:26:34 PM
Totally. I’m in insurance. So much is unsexy but critical. And that’s where you see a lot of folks churning on core systems, processes, etc. that make insurance actually work, vs any headline tech/investment/AI stuff. Don’t get me wrong - wins there too. But 22-year-old Harvard grads aren’t going for underwriting assistant jobs (to use an example).
by grvdrm
4/3/2026 at 3:28:41 PM
This is a human problem. We humans praise the doctors who can keep patients with terminal illnesses alive for extended periods, but ignore those who tell us the principles to prevent getting those illnesses in the first place. We throw flowers and money at doctors who treat cancers, but do we do the same for the ones who tell us how to avoid cancers? No.

The same goes for MSFT or any other similar problem. Humans only care when the house is on fire (under modern capitalism, that means the stock going down 50%), and only then will they have the will to make changes.
That’s also why reforms rarely succeed, and the ones that succeed usually follow a huge shitstorm when people begged for changes.
by markus_zhang
4/3/2026 at 4:51:32 PM
> Humans only care when the house is on fire

In a corporate context it's because that's, in theory, an effective use of resources:
If 20 teams are constantly saying "there is a huge risk of fire", a lot of mental energy is wasted figuring out how to stack-rank those 20 and how real each fire risk is. If instead you wait until there is a real fire, you can get the 15 teams actually fixing that one.
In practice, you've probably noticed that the most politics-playing & winning teams are the teams which are really effective at:
1) faking fires
2) exaggerating minor fires
3) moving fast & breaking things on purpose (or at least as a nice side effect) to create more fires in their area of ownership*, and getting rewarded with more visibility & headcount to fix those fires.
* As long as they have a firm grip on that area... If they don't, they risk having it re-orged to another team.
by 72f988bf
4/5/2026 at 3:52:45 AM
> If instead you wait when there is a real fire, you can get the 15 teams actually fixing that one.

In this case, with Microsoft's really amazing revenue stream, a charismatic management team can distort reality for quite some time and convince the right people within the company that there is no fire.
by replyifuagree
4/3/2026 at 6:30:18 PM
Yeah, the more "honest" side at least tried to fix it after the fire. The demagogue ones like to fake fires and move fast.
by markus_zhang
4/4/2026 at 1:22:03 PM
This is a capitalism problem. If you treat people well and give them the means to survive without trying to wring every red cent you can out of them, they'll be more likely to stick around and keep providing value.
by estimator7292
4/3/2026 at 3:12:44 PM
[dead]
by salemh
4/3/2026 at 4:45:11 PM
> You don't get rewarded for cleaning up mess (despite lip service from management) nor for maintaining the product after the launch

I have never worked at a shop or on a codebase where "move fast & break things, then fix it later" ever got to the "fix it later" part. I've worked at large orgs with large old codebases where the % of effort needed for BAU / KTLO slowly climbs to 100%. Usually it's some combination of tech debt accumulation, staffing reduction, and scale/scope increases pushing the existing system to its limits.
This is related to a worry I have about AI. I hear a lot of expectations that we're just going to increase code velocity 5x, from people who have never maintained a product before.
So moving faster & breaking more things (accumulating more tech debt) will probably have more rapid catastrophic outcomes for products in this new phase. Then we will have some sort of butlerian jihad or agile v2.
by steveBK123
4/3/2026 at 5:03:51 PM
People are still trying to figure out how to use AI. Right now the meme is that it's used by juniors to churn out slop, but I think people will start to recognize it's far more powerful in the hands of competent senior devs.

It actually surprised me that you can use AI to write even better code: tell it to write a test to catch the suspected bug, then tell it to fix the bug, then have it write documentation. Maybe also split out related functionality into a new file while you're at it.
I might have skipped all that pre-AI, but now all of it takes 15 minutes. And as a bonus, creating more understandable code allows AI to fix even more bugs. So it could actually become a virtuous cycle of using AI to clean up debt to understand more code.
In fact, right now, we're selling technical debt cleanup projects that I've been begging for for years as "we have to do this so the codebase will be more understandable by AI."
by asdfman123
4/4/2026 at 1:23:00 PM
Having worked on many long-lived projects for 5+ years at big firms, I think there's an aspect of project management being a dark art that will conflict with the hopes & dreams of AI.

Developer productivity is notoriously difficult to measure. Even feature velocity, cadence, or volume improvements are rarely noticed & acknowledged by users for long. They will always complain about speed and somehow notice slowdowns (and invent them in their heads as well).
I once joined a team that was in crisis; they couldn't ship for 6 months due to outages. We stabilized production, put in tests, introduced a better SDLC, and started shipping every 1-2 weeks. I swear to you that it was not more than a few months before stakeholders were whinging about velocity again. You JUST had zero, give me a break.
If you get a 3x one-off boost by adopting AI and then that’s the new normal, you’ll be shocked how little they pat you on the back for it. Particularly if some of that 3x is spent on tickets to “make the code easier for AI to understand”, testing, and low priority tickets in the backlog no one had bothered doing previously (seen a lot of these anecdotes). And god help you if your velocity slips after that 3x boost, they will notice the hell out of that.
by steveBK123
4/8/2026 at 8:50:50 AM
The problem is that if you want to be a serious cloud provider, you have to do exactly that. I am slowly moving my apps off of any Microsoft services, because they tend to be slow and buggy.

Also, they too often remove features from their products, and I have no desire to migrate working stuff just because MS wants to move people to other products.
And these problems have tended to get worse in recent times. PowerAutomate is exemplary of that for me: theoretically a neat tool that is well integrated into the cloud landscape, but practically you cannot implement reliable workflows with it, for numerous reasons.
> If you’re running production workloads on Azure or relying on it for mission-critical systems, this story matters more than you think.
Well, it doesn't explode, but I really question how reliable some of these systems really are. In my experience, not at all. There was, or is, some genuinely good engineering below some of these systems, but I think all the buggy fluff built upon it really introduces friction.
by raxxorraxor
4/6/2026 at 6:42:00 AM
Perhaps an important question is: why is it not incentivized in corporate environments?

I think, however, that perhaps I'm asking in the wrong arena. Unless there are people here reading this who work at the level of a corporate environment at which those decisions are made, it would really amount to guessing and stereotypes. Generally, I like to think that just about anyone can grasp that a well-made product will sell better due to its nature. So there must be some kind of mutual disconnect between the two sides, where one continues to see improvements as important, and the other fundamentally does not (or does not have a functional means to measure and verify them).
by registeredcorn
4/3/2026 at 7:24:18 PM
Meanwhile, failure to clean up this particular mess was a key factor in losing a trillion dollars in market cap, according to the author.
by jimbokun
4/4/2026 at 2:34:37 PM
It’s also a customer problem.

In a product where a customer has to apply (or be aware of) updates, it’s easier to excite them about new features than bug fixes.
Especially for winning over new customers.
If the changelog for a product’s last 5 releases is only bug fixes (or worse, “refactoring” that isn’t externally visible), most will assume either that development is dead or that the product is horribly bug-ridden - a bad look either way.
by BobbyTables2
4/3/2026 at 6:48:58 AM
It's a cool talent filter, though: if you're hiring people, the set of people that quit on doomed projects, and how fast they quit, is a really great indicator of technology-evaluation skills.
by cineticdaffodil
4/3/2026 at 8:27:52 PM
> This isn't incentivized in corporate environment.

'Course it is. But only by the winners who reward the employees who do the valuable work. Microsoft has all sorts of stupid reasons why they have lots of customers - all basically proxies for their customers' IT staff being used to administrating Microsoft-based systems - but if they mess up the core reasons to use a cloud enough, they will fail.
by philipallstar
4/3/2026 at 4:53:10 PM
You do, but you then make a career out of it: you become the fixer (and it can be a very good career, either technical or managerial).
by Agingcoder
4/3/2026 at 8:30:30 AM
No joke, I worked at a place where, in our copy of the system headers, we had to #define near and far to nothing. That was because (despite not having supported any systems where this was applicable for more than a decade) there was a set of files considered too risky to change that still had DOS-style near and far pointers, which we had to compile for a more sane linear address space. https://www.geeksforgeeks.org/c/what-are-near-far-and-huge-p...

Now, I'm just a simple country engineer, but a sane take on risk management probably doesn't prefer de facto editing files by hijacking keywords with preprocessor magic over, you know, just making the actual change, reviewing it, and checking it in.
by monocasa
4/3/2026 at 5:51:27 AM
Once you reach this stage, the only escape is to jump ship. Either mentally or, ideally, truly.

You're in an unwinnable position. Don't take the brunt of management's mistakes. Don't try to fix what you have no agency over.
by gherkinnn
4/3/2026 at 6:37:44 AM
Unfortunately, what you will find is that unless you get lucky, the next ship is more of the same.

The system/management style is ingrained in the corporate culture of large-ish companies (I would say that if it has more than 2 layers of management between you and someone owning the equity of the business and calling the shots, it's "large").
It stems from the fact that when an executive is bestowed the responsibility of managing a company by the shareholders, the responsibility is diluted, and the principal-agent problem rears its ugly head. When several more layers of this start growing in a large company, the divergence grows, and the path of least resistance is to have zero trust in the "subordinates", lest they make a choice that is contrary to what their managers want.
The only way to make good software is to have a small, nimble organization, where the craftsman (doing the work) makes the call, gets the rewards, and suffers the consequences (if any). That aligns the agent and the principal.
by chii
4/3/2026 at 6:55:05 AM
Hierarchy is the enemy of succeeding projects and of information flow. The more important and complex the hierarchy in a culture, the less likely it is to have a working software industry. Germany's and Japan's endless "old vs young, seniority vs new, internal vs external, company-wide management vs project-local management" come to mind. It's guerrilla vs army, startup vs company all over.
by cineticdaffodil
4/3/2026 at 12:25:23 PM
As someone in the DACH space, the internal/external divide goes to the extreme of externals not being allowed to use any company infrastructure used by the internals, including some basic stuff like the coffee machine or the canteen.

I had team lunches that only happened because, naturally, the team couldn't care less about the regulations and found workarounds, like meeting by "chance" at the same place, where apparently no other set of tables was available.
by pjmlp
4/4/2026 at 2:59:07 AM
> I would say if it has more than 2 layers of management from you to someone owning the equity of the business and calling the shots, it's "large"

By that metric, my 50-employee company is "large".
by bigstrat2003
4/4/2026 at 6:58:11 AM
Well, does this company have more than 2 layers of management? Why do you need that much for only 50 people, instead of empowering those people to make choices (after training and providing guidance on what makes for a good choice in various circumstances)?
by chii
4/3/2026 at 1:09:10 PM
I was once in such a position. I persuaded management to first cover the entire project with an extensive test suite before touching anything. It took us around 3 months to have "good" coverage, and then we started refactoring the parts that were 100% covered. 5 months in, the shareholders got impatient and demanded "results". We were not ready yet, and in their mind we were doing nothing. No amount of explanation helped; they thought we were just adding superficial work ("the project worked before and we were shipping new features! Maybe you are just not skilled enough?"). Eventually they decided to scrap the whole thing. The project was killed and the entire team sacked.
by varispeed
4/3/2026 at 6:28:23 PM
I’m a developer, and if a team spent five months only refactoring with zero features added, I would fire you too.

Refactoring and quality improvements must happen incrementally and in parallel with shipping new features and fixing bugs.
by jimbokun
4/4/2026 at 2:35:47 AM
I'm a director and one of our teams just spent 8 months doing just that and it was totally justified. They're finally coming up for air and the foundation is significantly improved.There's nuance here. Every project/team/org is different.
by bmurphy1976
4/4/2026 at 6:52:46 AM
Welcome to Microsoft! Enjoy the ever-growing backlog of bugs to fix!
by eviks
4/3/2026 at 11:21:21 AM
> first cover everything with tests

Beware this goal. I'm dealing with the consequences of TDD taken way too far right now. Someone apparently had this same idea.
> management who do not fully understand the problem nor are incentivized to understand it
They are definitely incentivized to understand the problem. However the developers often take it upon themselves to deceive management. This happens to be their incentive. The longer they can hoodwink leadership, the longer they can pad their resume and otherwise play around in corporate Narnia.
It's amazing how far you can bullshit leaders under the pretense of how proper and cultured things like TDD are. There are compelling metrics and it has a very number-go-up feel to it. It's really easy to pervert all other aspects of the design such that they serve at the altar of TDD.
Integration testing is the only testing that matters to the customer. No one cares if your user service works flawlessly with fake everything plugged into it. I've never seen it not come off like someone playing SimCity or Factorio with the codebase in the end.
by bob1029
4/3/2026 at 4:19:48 PM
Customers don’t care about your testing at all. They care that the product works.

Like most things, the reality is that you need a balance. Integration tests are great for validating complex system interdependencies. They are terrible for testing code paths exhaustively. You need both integration and unit testing to properly evaluate the product. You also need monitoring, because your testing environment will never 100% match what your customers see. (If it does, your system is probably trivial, and you don’t need those integration tests anyway.)
by dpark
4/3/2026 at 5:55:11 PM
Integration tests (I think we call them scenario tests in our circles) also tend to test only the happy paths. There are no guarantees that your edge cases, and anything unusual such as errors from other tiers, are covered. In fact the scenario tests may just be testing mostly the same things as the unit tests, but from a different angle. The only way to be sure everything is covered is through fault injection and/or single-stepping, but it’s a lost art. Relying only on automated tests gives a false sense of security.
by axelriet
4/3/2026 at 2:52:38 PM
Unit tests are just as important as integration tests, as long as they're tightly scoped to business logic and aren't written just to improve coverage. Anything can be done badly, especially if it is quantified and used as a metric of success (Goodhart's law applies).

Integration tests can be just as bad in this regard. They can be flaky, take hours, give you a false sense of security, and not even address the complexity of the business domain.
I've seen people argue against unit tests because they force you to decompose your system into discrete pieces. I hope that's not the core concern here, because a well-decomposed system is easier to maintain and extend, as well as to write unit tests for.
by caoilte
4/3/2026 at 4:53:47 PM
The problem with unit tests these days is that AI writes them entirely, and does a great job at it. That defeats the purpose of unit tests in the first place, since the human doesn't have the patience to review the reams of over-mocked test code produced by AI.

The end result of this is things like the code leak of Claude Code, presumably caused by AI-generated CI/CD packaging code nobody bothered to review, since the attitude is: who reviews test or CI/CD code? If they break, big deal; AI will fix it.
by bwfan123
4/3/2026 at 3:38:16 PM
“Premature abstraction” forced by unit tests can make systems harder to maintain.
by senderista
4/3/2026 at 6:17:48 PM
It can, but more often it’s the opposite.

Code that’s hard to write tests for tends to be code that’s too tightly coupled and lacking proper interface boundaries.
by jimbokun
4/3/2026 at 4:34:50 PM
The problem is people make units too small. A unit is not an isolated class or function (it can be, but usually isn't). A unit is one of those boxes you see on architecture diagrams.
by bluGill
4/3/2026 at 4:15:53 PM
Inability to unit test is usually either a symptom of poor system structure (e.g. components are inappropriately coupled) or an attempt to shoehorn testing into the wrong spot.If you find yourself trying to test a piece of code and it’s an unreasonable effort, try moving up a level. The “unit” you’re testing might be the wrong granularity. If you can’t test a level up, then it’s probably that your code is bad and you don’t have units. You have a blob.
by dpark
4/3/2026 at 9:17:44 PM
If you're writing the tests after writing the code, you're not doing TDD, though.
by carols10cents
4/3/2026 at 8:36:18 AM
> Once you reach this stage, the only escape is to first cover everything with tests and then meticulously fix bugs

The exact same approach is recommended in the book "Working Effectively with Legacy Code" by Michael Feathers, with several techniques on how to do it. He describes legacy code as "code with no tests".
by hikarudo
4/3/2026 at 4:08:16 PM
"Show me the incentives, and I will show you the outcomes" - Charlie Munger

I once worked in a shop where we had high and inflexible test-coverage requirements. Developers eventually figured out that you could run a bunch of random scenarios and then assert true in the finally clause of the exception handler. Eventually you'd be guaranteed to cover enough to get by that gate.
Pushing back on that practice led to a management fight about feature velocity and externally publicized deadlines.
by coredog64
4/3/2026 at 3:34:37 PM
It is so hard to test those codebases too. A lot of the time there's IO and implicit state changes through the code. Even getting testing in place, let alone good testing, is often an incredibly difficult task. And no one will refactor the code to make testing easier because they're too afraid to break the code.
by staticassertion
4/3/2026 at 6:33:04 AM
> I submitted several bug fixes and refactoring, notably using smart pointers, but they were rejected for fear of breaking something.

And that, my friends, is why you want a memory-safe language with as many static guarantees as possible, checked automatically by the compiler.
by dbdr
4/3/2026 at 12:53:04 PM
Language choices won't save you here. The problem is organizational paralysis. Someone sees that the platform is unstable. They demand something be done to improve stability. The next management layer above them demands they reduce the number of changes made to improve stability.
by sidewndr46
4/3/2026 at 3:16:04 PM
Usually this results in approvals to approve the approval to approve making the change. Everyone signed off on a tower of tax forms about the change, no way it can fail now! It failed? We need another layer of approvals before changes can be made!
by teeray
4/3/2026 at 1:51:00 PM
Yeah, I've seen that move pulled. Funnily enough, by an ex-Microsoft manager.
by cogman10
4/3/2026 at 9:59:35 AM
Hence the rewrite-it-in-Rust initiative, presumably. Management were aware of this problem at some level but chose a questionable solution. I don't think rewriting everything in Rust is at all compatible with their feature timelines or severe shortages of systems programming talent.
by mike_hearn
4/3/2026 at 3:01:40 PM
In a rewrite you can smuggle in a quality lift.
by cineticdaffodil
4/3/2026 at 1:40:25 PM
I had a memory management problem, so I introduced GC/ref counting, and now I have a non-deterministic memory management problem.
by CoolGuySteve
4/5/2026 at 6:54:35 AM
Ref counting is deterministic. Rust memory management is also deterministic: the memory is freed exactly when the owner of the data goes out of scope (and the borrow checker guarantees at compile time that there is no use after that).
by dbdr
4/5/2026 at 2:19:35 PM
Cool, now use the reference on another thread.
by CoolGuySteve
4/5/2026 at 10:21:32 PM
If you used Rust, you would know that problem is solved too.

Rust solves a lot of problems, and introduces others. The promiscuous package management, chiefly: it is not unusual for building a small programme in Rust to bring in 200+ crates, from unknown authors on the Internet...

What could possibly go wrong?
by worik
4/3/2026 at 10:45:33 AM
They could have started with simple Valgrind sessions before moving to Rust, though. A massive number of agents means microservices, and microservices are suitable for profiling/testing like that.
by bayindirh
4/3/2026 at 12:31:54 PM
Visual Studio has had quite some tooling similar to it, and you can have static analysis turned on all the time. SAL also originated with the XP SP2 issues.
Just like there have been tons of tools trying to fix C's flaws.
However, the big issue with opt-in tooling is exactly that it is optional, and apparently Microsoft doesn't enforce it internally as much as we thought.
by pjmlp
4/3/2026 at 12:38:40 PM
> However the big issue with opt-in tooling is exactly it being optional

That's true, and that's a problem.
> and apparently Microsoft doesn't enforce it internally as much as we thought.
But this, in my eyes, is a much bigger problem. It's baffling considering what Microsoft does as their core business: operating systems, high-impact software.
> Visual Studio has had quite some tooling similar to it, and you can have static analysis turned on all the time.
Eclipse CDT, which is not as capable as VS, is nonetheless not a toy and has the same capability: always-on static analysis plus Valgrind integration. I used both without any reservation, and this habit paid dividends at every level of development.
I believe in learning the tool and the craft more than the tool itself, because you can always hold something wrong. Learning the capabilities and limits of whatever you're using is a force multiplier, and considering how fierce the competition is within companies, leaving that kind of force multiplier on the table is unfathomable from my PoV.
Every tool has limits and flaws. Understanding them and being disciplined enough to check your own work is indispensable. Even if you're using something which prevents a class of footguns.
by bayindirh
4/3/2026 at 3:36:18 PM
I think the core business of MSFT has always been building a platform, grabbing everyone in, and seeking rent. Bill figured this out back in 1975, so it has been super successful.

The OS was that platform, but in Azure it is just the lowest layer, so maybe management just doesn't see it, as long as the platform works and the government contracts keep coming in. Then you have a bunch of yes-man engineers (I'm so surprised that any principal engineer, who should be financially free, could push out plans like the ones described by the author in this series) who give management false hopes.
by markus_zhang
4/3/2026 at 3:54:55 PM
One reason why Windows is a mess is that Satya sees Azure as an actual "Azure OS", Windows' version of OS/360.
Just two months ago,
https://blogs.windows.com/windowsexperience/2026/02/26/annou...
by pjmlp
4/3/2026 at 6:10:09 PM
It’s org-dependent. On Windows, SAL and OACR are kings, plus any contraption MSR comes up with that they run on checked-in code, filing bugs on you out of the blue :) Different standards.
by axelriet
4/3/2026 at 6:38:40 AM
I was waiting for that comment :) Remember that everybody, eventually, calls into code written in C.
by axelriet
4/3/2026 at 8:10:00 AM
If 90% of the code I run is in safe Rust (including the part that's new and written by me, and therefore most likely to introduce bugs) and 10% is in C or unsafe Rust, are you saying that has no value?

Il meglio è l'inimico del bene. Le mieux est l'ennemi du bien. Perfect is the enemy of good.
by dbdr
4/3/2026 at 8:48:07 AM
That is an unexpected interpretation. Use the best tool for the job, also factoring in what you (and your org) are comfortable with.
by axelriet
4/4/2026 at 11:47:17 AM
[flagged]
by RyujiYasukochi
4/3/2026 at 9:54:18 AM
Depends on which OS we are talking about. I know a few where that doesn't hold, including some still being paid for in 2026.
by pjmlp
4/3/2026 at 8:28:40 AM
If you're sufficiently stubborn, it's certainly possible to call directly into code written in Verilog, held together with inscrutable Perl incantations.

High-level languages like C certainly have their place, but the space seems competitive these days. Who knows where the future will lead.
by tux3
4/3/2026 at 10:15:11 AM
If you want something extra spicy, there are devices out there that implement CORBA in silicon (or at least FPGA), exposing a remote object accessible using CORBA.
by p_l
4/3/2026 at 8:45:55 AM
You didn’t miss the smiley, did you? :)
by axelriet
4/3/2026 at 12:41:31 PM
I didn't miss the smiley =)
by tux3
4/3/2026 at 2:58:59 PM
It’s worse than that. Eventually everybody calls into code that hits hardware. That is the level at which the compiler (ironically?) can no longer make guarantees. Registers change outside the scope of the currently running program all the time. Reading a register can cause other registers on a chip to change. Random chips with access to a shared memory bus can modify memory that the compiler deduced was static. There be dragons everywhere at the hardware layer, and no compiler can ever reason correctly about all of them, because, guess what, rev 2 of the hardware could swap in a footprint-compatible chip clone with undocumented behavior. So even if you gave all your board information to the compiler, the program could only be verifiably correct for one potential state of one potential hardware rev.
by milesvp
4/3/2026 at 3:31:05 PM
Sure, but eliminating bugs isn't a binary where you either eliminate all of them or it's a useless endeavor. There's a lot of value in eliminating a lot of bugs, even if it's not all of them, and I'd argue that empirically Rust does actually make it easier to avoid quite a large number of bugs that are often found in C code, in spite of what you're saying.

To be clear, I'm not saying that I think it would necessarily be a good idea to try to rewrite an existing codebase that a team apparently doesn't trust they actually understand. There are a lot of other factors that would go into deciding to do a rewrite than just "would the new language be a better choice in a vacuum", and I tend to be somewhat skeptical that rewriting something that's already widely being used is possible in a way that doesn't risk breaking something for existing users. That's pretty different from "the language literally doesn't matter because you can't verify every possible bug on arbitrary hardware", though.
by saghm
4/3/2026 at 6:05:26 PM
The hardware only understands addresses and offsets, aka pointers :)
by axelriet
4/3/2026 at 6:06:57 PM
All the more reason to have memory safety on top.
by mlsu
4/3/2026 at 10:52:59 AM
Did you miss the part that writes about the "all new code is written in Rust" order coming from the top? It also failed miserably.
by flohofwoe
4/3/2026 at 12:34:40 PM
That was quite interesting, and now I will take another point of view on the stuff I shared previously.
However, given how anti-anything-not-C++ the Windows team has been, it is not surprising that it actually happened like that.
by pjmlp
4/3/2026 at 6:17:46 PM
It came from the top of Azure and for Azure only. Specifically, the mandate covered all new code that cannot use a GC, i.e. no more new C or C++.
I think the CTO was very public about that at RustCon and other places where he spoke.
The examples he gave were contrived, though, mostly tiny bits of old GDI code rewritten in Rust as success stories to justify his mandate. Not convincing at all.
Azure node software can be written in Rust, C, or C++; it really does not matter.
What matters is who writes it: it should be seen as “OS-level” code requiring the same focus as actual OS code given its criticality, and should therefore probably be written by the Core OS folks themselves.
by axelriet
4/3/2026 at 6:44:43 PM
I have followed it from the outside, including talks at Rust Nation.
However, the on-the-ground reality you describe is quite different from e.g. the Rust Nation UK 2025 talks, or those being given by Victor Ciura.
It seems more in line with the rejections that took place against previous efforts regarding Singularity, Midori, the Phoenix compiler toolchain, Longhorn, ... only to be redone with WinRT and COM, in C++ naturally.
by pjmlp
4/6/2026 at 5:41:29 PM
Because neither C nor C++ creates friction.
The whole memory-safety chapter is a human problem first and foremost.
Some humans haven’t written a memory-safety bug in decades, but it requires a discipline the recent hire never acquired.
I always advocated fixing issues at their root. Humans write bugs, fix the humans. Somehow this was always regarded as taboo ever since I started at Microsoft in 2013.
by axelriet
4/3/2026 at 8:43:12 PM
May I ask, what kind of training do new joiners of the kernel team (or any team that effectively writes kernel-level code) get? Especially if they haven't written kernel code professionally -- or do they ONLY hire people who have written a non-trivial amount of kernel code?
by markus_zhang
4/5/2026 at 10:48:08 PM
There is no formal training (like a bootcamp or classes), but the larger org has extensive documentation (osgwiki) and you are expected to learn and ramp up by yourself.
I don’t think there is any kernel-code-writing experience requirement, but the hiring bar is sky-high: you have to demonstrate that you are a programmer.
by axelriet
4/3/2026 at 9:17:33 AM
Once you reach this stage, honestly the only escape is real escape. Put your papers in and start looking for a job elsewhere, because when they go down, they will go down hard and drag you with them. It's not like you didn't try.
by neya
4/4/2026 at 5:56:47 AM
Though this doesn't make much sense on its surface: a bug means something is already broken, and he tells of millions of crashes per month, so it was visibly broken. A 100% chance of being broken (the bug) > some chance of breakage from fixing it (sure, the value of the current and potential bugs isn't accounted for here, but then neither is it in "afraid to break something, do nothing").
by eviks
4/4/2026 at 6:16:26 AM
I've experienced a nearly identical scenario, where a large fleet of identical servers (Citrix session hosts) were crashing at a "rate" high enough that I had to "scale up" my crash dump collection scripts with automated analysis, distribution into about a hundred buckets, and then per-bucket statistical analysis of the variables. I had to compress, archive, and then simply throw away crash dumps because I had too many.
It was pure insanity: the crashes were variously caused by things like network drivers so old and vulnerable that "drive-by" network scans by malware would BSOD the servers. Alternatively, successful virus infections would BSOD the servers because the viruses were written for desktop editions of Windows and couldn't handle the differences in the server edition, so they'd just crash the system. On and on. It was a shambling zombie horde, not a server farm.
I was made to jump through flaming hoops backwards to prove beyond a shadow of a doubt that every single individual critical Microsoft security patch a) definitely fixed one of the crash bugs and b) didn't break any apps.
I did so! I demonstrated a 3x improvement in overall performance -- which by itself is staggering -- and that BSODs dropped by a factor of hundreds. I had pages written up on each and every patch, specifically calling out how they precisely matched a bucket of BSODs exactly. I tested the apps. I showed that some of them that were broken before suddenly started working. I did extensive UAT, etc.
"No." was the firm answer from management.
"Too dangerous! Something could break! You don't know what these patches could do!" etc, etc. The arguments were pure insanity, totally illogical, counter to all available evidence, and motivated only by animal fear. These people had been burned before, and they're never touching the stove again, or even going into the kitchen.
You cannot fix an organisation like this "from below" as an IC, or even a mid-level manager. CEOs would have a hard time turning a ship like this around. Heads would have to roll, all the way up to CIO, before anything could possibly be fixed.
by jiggawatts
4/4/2026 at 6:25:06 AM
Yeah, long periods of total dysfunction get ingrained.
Though just to ref my original point:
> burned before, and they're never touching the stove again
Except they are sitting on the stove with their asses burning, which cuts all the needed cooling off their heads!
by eviks
4/4/2026 at 9:43:21 AM
The better analogy is that they ran out of the kitchen in a panic, and left the pots on the burners. Some time later there is smoke curling up from under the kitchen door, but they’re used to the burning smell by now so it’s “not that big a deal”.
by jiggawatts
4/4/2026 at 2:02:33 PM
> Once you reach this stage, the only escape is to first cover everything with tests and then meticulously fix bugs, without shipping any new features.
Isn't this where Oracle is with their DB? Wasn't HN complaining about that?
by bombcar
4/3/2026 at 5:39:57 AM
Or to simplify the product and rebuild.
by idorosen
4/3/2026 at 12:26:19 PM
“Rebuild” is a four-letter word at this stage too, though. The customer has a panel of knob-and-tube wiring and aluminum paper-wrapped wire in the house. They want a new hot tub. They don’t want some electrician telling them they need to completely rewire their house first at huge expense, such that they cannot afford the hot tub anymore. They’ll just throw the electrician out and get some kid in a pickup truck (“You’re Absolutely Right Handyman LLC”) to run a lamp cord to their new hot tub. Once the house burns to the ground, the new owners will wire their new construction correctly.
by teeray
4/3/2026 at 5:41:50 AM
Exactly. But he’s right about management: first the problem must be acknowledged, and that may make some people look bad.
by axelriet
4/3/2026 at 9:34:39 AM
Writing tests and then meticulously fixing bugs does not increase shareholders' value.
by egorfine
4/4/2026 at 8:30:56 PM
Dave Cutler and his team are a clear counter-example. They famously shipped Windows NT with zero known bugs, which clearly brought enormous shareholder value.
The problem, of course, is that this sort of thing doesn’t bring value next quarter.
by branko_d
4/3/2026 at 7:23:19 AM
Once you reach this stage, the only escape is to give up on it and move on.
Some things are beyond your control and capabilities.
by rk06
4/3/2026 at 6:37:03 AM
If the service is so shitty, why are people paying so much fucking money for it?
Is Microsoft committing accounting fraud?
by doctorpangloss
4/3/2026 at 10:03:01 AM
I worked at a startup that was using Azure. The reason was simple enough: it had been founded by finance people who were used to Excel, so Windows+Office was the non-negotiable first bit of IT they purchased. That created a sales channel Microsoft used to offer generous startup credits. The free money created a structural lack of discipline around spending. Once the startup credits ran out, the company was faced with a huge bill and had difficulty motivating people to conserve funds.
At the start I didn't have any strong opinion on what cloud provider to use. I did want to do IT the "old fashioned way": rent a big-ass bare-metal server or cloud VM, issue UNIX user accounts on it, and let people do dev/test/ad hoc servers on that. Very easy to control spending that way, very easy to quickly see what's using the resources and impose limits, link programs to people, etc. I was overruled as obviously old fashioned and not getting with the cloud programme. They ended up bleeding a million dollars a month and the company wasn't even running a SaaS!
I ended up with a very low opinion of Azure. Basic things like TCP connections between VMs would mysteriously hang. We got MS to investigate, they made a token effort and basically just admitted defeat. I raged that this was absurd as working TCP is table stakes for literally any datacenter since the 1980s, but - sad to say - at this time Azure's bad behavior was enabled by a widespread culture of CV farming in which "enterprise" devs were all obsessed with getting cloud tech onto their LinkedIn. Any time we hit bugs or stupidities in the way Azure worked I was told the problem was clearly with the software I'd written, which couldn't be "cloud native", as if it was it'd obviously work fine in Azure!
With attitudes like that completely endemic outside of the tech sector, of course Microsoft learned not to prioritize quality.
We did eventually diversify a bit. We needed to benchmark our server software reliably and that was impossible in Azure because it was so overloaded and full of noisy neighbours, so we rented bare metal servers in OVH to do that. It worked OK.
by mike_hearn
4/3/2026 at 8:26:02 PM
"Basic things like TCP connections between VMs would mysteriously hang"
This is like a car that can't even get you two blocks from home. Amazing.
by jrl
4/3/2026 at 12:44:13 PM
I have had bad experiences across all major vendors.
The main reason I used to push for Azure instead in recent years was the friendliness of their Web UIs, and having the VS Code integration (it started as an Azure product after all).
by pjmlp
4/3/2026 at 2:47:53 PM
Friendliness?
VS Code integration out of the box, that I can understand. But I have a really hard time calling the Azure UI "friendly". Everything is behind layers of nested pointy-clicky chains with opaque or flat-out misleading names.
To make things worse, their APIs also follow the same design. Everything you actually would want to do is behind a long sequence of pointer-chasing across objects and service/resource managers. Almost as if their APIs were built to directly reflect their planned UI action sequences.
by bostik
4/3/2026 at 2:52:59 PM
Yes, some of us grew out of the 1970s approach to the command line, unless there is no other way.
GCP is the worst: some options are only available on the CLI, without any visual feedback on the dashboard.
by pjmlp
4/3/2026 at 12:18:35 PM
Corporate inertia. A sibling comment uses the term "hostage situation", which I admit is pretty apt.
Microsoft is an approved vendor in every large enterprise. That they have been approved for desktop productivity, Sharepoint, email, and on-prem systems does not enter the picture. That would be too nuanced.
Dealing with a Large Enterprise[tm] is an exercise in frustration. A particular client had to be deployed to Azure because their estimate was that getting a new cloud vendor approved for production deployments would be a gargantuan 18-to-24 month org-wide and politically fraught process.
If you are a large corp and have to move workloads to the cloud (because let's be honest: maintaining your own data centres and hardware procurement pipelines is a serious drag) then you go with whatever vendor your organisation has approved. And if the only pre-approved vendor with a cloud offering is Microsoft, you use Azure.
by bostik
4/3/2026 at 7:15:35 AM
The US government’s experts called Azure “a pile of shit”; they got overruled.
https://www.propublica.org/article/microsoft-cloud-fedramp-c...
by rawgabbit
by rawgabbit
4/3/2026 at 7:42:05 AM
Because Azure customers are companies that still, in 2026, only use Windows. Anyone else uses something else. Turns out, companies like that don't tend to have the best engineering teams. So moving an entire cloud infrastructure from Azure to, say, AWS is probably either really expensive, really risky, or too disruptive for the type of engineering team that Azure customers have. I would expect MS to bleed from this slowly for a long time until they actually fix it. I seriously doubt they ever will, but stranger things have happened.
by hunterpayne
4/3/2026 at 9:59:17 AM
Turns out that outside of companies shipping software products and aspiring to be the next Google or Apple, most companies working outside the software industry also need software to run their business, and they couldn't care less about the HN technology cool factor.
They use whatever they can to ship their products into trucks, outsourcing their IT and development costs, and that is about it.
by pjmlp
4/3/2026 at 11:56:18 AM
Agreed, though only up to a point.
Companies that need software to run their business need that software to run. When your operations are constantly hampered by Azure outages and your competitors' are not, you're not going to last if your market is at all competitive. Thankfully for many companies, a lot of markets aren't, I suppose, at least for the actors who have established a successful rent and no longer need to care how their business operations are going.
by Balinares
4/3/2026 at 3:28:12 PM
I have worked at two retail companies where AWS was a no-no. They didn't want to have anything depending on a competitor (Amazon), so they went the Azure route.
by MyHonestOpinon
4/3/2026 at 8:20:41 AM
CFOs love it because Microsoft does bundle pricing with Office. Plus they love to give large credits to bootstrap lock-in.
by bradleyjg
4/3/2026 at 3:36:00 PM
You’re assuming the alternatives don’t have just as many issues. There’s been exactly one “whistleblower”, who is probably tiptoeing the line of a lawsuit. Just because there isn’t a similar disgruntled GCP or AWS engineer doesn't mean they don't have similar problems.
by tw04
4/3/2026 at 4:08:24 PM
This made me look into how cloud hypervisors actually work at the HW level: they all offload it to custom hardware (smart NICs, FPGAs, DPUs, etc.). The CPU does almost nothing except tenant work. AWS -> Nitro, Azure -> FPGA, NVIDIA sells DPUs.
Here is an interactive visual guide if anyone wants to explore: https://vectree.io/c/cloud-virtualization-hardware-nitro-cat...
by functional_dev
4/5/2026 at 7:10:20 AM
VM management does not run on the FPGA; it’s regular Win32 software on Windows, with aspirations to run some equivalent, someday, on the SoC next to the FPGA on the NIC. The programmable hardware is used for network paths and PCIe functions, where it can project NICs and NVMe devices into VMs to bypass the software-based, VMBus-backed virtual devices, all of which end up being serviced on the host that controls the real hardware. Look up SR-IOV for the bypass. So yes, that’s I/O bypass/offload, but the VM management stack offload is a distinct thing that does not require an FPGA, just a SoC.
by axelriet
4/3/2026 at 9:10:06 AM
Most of the upper management of the companies who use them don't have the technical competence to see it (e.g. banks, supermarket chains, manufacturing companies).
Once they are in, no one likes to admit they made a mistake.
by miyuru
4/3/2026 at 3:56:10 PM
Depending on the space you work in, you have almost no choice at all. If you're building for government then you're going to use Microsoft, almost "end of story".
by staticassertion
4/3/2026 at 6:49:21 AM
It’s more of a hostage situation.
by fxtentacle
4/5/2026 at 1:31:45 AM
Yeah, it’s entirely business people and executives who make these decisions in most companies, not the ones who use it or implement on it.
by llama052
4/3/2026 at 9:14:42 AM
Because the alternatives are also in a similar state.
AWS and GCP are pretty crap too. Use any of them and you'll hit just enough rough edges. The whole industry is just grinding out slop; quality is not important anywhere.
I work with AWS on a daily basis, and I'm not really impressed. (Nor did GCP impress me in the short encounter I had with it.)
by fodkodrasz
4/3/2026 at 12:48:12 PM
I don't know about AWS or the rest of GCP, but in terms of engineering, my experience of GCE was at least an entire order of magnitude better than what the article alleges about Azure. Security and reliability were taken extremely seriously, and the quality of the engineering was world-class. I hope it has stayed like this since then. It was a worthwhile thing to experience.
by Balinares
4/3/2026 at 3:57:15 PM
This isn't it at all. AWS does not have the same sorts of insane cross-tenancy exploits that Azure has had, for example.
The reason that Azure has so many customers is very simply because Azure is borderline mandated by the US government.
by staticassertion