2/11/2026 at 4:58:08 PM
Hello! Railway founder here.
We'll have a post mortem for this one, as we always write post mortems for anything that affects users.
Our initial investigation reveals this affects <3% of instances
Apologies from myself + the Team. Any amount of downtime is completely unacceptable
You may monitor this incident here: https://status.railway.com/cmli5y9xt056zsdts5ngslbmp
by justjake
2/11/2026 at 5:12:37 PM
Hi Jake. Appreciate your presence here on HN.
This affected a seemingly random set of services across three of my accounts (pro and hobby, depending on whether they're for work or just for myself). That ranges from Wordpress to static site hosting to a custom Python server. All of the deployments showed as Online, even after receiving a SIGTERM.
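For anyone who wants that kill to at least show up in their own logs, here's a minimal sketch (Python standard library, Unix only; purely illustrative, nothing Railway-specific) that logs the SIGTERM before exiting:

    # Log a platform-initiated SIGTERM before exiting, so the kill leaves a
    # trace in your own logs even if the dashboard still reports Online.
    import logging
    import signal
    import sys

    logging.basicConfig(level=logging.INFO)

    def handle_sigterm(signum, frame):
        logging.warning("Received SIGTERM; shutting down")
        # Flush/close whatever your real server holds open, then exit cleanly.
        sys.exit(0)

    signal.signal(signal.SIGTERM, handle_sigterm)

    # Placeholder for a real server loop: block until a signal arrives.
    signal.pause()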
While 3% is 'good', that's an awfully wide range of things across multiple accounts for me, so it doesn't feel like 3% ;) Please publish the post mortem. I am a big fan of Railway but have really struggled with the number of issues recently. You don't want to get GitHub's growing rep. Some people are already requesting I move one key service away, since this is not the first issue.
Finally, can I make a request re communication:
> If you are experiencing issues with your deployment, please attempt a re-deploy.
Why can't Railway restart or redeploy any affected service? This _sounds_ like you're requiring 3% of your users to manually fix the issue. I don't know if that's a communication problem or the actual solution, but I certainly had to do it manually, server by server.
by vintagedave
2/11/2026 at 5:27:36 PM
Totally! People who see the impact will likely see more than, say, 3% of their services impacted. Not all disruption is created equal.
We rolled out a change to update our fraud model, and that uses workload fingerprinting.
Since, in all likelihood, your projects are similarly structured, there will be more impacted workloads if the shape of your workloads was in the "false positive" set.
Will have more information soon but very valid (and astute) feelings!
by justjake
2/11/2026 at 6:25:38 PM
> We rolled out a change to update our fraud model, and that uses workload fingerprinting.
> Since, in all likelihood, your projects are similarly structured...
Thanks for the info. For what it's worth and to inform your retrospective, this included:
* A Wordpress frontend, with just a few posts, minimal traffic -- but one that had been posted to LinkedIn yesterday
* A Docusaurus-generated static site. Completely static.
* A Python server whose workload would show OpenAI API usage, with consistent behavioural patterns for at least two months (and I am strongly skeptical it would show different patterns from any other hosted service that calls OpenAI).
These all seem pretty different to me. Some that _are_ similarly structured (eg a second Python OpenAI-using server) were not killed.
Some things come to mind for your post-mortem:
* If 3% of your services were affected, does that match your expected fraud rate? That is an awful lot of customers to take down in one go, and you'd want to be very accurate in your modeling. I can't see how you'd plan to kill that many without false positives and negative media.
* I'm speaking only for myself but I cannot understand what these three services have in common, nor how at least 2/3 of them (Wordpress, static HTML) could seem anything other than completely normal.
* How or why were customers not notified? I have used services before where if something seemed dodgy they would proactively reach out and say 'tell us if it's legit or in 24 hours it will be shut down', or for something truly bad, eg massive CPU usage affecting other services, they'd kill it right away but would _tell you_. Invisible SIGTERMs to random containers that we find out about the hard way seem the exact opposite of sensible handling of supposedly questionable clients.
by vintagedave
2/11/2026 at 7:58:43 PM
We have more info coming soon, but I think the best way to frame this is actually to work backwards and then explain how it impacted yours and other services.
So Railway (and other cloud providers) deal with fraud near constantly. The internet is a bad and scary place, and we spend maybe a third to half of our total engineering cycles just on fraud/uptime-related work. I don't wanna give any credit to anyone from script kiddies to hostile nation states, but we (and others) are under near-constant bombardment from crap workloads in the form of traffic, or not-great CPU cycles, or sometimes, more benignly, movie pirating.
Most cloud providers understandably don't like talking about it because, ironically, the more they talk about it, the bigger the kick the bad actors get from seeing the chaos their work causes. Begin the vicious cycle...
This hopefully answers:
> If 3% of your services were affected, does that match your expected fraud rate? That is an awful lot of customers to take down in one go, and you'd want to be very accurate in your modeling. I can't see how you'd plan to kill that many without false positives and negative media.
In our 5-year history, this is the third abuse-related major outage. One was a nation-state DDoS, one was a coordinated denial. This is the first one where it was a false positive taking down services automatically. We tune it constantly, so it's not really an issue, except when it is.
So, with that background: we tune our boxes of, let's say, "performance" rules constantly. When we see bad workloads, or bad traffic, we have automated systems that "discourage" that use entirely.
We updated those rules because we detected a new pattern, and rolling that update out is when we nailed the legit users. Since this went through the abuse path, it didn't show on your dash, hence the immediate gaslighting.
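To give a toy picture of what a "fingerprint plus rule" false positive can look like (purely illustrative Python, not our actual model or rules):

    # Toy illustration only. A "fingerprint" here is a coarse feature vector of
    # a workload; a rule flags anything whose shape matches a known-bad one.
    def fingerprint(workload: dict) -> tuple:
        # Bucket continuous metrics so similar workloads collapse to one shape.
        return (
            workload["requests_per_min"] // 100,    # traffic volume bucket
            workload["cpu_pct"] // 10,              # CPU usage bucket
            workload["distinct_egress_hosts"] // 5, # egress fan-out bucket
        )

    KNOWN_BAD = {fingerprint({"requests_per_min": 40, "cpu_pct": 25, "distinct_egress_hosts": 3})}

    def flagged(workload: dict) -> bool:
        return fingerprint(workload) in KNOWN_BAD

    # A quiet blog can land in the same buckets as the "bad" shape above.
    blog = {"requests_per_min": 70, "cpu_pct": 20, "distinct_egress_hosts": 2}
    print(flagged(blog))  # True, even though the blog is legitimate

The point is just that coarse workload "shapes" can collide, which is also why similarly structured projects tend to get swept up together.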
Which leads to the other question:
> How or why were customers not notified? I have used services before where if something seemed dodgy they would proactively reach out and say 'tell us if it's legit or in 24 hours it will be shut down' or for something truly bad, eg massive CPU usage affecting other services, they'd kill it right away but would _tell you_.
We don't want to tell fraudulent customers whether they are effective or not. In this instance, it was a straight-up logic bug in the heuristics match. But we have handled abuse this way for our entire existence: black-holing illegitimate traffic, for example, then banning. We did this because some coordinated actors will deploy, get banned with a "reason", and then come back on backup accounts once they've found that whatever they were doing was working. If you knew where to look, sometimes they will brag about it on their IRCs/Discords.
Candidly, we don't want to be transparent about this, but with user impact like this, transparency is the least we can do. Zooming out, macro-wise, this is why Discord and other services are leaning towards ID verification... and it's hard for people on the non-service-provider side to appreciate the level of garbage out there on the internet. That said, that is an excuse: we shovel that garbage so that you can do your job, and if we stop you from doing it, then that's on us, which we own and will hopefully do better about.
That said, you and others are understandably miffed (understatement); all we can do is rebuild trust through our actions.
by ndneighbor
2/12/2026 at 8:20:32 AM
I appreciate this kind of reply. I think you're well on the way to rebuilding trust (with me) by communicating this, and thank you.
by vintagedave
2/11/2026 at 6:11:15 PM
Many questions on their forum are similar to our situation. People wondering if they should restart their containers to get things working again. Worried about whether they should do anything, whether they risk losing data if they do, or whether they should just give everything more time.
by iJohnDoe
2/11/2026 at 7:37:15 PM
Lots of concerns about doing a Restart or Redeploy, since a lot of people are still offline 4+ hours later.
Since there haven't been any responses on the official support forum, maybe this will help someone.
I did a backup of our deployment first and did a Restart (not a Redeploy). Our service came back up thankfully.
Obviously do your own safety check about persistent volumes and databases first.
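If it helps, the safety check can be as simple as this rough sketch (it assumes a Postgres database with pg_dump installed locally and DATABASE_URL set in the environment; adapt it to whatever datastore you actually run):

    # Dump the database before touching Restart/Redeploy.
    import os
    import subprocess
    from datetime import datetime, timezone

    db_url = os.environ["DATABASE_URL"]
    backup_file = f"backup-{datetime.now(timezone.utc):%Y%m%dT%H%M%SZ}.dump"

    # --format=custom produces a compressed archive that pg_restore can replay.
    subprocess.run(
        ["pg_dump", "--format=custom", f"--file={backup_file}", db_url],
        check=True,
    )
    print(f"Wrote {backup_file}; only then try a Restart.")

pg_restore can replay that dump if the Restart goes sideways.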
by iJohnDoe
2/11/2026 at 6:02:57 PM
Second complete outage on Railway in 2 months for us (there was also a total outage on December 16th), plus stuck builds and other minor issues in the months before that.
Looking to move. It's a bit of a hassle to set up Coolify and Hetzner, but I have lost all trust.
by port3000