alt.hn

3/6/2026 at 2:17:01 PM

I dropped our production database and now pay 10% more for AWS

https://alexeyondata.substack.com/p/how-i-dropped-our-production-database

by dsr12

3/6/2026 at 3:36:00 PM

"Instead of going through the plan manually, I let Claude Code run terraform plan and then terraform apply".

Doesn't matter if it was you or the bot running terraform, the whole point of a two-step process is to confirm the plan looks right before executing the apply. Looking at the plan after the apply is already running is insane.
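
A minimal sketch of keeping a review gate between the two steps, in Python. It assumes the plan was saved with `terraform plan -out=tfplan` and exported with `terraform show -json tfplan`; the helper name `destructive_changes` and the sample resources are hypothetical, not taken from the article.

```python
import json


def destructive_changes(plan: dict) -> list:
    """Return addresses of resources the plan would delete or replace."""
    flagged = []
    for rc in plan.get("resource_changes", []):
        actions = rc.get("change", {}).get("actions", [])
        # A replacement shows up as a delete paired with a create.
        if "delete" in actions:
            flagged.append(rc["address"])
    return flagged


# Hypothetical excerpt of `terraform show -json tfplan` output.
sample_plan = json.loads("""
{
  "resource_changes": [
    {"address": "aws_db_instance.main", "change": {"actions": ["delete"]}},
    {"address": "aws_s3_bucket.assets", "change": {"actions": ["update"]}},
    {"address": "aws_instance.bastion", "change": {"actions": ["delete", "create"]}}
  ]
}
""")

flagged = destructive_changes(sample_plan)
if flagged:
    print("Refusing to apply; review these deletions first:")
    for address in flagged:
        print("  " + address)
```

A check like this only narrows what a human has to read; it does not replace reading the plan before running apply.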

by grizmaldi

3/6/2026 at 3:40:53 PM

Shoot first and ask questions later! Measure nonce and cut thrice!

by bombcar

3/6/2026 at 4:45:28 PM

Surely more and harder leetcode interviews will prevent this from happening

by hnthrow0287345

3/6/2026 at 4:24:05 PM

More like 'Shoot yourself first and then complain about it later!'

by Eddy_Viscosity2

3/6/2026 at 4:12:25 PM

Vibe SRE-ing.

by esafak

3/6/2026 at 4:29:18 PM

I mean, it would be nice if the Claude and Codex CLIs had a setting to default to plan mode. Every now and then I’m trying to put together a plan, only to realize that it’s not in plan mode and is already making changes.

by DrJokepu

3/6/2026 at 5:28:27 PM

You should not, under any circumstances, let an LLM touch the Terraform CLI. It's completely irresponsible to give an error-prone system like an LLM that kind of access.

by bigstrat2003

3/6/2026 at 7:28:14 PM

This is what I can't get over - who in their right mind would _ever_ give an agent enough access to delete prod data?

by colpabar

3/6/2026 at 7:56:33 PM

Someone who should be immediately fired.

by bigfatkitten

3/6/2026 at 7:30:14 PM

This is the purpose of sandbox environments.

by EdNutting

3/6/2026 at 4:31:15 PM

What about

  ~/.claude/settings.json
  {"permissions": {"defaultMode": "plan"}}

by abirch

3/6/2026 at 4:30:23 PM

Claude at least does: add "permissions": { "defaultMode": "plan" } to your settings.json.

I'll note this only applies to new sessions though – if you do /clear and start working on something else it doesn't re-apply plan mode (I kind of wish it did)
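
A rough sketch of merging that key into an existing settings file without clobbering whatever else is in it. The function name `enable_plan_mode` is hypothetical, and the demo writes to a temporary file rather than the real `~/.claude/settings.json`; the key and value are the ones quoted in this thread.

```python
import json
import tempfile
from pathlib import Path


def enable_plan_mode(settings_path: Path) -> dict:
    """Set permissions.defaultMode to "plan", keeping other settings intact."""
    settings = {}
    if settings_path.exists():
        settings = json.loads(settings_path.read_text())
    settings.setdefault("permissions", {})["defaultMode"] = "plan"
    settings_path.parent.mkdir(parents=True, exist_ok=True)
    settings_path.write_text(json.dumps(settings, indent=2) + "\n")
    return settings


# Demo against a throwaway file; "example" is placeholder content, not a real setting.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "settings.json"
    demo.write_text('{"example": true}')
    print(enable_plan_mode(demo))
```

To apply it for real, one would pass `Path.home() / ".claude" / "settings.json"` instead of the temporary path.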

by sobjornstad

3/8/2026 at 2:55:05 PM

I mean that sentence is basically your RCA

by Sathwickp

3/6/2026 at 3:02:39 PM

I think people will be quick to engage with the "ai is risky" angle, but the thing that jumps out to me is that you were working against a production state in the first place.

The agent made a mistake that plenty of humans have made. A separate staging environment on real infrastructure goes a long way. Test and document your command sequence / rollout plan there before running it against production. Especially for any project with meaningful data or users.

by oneneptune

3/6/2026 at 6:12:53 PM

AI is like explosions.

There’s a lot of other ways to die, but that one is the most exciting.

by paulddraper

3/6/2026 at 3:25:31 PM

Yeah, even without the AI in the loop, testing a major migration in production is madness. With AI (or a freshly-minted intern) in the loop, it's complete insanity.

Test against staging, produce a script, make the most experienced human review and execute said script against production.

by swiftcoder

3/6/2026 at 4:32:57 PM

Even though a lot of what people do with agents is reckless, they often build their own guillotine in the process too.

Problem #1: He decided to shoehorn two projects into 1 even though Claude told him not to.

Problem #2: Claude started creating a bunch of unnecessary resources because another archive was unpacked. Despite his "terror", the author let Claude continue instead of investigating.

Problem #3: He approved "terraform destroy" which obviously nukes the DB! It's clear he didn't understand, and he didn't even have a backup!

> That looked logical: if Terraform created the resources, Terraform should remove them. So I didn’t stop the agent from running terraform destroy

by fny

3/6/2026 at 4:41:55 PM

> Problem #3: He approved "terraform destroy" which obviously nukes the DB! It's clear he didn't understand

The biggest danger of agents is that the agent is just as willing to take action in areas where the human supervisor is unqualified to supervise it as in those where it isn't, which is exacerbated by the fact that relying on agents to do work [0] reduces learning of new skills.

[0] "to do work" here is in large part to distinguish use that focuses on the careful, disciplined use of agents as a tool to aid learning, which involves a different pattern of use. I am not sure how well anyone actually sticks to it, but at least in principle it could have the opposite effect on learning of trust-the-agent-and-go vibe engineering.

by dragonwriter

3/6/2026 at 7:36:40 PM

His backup plan prior to the event had large obvious issues.

His backup plan after the fact seems suspicious as well because he is making it much harder than it has to be.

Between that and a glance at the home page, it feels like someone doing AI vibe work who is not comfortable in the space they are working.

Who is the intended audience? Other vibe coders? I just think it's weird that, given his backup solution, he likely asked the AI to create it. Whatever hot-wash he did for this event was invalidated.

by twentyfiveoh1

3/6/2026 at 3:24:56 PM

One of the biggest things I am learning from these stories (tangentially, the Wikipedia story too) is to have backups outside of your own infrastructure, preferably with snapshots at a 15-minute recovery time when possible.

For context, it was 2.5 years of data. I can only imagine the nightmare if things had turned out even a tiny bit worse for you. The nightmare it would have been if the snapshot of the production database hadn't been found, even with AWS business support.

> I was overly reliant on my Claude Code agent, which accidentally wiped all production infrastructure for the DataTalks.Club course management platform that stored data for 2.5 years of all submissions: homework, projects, leaderboard entries, for every course run through the platform.

by Imustaskforhelp

3/6/2026 at 3:36:15 PM

Wise, but for those with large databases, factoring in the price of egress is important, as it gets pricey ($0.09 for each GB over 100GB, and that 100GB free tier is spread across your entire AWS workload).
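
As back-of-the-envelope arithmetic, using only the two numbers quoted above. This is a sketch that assumes a flat per-GB rate (actual AWS pricing is tiered and varies by region); the function name and the 500 GB figure are made up for illustration.

```python
FREE_TIER_GB = 100   # monthly free allowance quoted above
RATE_PER_GB = 0.09   # USD per GB beyond the allowance, as quoted above


def egress_cost(backup_gb: float, free_gb_left: float = FREE_TIER_GB) -> float:
    """Estimate the USD cost of copying backup_gb out of AWS in one month."""
    billable = max(0.0, backup_gb - free_gb_left)
    return round(billable * RATE_PER_GB, 2)


print(egress_cost(500))      # 36.0  (400 billable GB)
print(egress_cost(500, 0))   # 45.0  (free tier already used by other workloads)
print(egress_cost(50))       # 0.0   (fits inside the free tier)
```

The second call is the case the comment warns about: once other traffic has eaten the shared allowance, every backup GB is billable.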

by bdcravens

3/6/2026 at 4:11:56 PM

I understand this, but maybe it's my personal preference: I prefer working with services that charge little or nothing for egress. Think OVH, netcup, Scaleway, BuyVM, UpCloud, Hetzner, and many others.

So to me, it's a non-issue. But I definitely understand your point: if someone's locked into AWS then egress can be brutal, but to me that's even more reason not to use AWS. (Also, the services I have listed are usually more cost-effective, most of these companies have decent human support, and most provide decent SLA guarantees as well. Most importantly, I would love to support non-AWS/GCP/Azure clouds and wish for a less centralized internet anyway.)

So it's actually a win-win for me to not have to worry about egress costs.

by Imustaskforhelp

3/6/2026 at 8:08:06 PM

Of course. We are also evaluating our own cloud exit strategy. The original article was about and by an org on AWS, so I was going for an apples-to-apples analysis.

by bdcravens

3/7/2026 at 5:41:23 AM

[dead]

by bmd1905

3/6/2026 at 4:36:04 PM

How many users does this website have? It must be relatively tiny.

Why the hell is this anywhere near AWS, or Terraform, or any other PaaS nonsense? I'd wager this thing could be run off a $5 VPS with 30 minutes of setup.

by wackget

3/6/2026 at 4:42:56 PM

Overengineering won. That's not going to go well when paired with the new best practice of not actually learning your tools because the AI will take care of it.

by gdulli

3/6/2026 at 5:27:38 PM

You do it once and then you no longer work there.

by siim_osur

3/6/2026 at 4:41:03 PM

Isn't it a good way to play with all these in a relatively safe way, but still with nonzero stakes?

by nine_k

3/6/2026 at 4:37:07 PM

could have been on sqlite -- backups in s3 or equivalent object storage

but let's over-engineer

by dzonga

3/7/2026 at 4:03:05 PM

terraform is 100% overengineering for some hobby projects

by dmix

3/6/2026 at 6:14:09 PM

So…the less crucial the system, the closer to the metal?

by paulddraper

3/6/2026 at 3:46:52 PM

The author is extremely lucky that support was able to find a snapshot for him after he deleted them all. I worked for AWS for many years and was a customer for years before that, and they were almost never able to recover deleted customer data. And this is on purpose: when a customer asks AWS to delete data, they want to assure the customer that it is, in fact, gone. That’s a security promise.

So the fact that they were able to do it for the author is both winning the lottery and frankly a little concerning.

What bothers me more is that the Terraform provider is deleting snapshots that are related to, but not, the database resource itself. Once a snapshot is made, that’s supposed to be decoupled from the database for infrastructure management purposes. That needs to be addressed IMO.

UPDATE: deleting previous automated snapshots on database instance or cluster deletion is default behavior in RDS; that’s not the TF provider’s fault. However, default RDS behavior on deletion is to create a final snapshot of the DB. Makes me wonder if that’s what support helped the author recover. If so, the author didn’t technically need support other than to help locate that snapshot.

And yes this is an object lesson of why human-in-the-loop is still very much needed to check the work of agents that can perform destructive actions.

by otterley

3/6/2026 at 4:49:06 PM

Having customers delete all their data by mistake and then trying to recover it happens more often than you think. It has become common practice to soft delete at first. Usually 30 days later a hard delete is performed.
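
A toy illustration of that soft-delete-then-purge pattern, assuming an in-memory store. The `Record` and `Store` names are hypothetical and say nothing about how AWS or any other provider actually implements it.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

RETENTION = timedelta(days=30)  # grace period before the hard delete


@dataclass
class Record:
    name: str
    deleted_at: Optional[datetime] = None  # None means the record is live


class Store:
    """Toy in-memory store; a real service would persist deleted_at."""

    def __init__(self):
        self.records = {}

    def add(self, name):
        self.records[name] = Record(name)

    def soft_delete(self, name, now):
        # Hide the record from users but keep the data recoverable.
        self.records[name].deleted_at = now

    def restore(self, name):
        self.records[name].deleted_at = None

    def purge(self, now):
        """Hard-delete records whose grace period has expired."""
        expired = [
            name for name, rec in self.records.items()
            if rec.deleted_at is not None and now - rec.deleted_at >= RETENTION
        ]
        for name in expired:
            del self.records[name]
        return expired


store = Store()
store.add("prod-db-snapshot")
t0 = datetime(2026, 3, 6, tzinfo=timezone.utc)
store.soft_delete("prod-db-snapshot", t0)
print(store.purge(t0 + timedelta(days=29)))  # [] (still recoverable)
print(store.purge(t0 + timedelta(days=30)))  # ['prod-db-snapshot']
```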

by yibers

3/6/2026 at 5:05:32 PM

Oh, I know it happens. Over the years AWS has added functionality across various services to help prevent accidental deletion, but absent some documented behavior to the contrary, when a customer confirms that data is to be deleted, AWS is supposed to make that data completely inaccessible by anyone, including AWS themselves.

I updated my comment above because I have a theory as to what really happened here, and it doesn’t involve support recovering deleted snapshots.

by otterley

3/6/2026 at 3:34:05 PM

Props for sharing this!

> Claude was trying to talk me out of it, saying I should keep it separate, but I wanted to save a bit because I have this setup where everything is inside a Virtual Private Cloud (VPC) with all resources in a private network, a bastion for hosting machines

I will admit that I've also ignored Claude's very good suggestions in the past and it has bitten me in the butt.

Ultimately, with great automation comes a greater risk of doing the worst thing possible even faster.

Just thinking about this specific problem makes me more keen to recommend that people have backups and their production data on two different access keys for terraform setups.

I'm not sure how difficult that is; I haven't touched Terraform in about 7 years now. Wow, how time flies.

by kami23

3/6/2026 at 5:35:15 PM

Bit of a story of negligence, ignorance, and laziness. I can't say I have much sympathy. There were multiple steps where they could have intervened and chose not to.

Good story of what not to do though

by nusl

3/6/2026 at 4:03:01 PM

You should never let Claude manage data in this way. You should, if anything, have Claude come up with a plan that you manually execute. I get why you would go this path, but it's pure laziness, and in any normal environment where you weren't the owner you would be terminated and potentially sued for negligence.

by sealthedeal

3/6/2026 at 3:19:18 PM

Oh, the missing Terraform state file.

I haven't used Terraform in anger, but when I experimented with it I was scared about the scenario that happened to the original poster.

I thought "it's a footgun but sure I will not execute commands blindly like that", but in the world of clankers seems like this can happen easily.

by yomismoaqui

3/6/2026 at 4:08:39 PM

In a previous job we used Terraform pretty heavily. I never got good at it, because it felt confusing, dangerous, and unnecessarily complicated for our use. More than once we saw that Terraform wanted to delete critical, stateful resources.

I get that the state file is probably some form of optimization, but it seems like a fairly broken concept. A friend of mine still uses Terraform daily, and it's probably weekly that he encounters Terraform wanting to do stupid shit.

Honestly if I never have to use Terraform ever again, I'd be pretty happy.

by mrweasel

3/6/2026 at 6:54:12 PM

ive only used cloudformation, but things like deletion protection, and the hug of death are quite nice to have for making things feel safer.

at least with my organization of a separate stack for {network, data, and compute}

cloudformation would refuse to just delete the data base until you first tore down the api that uses it, and while that would still make an outage, you dont lose data before knowing something is wrong.

by 8note

3/6/2026 at 4:58:40 PM

First, letting the agent do everything was wrong. But why then continue to use the agent to analyze the problem? That would have been the time to stop using Claude.

And why use an agent at all? For some IaC terraform runs?

What is the problem nowadays that people prefer non-deterministic actions from an agent over the very deterministic CLI invocations needed?

I guess these people don’t deserve better. Darwin Award winners.

by ugiox

3/6/2026 at 6:42:36 PM

Again the same crying dev baby that did not make backups, blaming AI on the issue. Idiocracy is happening right before our eyes.

by HackerThemAll

3/6/2026 at 3:54:17 PM

I've used Claude and AWS CDK to build infra code during the past year. It is a great help, but it is not to be trusted. I would not even consider it for Ralph Wiggum Loop style iteration, let alone allow it to run `cdk deploy` or `cdk destroy`. It can generate decent-looking constructs, but it comes up with values for you, like serverlessV2MinCapacity, or sometimes it creates resources I don't need. It can end up costing a lot if you then deploy something you didn't expect to.

Since running destroy and deploy also takes a long time, gets stuck, throws weird errors etc, one still needs to read the docs for many things and understand the constructs it outputs.

by Ciantic

3/6/2026 at 6:47:46 PM

ive had it write some good cdk, but only as a one off project. havent tried any maintenance, but the deployment of infrastructure should also go through CI/CD, so the only thing i could destroy is a local playground

i did have to fight it to build the right thing - it wanted to spend something like $100/month but what i had in mind should have been <1, and i eventually got it there.

something i found handy prompt wise was to keep asking claude to predict the monthly cost after builds

by 8note

3/6/2026 at 4:22:06 PM

No consequences for Claude, only consequences for the human who put their faith in it.

by nozzlegear

3/7/2026 at 3:55:25 AM

There are so many mistakes being made here:

- Not using remote state management (setting up an S3 backend is easy and you're already in AWS!)

- Allowing an AI agent to execute against your production environment (especially with no guardrails)

- Not confirming the plan (which I _could_ excuse if one's pipeline is mature enough)

- Not confirming the resources Claude identified automatically before letting it delete things

- Combining 2 projects into the same state.

These mistakes are so horribly egregious that I feel second-hand embarrassment.

by x3n0ph3n3

3/6/2026 at 4:15:21 PM

There will probably be some yolo startups that deploy write-only code to production with unreviewed terraform plans -- who knows this could be disruptive -- but I'm also certain this won't be the last such story.

---

All that being said: it's kind of sad because terraform is fairly declarative and the plans are fairly high-level.

Hence, terraform files and plans are the stuff you should review.

Whereas a bunch of imperative code implementing CRUD with a fancy UI might be the kind of spaghetti code that's hard to review.

by jopsen

3/6/2026 at 3:32:37 PM

If you delete all your backups, AWS maintains shadow backups they can restore? Is that right?

by jpalmer

3/6/2026 at 3:41:18 PM

You have to contact the NSA and ask for the shadow backup ;)

by bombcar

3/6/2026 at 4:30:14 PM

Don’t count on it. I’m more than a little surprised that support was able to do this, and if I still worked at AWS I’d be raising questions internally.

by otterley

3/6/2026 at 3:34:06 PM

You should always use Object Lock with compliance mode on your S3 backups. Always.

by testplzignore

3/6/2026 at 6:36:20 PM

I'm amazed at how some people are willing to tell the world about making incredibly stupid mistakes like this. The user he was using should NOT have had delete permissions.
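
A sketch of what withholding those permissions could look like as an explicit IAM deny statement, built here as a Python dict. It assumes the database lives on RDS, as discussed elsewhere in this thread; the `Sid` and the function name are hypothetical, and the action list is illustrative rather than exhaustive.

```python
import json

# RDS actions that remove databases or their snapshots.
DESTRUCTIVE_RDS_ACTIONS = [
    "rds:DeleteDBInstance",
    "rds:DeleteDBCluster",
    "rds:DeleteDBSnapshot",
    "rds:DeleteDBClusterSnapshot",
]


def deny_delete_policy(actions=DESTRUCTIVE_RDS_ACTIONS) -> dict:
    """Build an IAM policy document that explicitly denies the given actions."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyDatabaseDeletion",  # hypothetical label
                "Effect": "Deny",
                "Action": list(actions),
                "Resource": "*",
            }
        ],
    }


print(json.dumps(deny_delete_policy(), indent=2))
```

An explicit deny overrides any allow attached to the same user or role, so the credentials handed to an agent could keep broad read and update rights while still being unable to drop the database.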

by UltraSane

3/6/2026 at 4:55:03 PM

I'm cool with blogging about your fuck-ups, but honestly, not really. Is "I'm incompetent" a good content strategy? Your product is a thousand bucks a year. I'm not going near it. But that's just me?

by 01284a7e

3/6/2026 at 4:53:53 PM

Back in my day, we didn't need AI to accidentally drop production databases.

by jvolkman

3/6/2026 at 11:53:28 PM

Also why regular backups are necessary. Glad they helped in this case.

With great power…

by etothet

3/6/2026 at 4:11:06 PM

this seems like a lot of moving parts for 2M rows ....

by gorfian_robot

3/6/2026 at 6:01:40 PM

[dead]

by octoclaw

3/6/2026 at 3:51:39 PM

Everyone here firing shots at this guy should try holding their tongues.

You/we are all susceptible to this sort of thing, and I call BS on anyone who says they check every little thing their agent does with the same level of scrutiny as they would if they were doing it manually.

by gneray

3/6/2026 at 3:57:19 PM

Most of us are not using agents to deploy infra to production to begin with?

by Capricorn2481

3/6/2026 at 5:42:47 PM

I'm not susceptible to it because I am not foolish or lazy enough to give the clanker access to my command line. Anyone who does that is begging for trouble and I'm not gonna have much sympathy when they get bitten.

by bigstrat2003

3/6/2026 at 4:01:12 PM

Everyone, even the people who saw the inevitability of this and didn't succumb to offloading their thinking to agents?

They don't even deserve a lot of credit because of how obvious consequences like these would be.

by add-sub-mul-div

3/6/2026 at 4:43:03 PM

> You/we are all susceptible to this sort of thing, and I call BS on anyone who says they check every little thing their agent does with the same level of scrutiny as they would if they were doing it manually.

Why? I do that. I give it broad permissions but I also give it very specific instructions, especially when it's about deleting resources. I work in small chunks and review before committing, and I push before starting another iteration (so that if something goes wrong, I have a good state I can easily restore).

I'm the one with the brain. The LLM can regurgitate a ton of plumbing and save days of sifting through libraries, but it'll still get something wrong because at the core it's still a probabilistic output generator. No matter how good it becomes, it still cannot judge whether it's doing something a human will immediately spot as "stupid".

Interacting with and fixing API calls automatically is something that normally works for me, but allowing the agent to run a terraform destroy is something I'd have never let it execute, I'd have been very specific about that.

by plqbfbv

3/6/2026 at 3:59:04 PM

This is satire, right? The real lesson we learned is to actually learn how your infrastructure works and not blindly run destructive commands in prod, AI or otherwise, right?

by stryan

3/6/2026 at 4:17:38 PM

SRE here. Why would you let your AI run "tofu plan" for you vs. doing it on your own?

This is an example of someone who has let AI do too much of their "thinking" for them, and it's led to brain rot.

by stackskipton

3/6/2026 at 4:32:52 PM

Having the agent autonomously perform the plan stage is fine; that’s not destructive. It’s the blind application stage without human validation or other safety checks that is the problem.

by otterley

3/6/2026 at 5:09:47 PM

I mean, apply is not destructive without human in the loop if you don't pass in -auto-approve.

In any case, I think spending a few seconds typing into your terminal and getting yourself into human review mode is a mindset improvement, even if it's not 100% speed-optimal.

by stackskipton

3/6/2026 at 5:14:51 PM

Agents are perfectly capable of responding to confirmation prompts. The auto approve flag requirement won’t stop a determined agent if it concludes that’s what the principal desires.

by otterley