Verified Spec-Driven Development (VSDD)

2/28/2026 at 5:52:27 PM

Everything in this post stems from the assumption that you already know what you're doing, which is probably true for things you've built before. But I hope we can agree that you can't spec out something you have no clue how to build, let alone write the tests before you've even explored the boundaries of the problem space. That's completely unreasonable.

My second point is that this approach is fundamentally wrong for AI-first development. If the cost of writing code is approaching zero, there's no point investing resources to perfect a system in one shot. What matters more is how fast you can explore the edges. You can now spin up five agents to implement five different versions of the thing you're building and simply pick the best one.

In our shop, we have hundreds of agents working on various problems at any given time. Most of the code gets discarded. What we accept to merge are the good parts.

by _pdp_

2/28/2026 at 6:49:14 PM

Nothing of what you write here matches my experience with AI.

Specification is worth writing (and spending a lot more time on than implementation) because it's the part that you can still control, fully read, understand etc. Once it gets into the code, reviewing it will be a lot harder, and if you insist on reviewing everything it'll slow things down to your speed.

> If the cost of writing code is approaching zero, there's no point investing resources to perfect a system in one shot.

THe AI won't get the perfect system in one shot, far from it! And especially not from sloppy initial requirements that leave a lot of edge (or not-so-edge) cases unadressed. But if you have a good requirement to start with, you have a chance to correct the AI, keep it on track; you have something to go back to and ask other AI, "is this implementation conforming to the spec or did it miss things?"

> five different versions of the thing you're building and simply pick the best one.

Problem is, what if the best one is still not good enough? Then what? You do 50? They might all be bad. You need a way to iterate to convergence

by virgilp

3/1/2026 at 7:12:56 AM

Same, I've sorta ended up converging on make a rough plan, get second and third opinions from various AI's on it, sort of decide and make choices while shaping the plan, which we turn into a detailed specsheet. Then follow the 'how to design programs' method which is mostly writing documentation first, then expected outcomes, then tests, then the functions, then test the flow of the pipeline. This usually looks like starting with Claude to write the documentation, expectations and create the scaffolding, then having Gemini write the tests and the code, then have codex try to run the pipeline and fix anything it finds that is broken along the way. I've found this to work fairly well, it's looser than waterfall, but waterfall-ish, but it's also sort of TDD-ish, and knowing that there will be failures and things to fix, but it also sort of knows the overall strategy and flow of how things will work before we start.

by michaelbrave

2/28/2026 at 9:59:00 PM

This. Waterfall never worked for a reason. Humans and agents both need to develop a first draft, then re-evaluate with the lessons learned and the structure that has evolved. It’s very very time consuming to plan a complex, working system up front. NASA has done it, for the moon landing. But we don’t have those resources, so we plan, build, evaluate, and repeat.

by manmal

2/28/2026 at 10:14:41 PM

That "first draft" still has to start with a spec. Your only real choice is whether the spec is an actual part of project documentation with a human in the loop, or it's improvised on the spot within the AI's hidden thinking tokens. One of these choices is preferable to the other.

by zozbot234

3/1/2026 at 8:39:07 AM

I agree, and personally I often start with a spec. However, I haven’t found it useful to make this very detailed. The best ROI I‘ve been getting is closing the loop as tightly as possible before starting work, through very elaborate test harnesses and invariants that help keeping the implementation simple.

I‘d rather spend 50% of my time on test setup, than 20% on a spec that will not work.

by manmal

2/28/2026 at 10:59:39 PM

So, rollback and try again with the insight.

AI makes it cheap to implement complex first drafts and iterations.

I'm building a CRM system for my business; first time it took about 2 weeks to get a working prototype. V4 from scratch took about 5 hours.

by ErrantX

2/28/2026 at 11:10:19 PM

AI is also excellent at reverse engineering specs from existing code, so you can also ask it to reflect simple iterative changes to the code back into the spec, and use that to guide further development. That doesn't have much of an equivalent in the old Waterfall.

by zozbot234

3/1/2026 at 8:42:31 AM

Yeah, if done right. In my experience, such a reimplementation is often lossy, if tests don’t enforce presence of all features and nonfunctional requirements. Maybe the primary value of the early versions is building up the test system, allowing an ideal implementation with that in place.

Or put this way: We’re brute forcing (nicer term: evolutionizing) the codebase to have a better structure. Evolutionary pressure (tests) needs to exist, so things move in a better direction.

by manmal

3/1/2026 at 1:17:01 PM

What matters ultimately is the system achieves your goals. The clearer you can be about that the less the implementation detail actually matters.

For example; do you care if the UI has a purple theme or a blue one? Or if it's React or Vur. If you do that's part of your goals, if not it doesn't entirely matter if V1 is Blue and React, but V4 ends up Purple and Vue.

by ErrantX

3/1/2026 at 1:57:50 AM

are you intentionally being vague here becuase it's a HN comment and you can't be arsed going into detail?

or do you literally type

> Look at the git repo that took us 2 weeks, re-do it in another fresh repo... do better this time.

I think you don't and that your response is intentional misdirection to pointlessly argue against the planning artifact approach.

by NamlchakKhandro

3/1/2026 at 7:32:20 AM

> NASA has done it, for the moon landing.

Which one? The one in 1960s or the one which has just been delayed - again?

I think you can just as well develop a first spec and iterate on than coding up a solution, important is exploration and iteration - in this specific case.

by Towaway69

3/1/2026 at 8:36:38 AM

Iterating on paper in my experience never captures the full complexity that is iteratively created by the new constraints of the code as it‘s being written.

by manmal

3/1/2026 at 3:14:10 AM

> Waterfall never worked for a reason

We're going to need some evidence for this claim. I feel like nearly 70 years of NASA has something to say about this.

by virgil_disgr4ce

3/1/2026 at 8:47:40 AM

While writing the comment, I did think to myself, that NASA did a ton of prototypes to de-risk. They simulated the landing as close as they could possibly make it, on earth. So, probably not pure waterfall either. Maybe my comment was a bit too brusque in that regard.

by manmal

3/1/2026 at 6:58:37 AM

It does say - you will never have the time and resources of NASA

by blabla1224

3/1/2026 at 2:28:15 PM

"Waterfall" was primarily a strawman that the agile salesman made up. Sure, it existed it some form but was not widely practiced.

by osigurdson

3/1/2026 at 12:25:21 PM

You claim to disagreeing with OP but you seem to be describing basically the same core loop of planning and execution.

Doing OODA faster has always been the key thing to creating high quality outcomes.

by __alexs

3/1/2026 at 7:52:03 PM

No, OP literally claims "you can't spec out something you have no clue how to build"; I claim that on the contrary, you absolutely can - you don't need to know "how to build" but you need to clarify what you want to build. You can't ask AI to build something (and actually obtain a good "something") until you can say exactly what the said "something" is.

You iterate, yes - sometimes because the AI gets it wrong; and sometimes because you got it wrong (or didn't say exactly what you wanted, and AI assumed you wanted something else). But the less specific and clear you are in your requirements, the less likely it is you'll actually get what you want. With you not being specific in the requirements, it only really works if you want something that lots of people are building/have built before, because that will allow the AI to make correct assumptions about what to build.

by virgilp

3/1/2026 at 3:07:48 AM

>THe AI won't get the perfect system in one shot, far from it! And especially not from sloppy initial requirements that leave a lot of edge (or not-so-edge) cases unadressed. But if you have a good requirement to start with, you have a chance to correct the AI, keep it on track; you have something to go back to and ask other AI, "is this implementation conforming to the spec or did it miss things?"

This is an antiquated way of thinking. If you ramp up the number of agents you're using the auto-correcting and reviewing behavior kicks in which makes for much less human intervention until the final code review.

by nojito

3/1/2026 at 6:57:52 AM

Yes, but what about the "spec-review"? Isn't that even more important? Is the system doing what we (and its users) need it to be doing?

by galaxyLogic

3/1/2026 at 3:16:06 AM

> You can now spin up five agents to implement five different versions of the thing you're building and simply pick the best one.

Or you end up with five different mediocre solutions where the best parts are randomly distributed amongst all five.

by petersumskas

2/28/2026 at 8:30:11 PM

There’s a real tension here.

If you are vibe-coding, this approach is definitely going to kill you buzz and lose all the rapid iteration benefits.

But if you are working in an existing large system, vibe coding is hard to bring into the core. So I think something more formal like OP is needed to reap major benefits from AI.

by theptip

2/28/2026 at 10:10:32 PM

This is just AI-written slop, but even if you're vibe coding and want to go for rapid iteration, you still benefit by having the AI write out a broad plan of what it's going to do and looking it over before telling it to implement it. One-shot vibe coding is totally worthless, but the more you're aware of what the AI is thinking about and ready to revise its plans, the better it can potentially do.

by zozbot234

3/1/2026 at 5:39:17 AM

> In our shop, we have hundreds of agents working on various problems at any given time. Most of the code gets discarded. What we accept to merge are the good parts.

What you’ve described is an incredibly expensive and inefficient genetic algorithm with a human review as the fitness function. It’s not the flex you might think it is.

by hdhdhsjsbdh

2/28/2026 at 8:53:14 PM

If the price of code is zero then changing the spec also costs zero in terms of code and. This is what always was the problem with specs before. You'd write one, run it through the prover, write the code, then have to throw out the whole thing because there was a business case you didn't account for.

Now the bottom 98% can be given to a robot with a clear success signal other than 'it looks about right'.

by noosphr

2/28/2026 at 9:13:41 PM

code is orthogonal to spec. you can iterate on the code and iterate on the spec. the spec is not meant to be constant, it's a form of ECC for the artifacts of the coding pipeline.

by baq

3/1/2026 at 6:51:35 AM

Exactly.

Also if you want to gain something by being less specific, eg. not writing code, and then want to be specific in writing a spec, then you just switched a precise system for an imprecise one.

by LunicLynx

2/28/2026 at 6:49:22 PM

Thats why I have AI do a write up about the system I want to build, I then review it all. If it looks good I use it as my prompt.

by giancarlostoro

2/28/2026 at 7:33:53 PM

> But I hope we can agree that you can't spec out something you have no clue how to build

Eh, of course you can. You can specify anything as long as you know what you want it to do. This is like systems engineering 101 and people do it successfully all the time.

by zppln

2/28/2026 at 5:58:21 PM

If you don't mind the question with regard to your second point, couldn't what you've done in your shop be also used here? There's no reason why 'try to develop it five different ways and pick the best parts out of each' is incompatible with the 'VSDD' concept; seems like it could be included?

by DaylitMagic

2/28/2026 at 6:09:16 PM

> you can't spec out something you have no clue how to build

Ideally—and at least somewhat in practice—a specification language is as much a tool for design as it is for correctness. Writing the specification lets you explore the design space of your problem quickly with feedback from the specification language itself, even before you get to implementing anything. A high-level spec lets you pin down which properties of the system actually matter, automatically finds an inconsistencies and forces you to resolve them explicitly. (This is especially important for using AI because an AI model will silently resolve inconsistencies in ways that don't always make sense but are also easy to miss!)

Then, when you do start implementing the system and inevitably find issues you missed, the specification language gives you a clear place to update your design to match your understanding. You get a concrete artifact that captures your understanding of the problem and the solution, and you can use that to keep the overall complexity of the system from getting beyond practical human comprehension.

A key insight is that formal specification absolutely does not have to be a totally up-front tool. If anything, it's a tool that makes iterating on the design of the system easier.

Traditionally, formal specification have been hard to use as design tools partly because of incidental complexity in the spec systems themselves, but mostly because of the overhead needed to not only implement the spec but also maintain a connection between the spec and the implementation. The tools that have been practical outside of specific niches are the ones that solve this connection problem. Type systems are a lightweight sort of formal verification, and the reason they took off more than other approaches is that typechecking automatically maintains the connection between the types and the rest of the code.

LLMs help smooth out the learning curve for using specification languages, and make it much easier to generate and check that implementations match the spec. There are still a lot of rough edges to work out but, to me, this absolutely seems to be the most promising direction for AI-supported system design and development in the future.

by tikhonj

2/28/2026 at 5:58:10 PM

"Most of the code gets discarded." If you don't mind sharing, what's your signal-to-token ratio?

by politician

2/28/2026 at 6:55:36 PM

How do you propose we measure signal? Lines of code is renowned for being a very bad measure of anything, and I really can't come up with anything better.

by kvdveer

2/28/2026 at 8:23:29 PM

The OP said that they kept what they liked and discarded the rest. I think that's a reasonable definition for signal; so, the signal-to-token ratio would be a simple ratio of (tokens committed)/(tokens purchased). You could argue that any tokens spent exploring options or refining things could be signal and I would agree, but that's harder to measure after the fact. We could give them a flat 10x multiplier to capture this part if you want.

by politician

2/28/2026 at 9:56:36 PM

I'm going to call it out as bullshit, you can't dig out "what you like" from "hundreds agents running all the time".

by mirekrusin

2/28/2026 at 10:33:16 PM

One of our projects has 1.2K open pull requests.

https://i.postimg.cc/Jnfk9b8g/Xnapper-2026-02-28-22-25-42.pn...

We probably accept 1-2 per day.

I personally discard code for the tiniest of reasons. If something feels off moments after I open the PR, it gets deleted. The reason we still have 1.2K open PRs is because we can't review all of them in time.

The most likely solution is to delete all of them after a month or two. By that time the open PRs on this project alone will be at least 10-20 more.

by _pdp_

3/1/2026 at 4:42:16 AM

Doesn't seem like too efficient process, no? Seems to me like investment in better quality of the output is exactly what is needed here, wouldn't you agree?

by mirekrusin

3/1/2026 at 7:04:46 AM

I feel they sit of on the opposite end of the OP here. One wants to write out specs to control the agent implementation to achieve a one shot execution. Other side says: let’s won’t waste time of humans writing anything.

I’m personally torn. A lot of the spec talk and now here in combination with TDD etc feels like the pipe dreams of the mid 2000. There was this idea of the Architect role who writes UML and specs. And a normal engineer just fills in the gaps. Then there was TDD. Nothing against it personally. But trying to write code in test first approach when you don’t really have a clue how a specific platform/system/library works had tons of overhead. Also the side effect of code written in the most convenient way to be tested and not to be executed. All in all to throw this ideas together for AI now… But throwing tokens out of the window and hoping for the token lottery to generate the best PR is also not the right direction in my book. But somebody needs to investigate in both extremes I say.

by larusso

3/1/2026 at 10:24:13 AM

Actually, nobody said the spec needs to be written by humans.

My personal opinion: with today's LLMs, the spec should be steered by a human because its quality is proportional to result quality. Human interaction is much cheaper at that stage — it's all natural language that makes sense. Later, reasoning about the code itself will be harder.

In general, any non-trivial, valuable output must be based on some verification loop. A spec is just one way to express verification (natural language — a bit fuzzy, but still counts). Others are typecheckers, tests, and linters (especially when linter rules relate to correctness, not just cosmetics).

Personally, on non-trivial tasks, I see very good results with iterative, interactive, verifiable loops:

- Start with a task

- Write spec in e.g. SPEC.md → "ask question" until answer is "ok"/proceed

- Write implementation PLAN.md — topologically sorted list of steps, possibly with substeps → ask question

- For each step: implement, write tests, verify (step isn't done until tests pass, typecheck passes, etc.); update SPEC/PLAN as needed → ask question

- When done, convert SPEC.md and PLAN.md into PR description (summary) and discard

("Ask question" means an interactive prompt that appears for the user. Each step is gated by this prompt — it holds off further progress, giving you a chance to review and modify the result in small bits you can actually reason about.) The workflow: you accept all changes before confirming the next step. This way you get code deltas that make sense. You can review and understand them, and if something's wrong you can modify by hand (especially renames, which editors like VS Code handle nicely) or prompt for a change. The LLM is instructed to proceed only when the re-asked answer is "ok".

This works with systems like VSCode Copilot, not so much with CC cli.

I'm looking forward to an automated setup where the "human" is replaced by an "LLM judge" — I think you could already design a fairly efficient system like this, but for my work LLMs aren't quite there yet.

That said, there's an aspect that shouldn't be forgotten: this interactive approach keeps you in the driving seat and you know what's happening with the codebase, especially if you're running many of these loops per day. Fully automated solutions leave you outside the picture. You'll quickly get disconnected from what's going on — it'll feel more like a project run by another team where you kind of know what it does on the surface but have no idea how. IMO this is dangerous for long-term, sustainable development.

by mirekrusin

2/28/2026 at 9:47:44 PM

A lot of interesting replies below this comment that I won't be able to respond to individually.

I'll just leave this here:

https://en.wikipedia.org/wiki/P_versus_NP_problem

by _pdp_

2/28/2026 at 10:30:03 PM

That seems barely related and settles nothing? Bottom line is simple, saying "you can't spec out something you have no clue how to build" is saying you cannot desire coldness unless you understand how to build a refrigerator. It's just the difference between what and how. If you don't know the difference between implementation and specifications, just try a whole day of answering "what" and "why" questions with "how" answers and see how it goes.

by robot-wrangler

2/28/2026 at 10:42:46 PM

Writing tests for a known solution (verification) is straightforward. But speccing out and testing something you haven't even figured out how to build yet (discovery) is a fundamentally harder problem.

Try speccing out a flux capacitor. I'll wait.

https://chatbotkit.com/reflections/verification-is-easier-th...

by _pdp_

2/28/2026 at 11:21:25 PM

> Try speccing out a flux capacitor. I'll wait.

One way to spec that is presumably something like "X% more efficient than current best-in-class", "made of Y,Z with no exotic materials", "takes no longer than T days to create" and so on.

Anyway, being "anti" spec isn't even wrong because it's just a completely incoherent position. There's always a spec.. including any informal prompt you kick off your agents with. Call it a "structured prompt" if that soothes you and your agents, then let's move on to the interesting part where we decide how much structure is optimal

by robot-wrangler

2/28/2026 at 7:15:25 PM

I’ve gotten the absolute best results from LLMs just acting like the software engineer I’ve aspired to be the past 15 years.

Normal dev things. Scope the ticket properly, break it down. Test well. Write the correct docs.

LLM specific things are going to be gone next week

by Robdel12

2/28/2026 at 8:52:38 PM

I feel this very recently. I just pretend the agent is a junior dev and tell them to do things I don’t have time for. Reviewing the changes at my convenience is a lot like checking in on a junior dev, too. On the other hand I do feel like I get better results with the same teeing up that a junior dev requires, so I try to remove as many unknowns/dependencies as possible (or else explicitly tell it to leave some things as stubs) before sending him off to do something

by twodave

3/1/2026 at 2:30:22 AM

Junior devs who know a lot more maths than I do...

by actionfromafar

3/1/2026 at 9:35:56 PM

VSDD is the most rigorous development pipeline I've seen articulated for AI-native engineering. The purity boundary map in Phase 1b is particularly sharp — making verifiability an architectural constraint rather than an afterthought is exactly right.

But there's a boundary VSDD doesn't cross: the commit boundary into production runtime.

VSDD verifies that the code does what the spec says. It says nothing about whether the output that code generates — at runtime, from a live LLM — is admissible. A formally verified inference pipeline can still produce a clinical summary that omits a contraindication, or a financial disclosure that drifts outside regulatory bounds. The code is correct. The output is not.

The verification architecture ends at deployment. The governance problem begins there.

VSDD and runtime output enforcement aren't competing — they're sequential. You need both. But most teams treat deployment as the finish line when it's actually where the second problem starts.

by entrustai

2/28/2026 at 6:59:13 PM

LLM-assisted development feels a lot like trend-driven development. When dealing with technique and heterogenous prompts and goals, it’s easy to gain somewhat of a gambler’s fallacy with respect to a particular technique.

Spec-driven development feels pretty questionable to me. I’m sure it works fine for feature work that is predictable or has been done before, but then I wonder why you’d waste your time with it.

Prior to LLMs, the whole vibe was to iterate rapidly toward a working thing so you can see what works and what doesn’t. Why would we abandon that strategy as an industry when the cost of writing code is ostensibly getting cheaper?

If I’m using LLMs at all, I’m using them to do a breadth search of prior art or ideas, then I’m doing what I might call a prototype onion: successive clean room attempts at a novel problem, accumulating what I learn at each attempt in each successive prompt. I usually then take the prototype and write the final version myself so I’m properly internalizing the idea.

Ultimately a lot of this prompt work feels like procrastination. It is not about understanding where these tools is useful and where they are not but trying to have them consume every aspect of the work.

by SirensOfTitan

2/28/2026 at 7:11:55 PM

Or maybe people who like talking much more than they like code are now very excited about the possibility that talking has eaten software development.

This is exactly backwards. For many tasks, formal languages are better, more real, more beautiful than English. No matter how many millions of tokens you have, you will never talk the formulas of Fermat, Euler, and Gauss into irrelevance. And the same is true of good code.

Of course, a lot of code is ugly and utilitarian too, and maybe talking will make it less painful to write that stuff.

by getnormality

3/1/2026 at 11:13:11 AM

  > Or maybe people who like talking much more than they like code are now very excited about the possibility that talking has eaten software development.

This has been my experience. In my team, the most excited people appear to be those who spend most of their time in meetings.

by orphea

3/1/2026 at 2:05:44 PM

There was an early era when the AI revolution was led by us Java users who had to deal with boilerplate lol. Plus review 4000 lines of code a day, a lot of it boilerplate as well. This was also the era of "clean code" and we were looking at adapters connected to adapters for no reason besides DRY being a holy rule.

by muzani

2/28/2026 at 9:01:58 PM

> a lot of code is ugly and utilitarian too,

And as everyone who can abstract well knows: Ugly code that have staying power have a reason to be ugly. And the best will be annotated with HACK and NOTE comments. Anything else can be refactored and abstracted to a much better form. But that requires a sense of craftsmanship (to know better) and a time allowance for such activity.

by skydhash

3/1/2026 at 1:12:33 AM

People forget that the code is the spec. Usually it’s more effective to maintain comments.

Heck, given that LLMs are language driven, why not bring literate programming back? Knuth would be proud.

by stingraycharles

2/28/2026 at 8:18:00 PM

Short take: replace TDD with BDD, and might add DDD as a spice. Otherwise this is a fairly good article.

Why not TDD? Since a lot of developers use LLMs to create tests today, plus a lot of the training data contains information on how to do this. Making it something that it either can figure out to do by itself or that it will cheat. Both equally bad.

A somewhat controversial take is that you should simply avoid writing tests which the LLM can produce by itself, similar to how we in the last week removed the agents.md file.

by WestN

3/1/2026 at 5:53:20 AM

Modern testing frameworks already use BDD (behavior driven development , which simply means writing tests in natural language. BDD is a superset of TDD:

    import { describe, it, expect } from 'vitest';

    describe('User Authentication', () => {
      it('should allow login with valid credentials', () => {
        const result = login('user@example.com', 'password123');
          expect(result.isAuthenticated).toBe(true);
  });

This is BDD and every testing framework I've used in my entire career and 95% of the tests I've written look something like that.

by esperent

2/28/2026 at 9:22:23 PM

could you say more about removing agents.md?

by chrisweekly

2/28/2026 at 10:19:03 PM

Probably referencing this: https://news.ycombinator.com/item?id=47034087

by supermdguy

3/1/2026 at 3:18:45 AM

thanks btw, if you're right, that post's top comment seems to make a strong case for keeping agents.md

by chrisweekly

2/28/2026 at 7:41:30 PM

If I am not mistaken, the verification is problematic here. It's run too late.

A piece of code that satisfies a single test will most likely not be probable to adhere to the spec.

Worse, the whole spec can only be correctly implemented in total. You cannot work iteratively by satisfying one constraint after the other. The same holds for the test cases. That means that satisfying the last test or fulfilling the last constraint will take much more work than the first. The number of tests passed is not a good metric for completion of the implementation.

by choeger

3/1/2026 at 11:06:48 AM

Some of the ideas here are codified in a simpler way in my TDD framework, CodeLeash (https://codeleash.dev).

The key difference is CodeLeash puts guardrails outside the model, as Claude Code hooks. It can’t forget or skip steps, it gets forced through a process. There are bypasses but they raise alarms that the model is also forced to review.

Putting guardrails outside the model forces deterministic process following. With a sufficiently capable process (TDD is a powerful way to build software) you can really scale up coding agent usage.

by cadamsdotcom

2/28/2026 at 7:52:03 PM

I've been doing something less formal. I stumbled upon Riaan Zoetmulder's free course on deep learning and medical image analysis [1] and found his article on spec-driven development [2]. He adapts the V-Model by specifying three things upfront: requirements, system design and architecture. The rest gets generated. He mentioned a study where they show that LLM assistance slowed down experienced open source devs on large codebases. The model doesn't know the implicit context. And to me that's the thing! An LLM should have an index of some sort.

So I vibe coded my own static analysis program where I just track my own function calls. It outputs a call graph of all my self-defined functions and shows the name (and Python type hints) of what it is calling (excluding standard library function, also only self-defined stuff). Running that program and sending the diff from time to time seems to have helped a lot already.

[1] https://www.riaanzoetmulder.com/courses/deep-learning-medica...

[2] https://www.riaanzoetmulder.com/articles/ai-assisted-program...

by melvinroest

2/28/2026 at 10:36:47 PM

I am not following, can you give a concrete example of your workflow?

by xcubic

3/1/2026 at 12:56:02 AM

In my agent file I explain that I have a static analyzer which generates a callgraph. On starting the agent runs ~/.agent/tools/__callgraph__/generate_callgraph.py

It then gets to see callgraph.current.md and upon subsequent sessions callgraph.diff.md.

Here is an example of some output that I currently have in callgraph.current.md

  ## src/components/Header.tsx

  - **export Header({ ... }: Props)** (start 9, end 54) → `useAuth`

  ## src/components/HelpTooltip.tsx

  - **export HelpTooltip({ ... }: Props)** (start 15, end 42) → (none)

  ## src/components/ResultsTable.tsx

  - **getHeaderLabel(col: { id: string; columnDef: { header?: unknown } }): string** (start 45, end 51) → (none)
  - **getCellValue(colId: string, original: KeywordResult): string** (start 53, end 74) → (none)
  - **export ResultsTable({ ... }: Props)** (start 76, end 406) → `getCellValue`

  ## src/components/SettingsDrawer.tsx

  - **export SettingsDrawer({ ... }: Props)** (start 158, end 336) → (none)

For example:

ResultsTable calls getCellValue.

In these cases it's just one function but you also have stuff like

  **export Dashboard()** (start 36, end 635) → `getTopKeywords`, `normalizeText`, `searchKeywords`, `searchKeywordsMulti`, `searchSemantic`, `searchSemanticMulti`

For the Python version it also gives the parameters and types along with it. I think the next thing I'd need to do is give self-defined type definitions. Doing things this way allows an LLM to not read all that much but to be able to reason relatively well over what the code does. The caveat is that you abstracted your code well. If you didn't, the LLM doesn't know your implementation.

I probably should also add return types.

by melvinroest

2/28/2026 at 5:55:03 PM

Some random (hopefully additive and helpful) thoughts:

Many companies have older code bases / databases that can be somewhat well defined (and somewhat not). If things have been slowly iterating over 35 years, there's a lot of undocumented edge behavior that may occur; it may be beneficial to have a step before Edge Case Catalog where there's some kind of prompting to catalogue how the inputs and outputs work, and then find the different inputs and outputs - and then confirm that with Input A and Output A that it works as expected. (Legacy systems often have weird orchestration that nobody remembers.)

(Sub-note: This is somewhat part of the provable properties catalog; while this step could be placed there, it would require a re-run of edge case catalog build potentially, which isn't a bad thing.)

A small note that I personally think is a good idea is better code commenting than has been outlined here - the spec itself should be woven into the code with potentially slightly over-commenting for each aspect, code spec gets lost. The code itself should serve as context, especially in the TDD stage.

I think it's implicit but may be worth overtly stating that for the Code Quality check in Phase 3 that it also checks on a zero-trust basis, and doesn't include things like hardcoded keys.

I'm not sure what Chainlink is (sorry!) but I like the ideas outlined around the decomposition - but it misses stringing everything together end-to-end in the way outlined here (it asks to create each part, but never actually weaves the whole together).

Something not covered - is sequencing work and decomposition of work. A spec can create multiple dependencies within itself, requiring things to be worked on in a specific order.

by DaylitMagic

2/28/2026 at 7:01:56 PM

If you come up with a strategy that seems to "solve programming", then you know for certain there must be a flaw in it, and you need to identify where it is that corners must be cut and how.

Computer science is an introspective discipline because it studies the essential difficulty of problems regardless of the process taken to solve them, and programming itself (i.e. the problem of producing a correct, or correct-enough program) is such a problem that can be, and has been studied. The question of learning whether a program X satisfies some correctness property P is known as the model-checking problem, and we know that answering it with certainty is intractable. For example, some properties that are true for some program would take no less than 10 minutes to verify (regardless of how that verification is done), others will take no less than 10 hours, others no less than 10 months, others no less than 10 years and so on, and we don’t know ahead of time whether the proprty is true, and if it is, where on this spectrum it falls.

So suppose you decide some property must be proven with full certainty, the question becomes, how long do you wait before giving up waiting for the validation and what do you do when you give up? If you then decide that you’re okay with less than 100% confidence, what approach do you take and how much confidence do you actually have? The problem with that is that the answer to that question often requires a deep understanding of the implementation. I.e. if you have two programs, X and Y, that compute the same function, one less-than-perfect approach would give you 99% confidence with one of them, but only 10% confidence with another.

by pron

2/28/2026 at 7:08:13 PM

More, “If your LLM comes up with a strategy to ‘solve programming’ …”

by twoodfin

3/1/2026 at 3:45:19 AM

Doing something similar with https://github.com/mikeyobrien/ralph-orchestrator

It's much more than a simple Ralph loop.

by mobrienv

3/1/2026 at 4:14:26 PM

So we are cycling back to kind of Waterfall Development ? No more agility ? The trend is dead by AI ?

by hamdouni

3/1/2026 at 10:31:01 AM

I've implemented a version of this in rlm-workflow: https://skills.sh/doubleuuser/rlm-workflow/rlm-workflow

by dworks

2/28/2026 at 8:28:33 PM

In a perfect world I can see this happening. But with AI increasing or output I think the real bottleneck is work sponsors, business logic and requirements discovery/translating/sign off.

I am seeing more teams and features being rolled faster than before but then discovering that the sponsors (those requesting features and change) either don’t invest the time up front or with timely feedback loops and work stalls or has to be redone as business does not see the results until it’s either live or about to go live.

This has always been the case but I think AI tooling has moved the bottleneck

by daveac

3/1/2026 at 5:40:15 AM

- you most certainly also wanna read this one https://boristane.com/blog/the-software-development-lifecycl...

by vivzkestrel

2/28/2026 at 10:32:19 PM

Its an interesting direction if you see it under the umbrella of diminishing costs: You build a product once with vibe coding and a design/ product hat. Once you know what works you rebuild it 100% in a framework like this. You do this every time from scratch when the tech debt or the mismatch between architecture and needs are too big.

by jFriedensreich

3/1/2026 at 11:00:21 AM

You could also use the same framework always - that's what I'm doing anyway. But you gotta remember that no matter how well you spec it, first iteration of the specs is going to suck anyway.

But you vibe-code it anyway and see what happens. You'll start noticing obvious issues that you can track back to something in spec.

Then you throw away the entire thing (the entire project!) and start from scratch. Repeat until you have something you like.

Incremental specing doesn't work though. You need a clean room approach with only important learnings from previous iterations. Otherwise agent will never pick a hard but correct path.

by gck1

3/1/2026 at 3:06:47 AM

I do exactly this. The database schema won’t change as often.

by rsrsrs86

2/28/2026 at 9:24:05 PM

I expected formal verification to be part of this. That could not be fooled and is rock-solid, unless you cheat in your specification. Swap your AI "verifier" out for that and I'm on board.

by teiferer

2/28/2026 at 8:08:37 PM

You cannot escape from the human verifying the properties you want verified mechanically. This only gives you leverage in specific scenarios where specification is much simpler than the implementation.

by alpaylan

2/28/2026 at 6:52:14 PM

I would like to be enlightened myself if RPI,BMAD or any spec-driven approaches actually worked for any mid/large scale projects, without wasting millions of tokens of course :)

by vielite1310

3/1/2026 at 3:08:08 AM

Real spec driven uses no tokens

by rsrsrs86

2/28/2026 at 5:31:27 PM

Nice. It can work with something like https://github.com/github/spec-kit ?

by sjbr

2/28/2026 at 7:19:21 PM

My employer is trying to convince us to embrace spec-kit. But we are a Clojure shop: we iterate fast and produce results. We don't sit around and write specs and then hope working code plops out.

by beders

2/28/2026 at 7:51:06 PM

With AI tools, spec-driven development is the lowest latency option.

by adastra22

2/28/2026 at 10:21:15 PM

Is the posting a description of a real system, or just imagination? Is there a link to something that makes this real?

by Animats

3/1/2026 at 3:07:38 AM

It is very fluffy, but the idea of using formal methods works. It just never settles down… the workflow keeps improving and changing. You don’t need stable tools

by rsrsrs86

2/28/2026 at 10:28:19 PM

Imagination? More like hallucination - the AI-generated kind.

by zozbot234

3/1/2026 at 1:43:36 AM

software engineering is still software engineering.

just because you don't type out the characters doesn't mean you're not designing systems and thinking critically and leveraging your experience.

also: do we think this is written by ai? do we care anymore?

by dhorthy

3/1/2026 at 5:07:07 AM

I personally care deeply when content intended as communication is AI generated (much more so than if code is generated).

On the surface level, I find it a bit disrespectful when I'm communicating with someone who's just using an LLM to generate their responses. Imagine if you are talking to someone in-person, and they pull out a phone, generate a response then read it back out to you?

On a deeper level, if someone's generated a bunch of text and clearly hasn't devoted the time into generating/editing it that they're expecting me to invest while reading it, I'm just not going to read it.

by frez131

3/1/2026 at 4:43:31 AM

Yes it's very obviously written by AI and made me immediately close the tab. Not gonna read a self-promotional piece written by an LLM that someone probably only gave it one sentence prompt: "merge these ideas".

by tkel

2/28/2026 at 10:12:47 PM

No much different from what I did manually when employer outsourced development to India.

by FrankRay78

2/28/2026 at 8:19:17 PM

The gist is 100% AI written https://www.pangram.com/history/9d89ebba-cdba-40e1-b569-9ae1...

by jatins

3/1/2026 at 3:43:23 AM

We’re launching markdown now, aren’t we?

by ozozozd

2/28/2026 at 9:30:37 PM

Looks moderately interesting, but I refuse to upvote vibe-written submissions. Of course documents like this have their uses, but:

- They cannot easily be attributed to a human author, and therefore debates and discussions on the substance of the ideas tend not to get too far

- They (tend to) take well-established concepts, glue them together, and describe the result in reverential tones, regardless of the relative triviality of the solution. Hard to rule that out here tbh.

I am not saying any of that is what's happening here. What I am saying is that I'm not going to waste my time reading something I can't easily vet for quality.

Author (not OP) has written plenty of its* own words on Bluesky over the last few weeks. If it's written anything longhand about this stuff I'd be interested to read it. But for now "anti-slop bias" designed into the system has not reached the prose.

*respecting pronouns

by mpalmer

2/28/2026 at 7:18:14 PM

> Define the contract before writing a single line of implementation. Specs are the source of truth.

There is only one source of truth and that is the source code. To define and change contracts written in an ambiguous language and then hope the right code will magically appear, is completely delusional.

Iteration is the only game in town that is fast and produces results.

by beders

2/28/2026 at 9:04:16 PM

That’s how I do it, minus the marketing.

by rsrsrs86

2/28/2026 at 10:50:18 PM

Does anyone know what Chainlink is?

by johnnyAghands

2/28/2026 at 6:36:42 PM

Upvoted for the Sarcasmotron.

by mitchbob

2/28/2026 at 5:31:23 PM

This is a decent approach. My concern with TDD is that writing tests necessarily implies designing an API upon which those tests operate. Here, the agent is instructed to "not write code, write tests", and yet, in doing so it defines an API. This will cause the AI to hallucinate the API. Layering in yet more tests on top of this will cause that API to deform in strange ways that pass tests but that the adversary will not be able to cope with because it runs too late in the VSDD process.

I've seen this exact process play out in my own work. The AI generates code and tests that pass with high code coverage and honors invariants set by spec. I look at the code and find a rats nest / ball of mud that will cost 10x more tokens to enhance should I ever need to add a feature.

So, I think you're on to something, but I think the process might be discounting extensibility and resilience under change.

by politician

2/28/2026 at 5:43:10 PM

> My concern with TDD is that writing tests necessarily implies designing an API upon which those tests operate

It really forces you to do outside-in testing; I usually describe the kind of API I want when chatting with the agents. For example, the CLI options, the routes that might be useful for an API, etc.

> I look at the code and find a rats nest / ball of mud that will cost 10x more tokens to enhance should I ever need to add a feature.

Agreed, I don't know if there are good forcing functions to avoid complexity. The providers have a huge incentive to have you waste your tokens (for example when it re-outputs the complete file when you ask for a tiny change).

by NeutralForest

2/28/2026 at 8:38:07 PM

In "proper" TDD you you supposed go:

- write a test for method that does not exist, it just calls the method and nothing else

- write method that does nothing

- add/extend test that uses that method <-- this very loop starts

- modify method until tests passes

- go back to loop start until you're done

I always hated it. When I work with LLM i first massage interface that tests, then tests, then implementation until all these tests pass.

> for example when it re-outputs the complete file when you ask for a tiny change).

well with sonnet 3.5 and 4.5 (can't say about 4.6) it often will get stuck in a loop trying to update just the required parts and iether waste tons of tokens doing these updates or waste tons of tokens to a point where restring file from git is required. Tokens get wasted regardless.

by 0x457

2/28/2026 at 9:14:20 PM

I like tests, but I don't bother with TDD because it's so ceremonial. I design the API, or at least sketch it out (using a whiteboard or drafting some notes, and doing research). Then I iterate and refine. I only bother with tests once I can commit or when it's no longer viable to tests manually (edit-compile-run cycle). And a lot of time I follow the table pattern.

https://www.oreilly.com/library/view/simplicity/979888865170...

by skydhash

2/28/2026 at 6:16:33 PM

“Test Driven Design” is another way to frame it.

by bonesss

2/28/2026 at 5:51:21 PM

I see what you're saying. Modularity and interfaces are really important between the different aspects of what's being developed. And it is worth putting time into the question "if another will use this, what would they potentially use it for, and why?". It doesn't mean that it needs to be built now - but considering that and ensuring that the planned code executes against that is a good strategy.

by DaylitMagic

3/1/2026 at 3:40:25 AM

Does anyone have any solid patterns they can share around the “scenarios”/holdouts concept from the Dark Factory, where you create external system(s) to verify your main one?

by syndacks

3/1/2026 at 7:08:09 AM

So bdd essentially?

by xomiachuna

2/28/2026 at 6:21:13 PM

Claude or something different... there is life beyond Claude I assure you and it is quite good and colourful.

by desireco42

2/28/2026 at 6:50:47 PM

This is AI slop not worth my time. What would be interesting is if the author shared her practical experience in implementing it. Let's see some of those specs. What tricky bugs did it catch? The author's latest repo hasn't even been passing CI, so what does that say? https://github.com/dollspace-gay/Tesseract-Vault/commits/mai...

by esafak

2/28/2026 at 8:11:26 PM

The runners are failing because it moved them from github hosted to self hosted and its requiring fixes, but you would know that if you actually paid attention to the commits and werent just looking for cheap dunks. Have a good one.

by dollspace

2/28/2026 at 8:28:12 PM

I looked at the past few pages of your repo's history and half the time it is broken. Is this continued failure despite using a verified spec, or did you not use one? If not which repo should we look at instead? I am sorry but I see nothing from you to engage here.

by esafak

2/28/2026 at 8:39:34 PM

[dead]

by dollspace

2/28/2026 at 6:56:23 PM

yes, all of the typical signs and symptoms appear to be there. Lists upon sublists of verbose overengineered plausibly thoughtful writing.

If you can't be bothered to write it, why should i be bothered to read it?

by relativeadv

2/28/2026 at 6:08:27 PM

I think this word salad doesn’t have enough buzzwords. Throw in a few more acronyms too.

by galoisscobi