Formal Verification Gates for AI Coding Loops

5/20/2026 at 7:03:30 PM

I work in DevOps at a firm that has been very enthusiastic about using LLMs (in the good sense).

The phases were basically:

- try out having the LLM do "a lot"

- now even more

- now run multiple agents

- back to single agents but have the agents build tools

- tools that are deterministic AND usable by both the humans (EDIT: and the LLMs)

The reasons:

1. Deterministic tools (for both deployments and testing) get you a binary answer and it's repeatable

2. In the event of an outage, you can always fall back to the tool that a human can run

3. It's faster. A quick script can run in <30 seconds but "confabulating" always seemed to take 2-3 minutes.

Really, we are back to this article: https://spawn-queue.acm.org/doi/10.1145/3194653.3197520 aka "make a list of tasks, write scripts for each task, combine the scripts into functions, functions become a system"

by alexpotato

5/21/2026 at 4:57:43 AM

I've a bunch of technical, but non-engineering types around me, and a few of us engineers keep banging the gong on the fact that they can't trust the output of an LLM. That the best way to leverage AI is to get it to write the code in whatever language they prefer, so they get a simple and repeatable tool out the other side. In many regards it's a liberating tool when used like that. I've got TPMs that are really able to use it as a force multiplier for themselves, building small tools that help them, without having to tie up engineers to produce it.

In numerous cases, though, there are folks asking it to go interrogate some stuff they've set up MCPs for, and produce reports from it. If you do that it will give you a different answer every time, even from exactly the same input (because that's how LLMs work) and you just can't guarantee that any of them are accurate. It's a probabilistic layer, and the reports you need to generate need to be deterministic.

The problem is we're so accustomed to the deterministic nature of the large majority of the software we work with. The output is plausible, too, which only exasperates the problem. Folks just assume it's correct.

by Twirrim

5/20/2026 at 8:54:16 PM

Sounds like the bleeding edge.

LLMs are tools, but unreliable

They can magnify the reach of a person, but not replace them

Having LLMs write the tools is the correct approach for magnifying the reach of a Dev Ops programmer

by worik

5/20/2026 at 5:51:36 PM

These guard types are great and I've heavily used them in the past. But why codegen them?

E.g. the jwt auth example has some major problems since the verification rules aren't fully specified in the spec. The jwt-token verified rule only checks that the string isn't empty, but it doesn't actually verify that it is correctly parsed, non-expired, and signed by a trusted key. The authenticated-user rule doesn't check that the user-id actually came from the jwt. If you hand-wrote your constructor, you would ensure these things. Similarly, all the other constructors allow passing in whatever values you like instead of checking the connections of the real objects.

By calling the constructor for these types, you are making an assertion about the relationship of the parameter values. If AI is calling the constructor, then it's able to make it's own assertions and derive whatever result it wants. That seems backwards. AI should use the result of tenant-access to deduce that a user is a member of tenant, but if they can directly call `(tenant-access user-id tenant-id true)`, then they can "prove" tenant-access for anything. In the past, we have named the constructors for these types `TenantAccess.newUnverified`, and then heavily scrutinized all callers (typically just jwt-parsers and specific database lookups). You can then use `TenantAccess.{userId,tenantId}` without scrutiny elsewhere.

by singron

5/20/2026 at 6:52:00 PM

I think you're right on the substance. A production-grade spec (or guard type) needs stronger assertions than the toy example in the post — predicates for signature verification, claim-binding, and expiry-from-token, at minimum. The example is only illustrating the proof-chain shape, and isn't a good example of a full-fledged JWT validator.

Your underlying point, that calling the constructor is the assertion so AI passing `true` can "prove" whatever — is true of any smart-constructor pattern, including your own `newUnverified` approach. The trust still has to live somewhere. In your pattern it lives in the small set of audited callers; in shengen's case it lives in the same place — the wrappers (like `CheckTenantAccess`) that actually establish the premise via a DB query or a JWT parse. Structurally the two approaches are doing the same thing. To harden it, you'd keep the raw constructors package-private and export only the wrappers, so the handler code the LLM is writing physically cannot call NewTenantAccess(..., true) — only CheckTenantAccess.

On the deeper question about "why codegen": the short answer is "obviously, you don't have to." But if we assume that we're using AI to write at least some of the code, now you have to either (1) describe the constructor in very precise English and have the LLM generate it, (2) inject yourself into the loop closely with the LLM, or (3) not use an LLM for this part. My proposition is that writing the core invariants as proofs that can be deterministically checked for internal consistency and written declaratively is (1) more efficient, (2) less lossy, and (3) easier for the developer to read and reason about than writing the constructor from scratch. This puts a lot of trust in the codegen, as you point out; but as a practical matter, having a formal representation of what you want plus an English prompt is stronger context to the LLM anyway.

The other reason I started down this path, which I didn't get into in the post because I haven't figured out yet if it's truly practical, comes from a property specific to Shen: it has a very small kernel that has been ported into a lot of runtimes — Lisp, C, JS, Go, Python, Erlang, Scheme, Java, etc (https://shen-language.github.io/#downloads). That opens up the possibility of writing specs whose predicates run as runtime gates from the same Shen expression, no translation step — and even mixing compile-time and runtime assertions into the same spec. I find this very interesting conceptually, but I'm not sure yet whether it's practically useful for anything.

by pyrex41

5/20/2026 at 7:38:36 PM

This is really cool, but why wouldn't you just use a more richly typed target language and skip this process? You could use Liquid Haskell (for refinement types) or Lean (for full dependent types) and be able to put these invariants directly in your program rather then in a sidecar.

by solomonb

5/20/2026 at 7:51:36 PM

If you are the kind of person that immediately reaches for this solution -- then I agree, yes you should. You could even do it in Shen! (https://news.ycombinator.com/item?id=39602472, https://news.ycombinator.com/item?id=9297665)

But, for everyone else: even if you skipped the sidecar entirely, didn't use the codegen, you just had the AI interpret the spec'd application into a short Shen proof, iterate until it's internally consistent / compiles...now you have a spec that is internally consistent, straightforward for human to understand, and much stronger context for the LLM than English language spec alone.

by pyrex41

5/21/2026 at 2:07:10 AM

Not the author, but I've been doing this kind of thing with Lean. I'm still trying to figure out how to make this workflow play nicely with other systems. I have a bunch of rust code that I want these kinds of guarantees on, and that code has third party dependencies that would be terrible to give up. It's totally unclear right now how to get the best of both worlds.

by ianhorn

5/20/2026 at 7:54:59 PM

I would add, re: Shen -- it's sequent calculus and unique type system (type system itself is Turing complete) give you a lot of flexibility here.

by pyrex41

5/20/2026 at 8:24:51 PM

Shen is one of those projects i've always wanted to dig into a bit but have never found the time sadly.

by solomonb

5/20/2026 at 10:42:45 PM

My naive thoughts on Formal Verification (FV) and LLM:

We could leverage existing FV tools for a given programming language by using an LLM to generate a translator that maps the language to the FV tool's input format. This would essentially require a finite number of "abstract interpretation" functions—one for each language construct. While the total number of constructs might be large (e.g., around 500), each function would be independent. A function would only need to reason about the abstract semantics of a single construct, assuming the others adhere to their respective semantics. We could then distribute these LLM-generated functions among a group of experts (e.g., 100 reviewers). Thanks to the modularity of the functions, reviewers could evaluate their assigned subset in parallel without bottlenecks. The end result would be a working FV tool for the target language.

by MetaverseClub

5/20/2026 at 7:03:07 PM

This matches my experience running coding agents daily: the agent is reliable at producing the shape of a thing and unreliable at holding an invariant across a long loop. Moving the rule into a type the compiler won't violate works because it relocates the check to the one place the loop can't quietly skip.

But singron's JWT point is the real limit. Backpressure doesn't remove the judgment, it moves it. The type still has to be written by someone who understands the actual invariant, and a guard type that compiles while only checking "string is non-empty" gives you the feeling of a gate with none of the guarantee. The compiler enforces what you encoded, not what you meant.

So this reads less like formal verification and more like forcing the human judgment to land up front, in the type definitions, instead of hoping it survives in a prompt. That's still a real win: a constraint in a type is reviewable and permanent, while a prompt rule decays the moment the context window moves past it. Worth being honest, though, that the hard part doesn't go away. It just changes address.

by max_unbearable

5/21/2026 at 2:50:02 AM

Why even post this?

by henrymerrilees

5/21/2026 at 12:02:35 AM

I've had similar thoughts lately while using agents for coding. It seems like if I give the agents external tools and sources of truth they perform really well. In my case it was a lot of standards, extensive use of Rust language features (e.g. traits), formal methods tools, large-scale architecture and specification documents (accessed via a special tool I also wrote), and using the beads_rust tool for tracking session state and giving the agent a list of issues to work. I hadn't thought of this as 'backpressure' but I really like that framing. It gives the agents guardrails and context and external validity that it can't make up.

by jcurbo

5/20/2026 at 5:08:34 PM

I think it's all about keeping state in the determinant space. I've come across the same issue, the key was to not rely on LLM performing workflow - the runtime needs to enforce.

by appstorelottery

5/20/2026 at 9:57:17 PM

What I've learned so far is that the tooling mentioned in this article can have dual purpose. Sure you can use it to gate the agent but you can also use them as tools for the agents to understand what they are working with at a more abstract level and with less tokens required.

In a way, you push some of the reasoning into deterministic tooling which is great both for reliability and performance.

At least on what I've been working on where I ended up creating type systems over SQL to solve some of the annoying issues I was having with agents reasoning over complex data infrastructure.

by cpard

5/20/2026 at 5:33:57 PM

So, capabilities/type systems. Building code architecture guardrails steep enough the AI won't jump the fence/take shortcuts.

by eximius

5/21/2026 at 1:32:05 AM

I've been experimenting with this a lot lately in Lean because it's equally capable as a theorem prover and as a programming language. It's resolving a lot of the frustration I feel with LLM coding.

You write a type signature for a function that amounts to "take a Foo x and return a Bar y with a proof of does_what_i_wanted(x,y)." Voila, no more agents doing something else because it won't compile if they don't do what I wanted.

It's great to build faster without the frustration of having no confidence in what I build. But it sure makes the gap between toys in Lean and using this in a Real Project in some other language that much more frustrating.

by ianhorn

5/20/2026 at 11:51:17 PM

Great article. Backpressure is a rate control signal, not a correctness feedback control signal. Backpressure resists the flow of something down a pipe, what you need is an error control signal that tells upstream what error downstream is detecting (your FV oracle).

by genxy

5/20/2026 at 9:05:05 PM

One question I have here: I think this type of thing would be trivial to do in Rust with constructors, private fields, and newtypes. What am I getting on top of it?

by vrm

5/20/2026 at 9:28:25 PM

For a single-language Rust project with a handful of invariants, not much. Rust's newtypes + private fields + Result-returning constructors are exactly the right primitives, and they're strictly stronger than Go's (no reflection escape hatch, no zero values to forge, exhaustive matching).

Where shengen might be more interesting:

1. Multi-language emit. If your invariants live in a Rust service AND a TypeScript frontend (or another backend), one Shen spec drives both and the build catches drift. Hand-rolled means writing the same smart constructor twice in two languages with no mechanism to keep them in lockstep.

2. Spec as the audit surface. The Shen rule for `tenant-access` is going to be much more expressive / concise than the rust implementation. The `tcb-audit` gate fails the build if generated code is hand-edited away from the spec, so reviewers read the spec and the build polices the implementation against it. With hand-rolled Rust you're reviewing the impl directly — which isn't strictly bad, but it does potentially insert the human back into the loop between LLM iterations.

3. AI in the loop. If you're hand-writing the constructors carefully yourself, the argument for shengen is fairly weak. If an LLM is writing them, "declarative spec + codegen" is, in my experience, a stronger prompt than "describe the constructor in precise English." That's the frame the post is really written for; the Rust newtype point is well-taken outside it.

I'm not saying everyone writing software (w/ LLMs or not) needs shengen / shen-backpressure. The underlying principles it's trying to help you use aren't new at all. But, if you want to get more out of LLMs (esp. multi-turn loops), you probably need something that is deterministic and structured to provide that backpressure context you want to the LLM as it iterates.

by pyrex41

5/20/2026 at 3:27:48 PM

Author here. The TL;DR: move rules from prompts into types the compiler refuses to violate, then bounce the AI coding loop off those refusals. The repo is github.com/pyrex41/Shen-Backpressure. Builds a lot on Geoff Huntley's backpressure idea -- none of this is rocket science, just an effort to apply sound programming principles in a world of LLM coding agents.

by pyrex41

5/20/2026 at 10:22:24 PM

This is great but keep in mind that Go allows the programmer skip these invariants in various ways.

I wish Go had a serious type system. Never mind algebraic types, but one that fucking respected private values and did things like validating enum values.

by elfly

5/20/2026 at 7:35:46 PM

TBH something like this sounds useful even without LLMs (although I haven't fully grokked this yet). The problem with the operational level is that you can't express the invariants etc at the type level - not least because you're working across multiple languages - so the kind of dumb issues that we're beginning to rule out at the level of the language at the process level still require lots of diligence in operational code. Some kind of shared "operational type system" that could be integrated into relevant languages would potentially help a lot.

by ajb

5/20/2026 at 8:06:34 PM

Shen has some really unique properties that are under-developed here. It's type system itself is Turing complete and very flexible / expressive. Also, the Shen kernel is extremely compact, and easy to port into a wide variety of runtime languages (C, Lisp, Ruby, Python, JS, Go, etc https://shen-language.github.io/#downloads). What I discussed about using it as a compile-time gate + codegen is just scratching the surface, I think.

Now, a lot of the ports haven't been maintained. But the underlying Shen kernel is only 4-5k lines of code...remains extremely portable. More discussion here https://news.ycombinator.com/item?id=39602472

I didn't focus a ton on Shen in the blog post, because the underlying principles aren't really about the implementation. Shen is very cool tho.

by pyrex41

5/20/2026 at 8:54:31 PM

> It's type system itself is Turing complete

That's not a good thing! A Turing complete type system means that compilation is potentially undecidable and non-terminating. The whole unintuitive mess in dependently-typed systems about "definitional" equality (loosely speaking, a notion of equalities that are 'trivially' checked as part of evaluation) is entirely about avoiding Turing-complete type checking!

by zozbot234

5/20/2026 at 9:04:46 PM

I mean yes, that's a risk, and you are correct. In practice, is your spec about the shape of the app you want to build really going to be that complicated?

But I mentioned its Turing completeness as a lazy shorthand to illustrate that it is far more flexible and expressive than what people think of as a "type system". https://shenlanguage.org/OSM/Recursive.html

by pyrex41

5/20/2026 at 5:25:54 PM

Thank you, interesting work. Please, clarify what is possibly a naive question - your README states that the constraints imposed by your tool are weaker than the formal verification guarantees. Why not implement the backpressure as the full formal verification barrier? Too complex to implement?

by Mikhail_K

5/20/2026 at 6:16:16 PM

The distinction worth keeping clean is between the spec (here, written as proofs in Shen) being formally rigorous and the entire codebase being formally verified. Shen-Backpressure does the first: the spec is a sequent-calculus statement of invariants, and shengen lowers it into guard types the target compiler refuses to violate, so within the target language's type discipline you cannot construct a tenant-access (or any other witness) without discharging its premises.

It does not do the second (formally verify the entire codebase). Outside the guard types your Go or TypeScript is just code. It can panic, race, have bugs in unrelated logic, use reflection to forge values inside the guard package, get a wrong answer from the SQL query that fed the predicate, and so on. The proof ends at the projection boundary.

Why not go further? Not really "too complex to implement," in theory; it has been done before. But verifying the whole program is much higher engineering cost, and the trades-offs to do it make sense in a much narrower set of cases than what I'm trying to target: teams shipping production code with AI in the loop, in the language they already ship.

The pragmatic choice is to spend the verification budget on the small set of invariants that genuinely matter and leave the rest as ordinary code with ordinary review and tests. That is why the claim is phrased as "practically impossible to accidentally bypass, not categorically impossible to bypass." Over-claiming "verified" when the host language is unverified would be misleading.

by pyrex41

5/21/2026 at 2:03:52 AM

Hi, thanks for the writeup, I wonder for the auth problem what you think about rego and OPA type solutions and their place in world in comparison to generated guards?

by se4u

5/21/2026 at 7:29:29 PM

Definitely connected; OPA is itself a structural gate, but at runtime. The post focused on compile-time gates, but there's no reason a structural gate can't run at runtime — which means they compose rather than compete.

I didn't get into this in the post, but Shen is extremely portable and has been ported to a lot of different target runtimes (Go, C, Lisp, JS, many others — https://shen-language.github.io/#downloads). But, OPA offers extremely fast runtime execution in a way that would be more difficult to get to in Shen. What the compile-time guard adds is that it can make the runtime invocation non-skippable — so you could have a compile-time assertion that the code calls the runtime assertion, with OPA sitting behind the constructor. The catch is that if you still want all your invariants in Shen but use OPA for the runtime layer, that's another translation layer to keep in sync (Shen → Rego alongside Shen → guards). The alternative is to lean on Shen's portability: you could run the same spec at runtime with no translation layer at all, trading OPA's speed for that simplicity. Either way they're similar concepts run at different times. Integrating both into one high-level spec is mostly a question of which of those tradeoffs you want.

by pyrex41

5/20/2026 at 5:11:50 PM

[dead]

by alasano

5/20/2026 at 7:01:28 PM

[flagged]

by KaiShips

5/20/2026 at 9:53:27 PM

[flagged]

by donbventures

5/21/2026 at 11:20:48 AM

[flagged]

by levelhabits

5/21/2026 at 1:10:47 AM

[flagged]

by zane_shu